syszbot reports a scenario where we recurse on the completion lock
when flushing an overflow:
1 lock held by syz-executor287/6816:
#0: ffff888093cdb4d8 (&ctx->completion_lock){....}-{2:2}, at: io_cqring_overflow_flush+0xc6/0xab0 fs/io_uring.c:1333
stack backtrace:
CPU: 1 PID: 6816 Comm: syz-executor287 Not tainted 5.8.0-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x1f0/0x31e lib/dump_stack.c:118
print_deadlock_bug kernel/locking/lockdep.c:2391 [inline]
check_deadlock kernel/locking/lockdep.c:2432 [inline]
validate_chain+0x69a4/0x88a0 kernel/locking/lockdep.c:3202
__lock_acquire+0x1161/0x2ab0 kernel/locking/lockdep.c:4426
lock_acquire+0x160/0x730 kernel/locking/lockdep.c:5005
__raw_spin_lock_irq include/linux/spinlock_api_smp.h:128 [inline]
_raw_spin_lock_irq+0x67/0x80 kernel/locking/spinlock.c:167
spin_lock_irq include/linux/spinlock.h:379 [inline]
io_queue_linked_timeout fs/io_uring.c:5928 [inline]
__io_queue_async_work fs/io_uring.c:1192 [inline]
__io_queue_deferred+0x36a/0x790 fs/io_uring.c:1237
io_cqring_overflow_flush+0x774/0xab0 fs/io_uring.c:1359
io_ring_ctx_wait_and_kill+0x2a1/0x570 fs/io_uring.c:7808
io_uring_release+0x59/0x70 fs/io_uring.c:7829
__fput+0x34f/0x7b0 fs/file_table.c:281
task_work_run+0x137/0x1c0 kernel/task_work.c:135
exit_task_work include/linux/task_work.h:25 [inline]
do_exit+0x5f3/0x1f20 kernel/exit.c:806
do_group_exit+0x161/0x2d0 kernel/exit.c:903
__do_sys_exit_group+0x13/0x20 kernel/exit.c:914
__se_sys_exit_group+0x10/0x10 kernel/exit.c:912
__x64_sys_exit_group+0x37/0x40 kernel/exit.c:912
do_syscall_64+0x31/0x70 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Fix this by passing back the link from __io_queue_async_work(), and
then let the caller handle the queueing of the link. Take care to also
punt the submission reference put to the caller, as we're holding the
completion lock for the __io_queue_defer() case. Hence we need to mark
the io_kiocb appropriately for that case.
Reported-by: syzbot+996f91b6ec3812c48042@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
An earlier commit:
b7db41c9e0 ("io_uring: fix regression with always ignoring signals in io_cqring_wait()")
ensured that we didn't get stuck waiting for eventfd reads when it's
registered with the io_uring ring for event notification, but we still
have cases where the task can be waiting on other events in the kernel and
need a bigger nudge to make forward progress. Or the task could be in the
kernel and running, but on its way to blocking.
This means that TWA_RESUME cannot reliably be used to ensure we make
progress. Use TWA_SIGNAL unconditionally.
Cc: stable@vger.kernel.org # v5.7+
Reported-by: Josef <josef.grieb@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
[BUG]
Unmounting a btrfs filesystem with quota disabled will cause the
following NULL pointer dereference:
BTRFS info (device dm-5): has skinny extents
BUG: kernel NULL pointer dereference, address: 0000000000000018
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
CPU: 7 PID: 637 Comm: umount Not tainted 5.8.0-rc7-next-20200731-custom #76
RIP: 0010:kobject_del+0x6/0x20
Call Trace:
btrfs_sysfs_del_qgroups+0xac/0xf0 [btrfs]
btrfs_free_qgroup_config+0x63/0x70 [btrfs]
close_ctree+0x1f5/0x323 [btrfs]
btrfs_put_super+0x15/0x17 [btrfs]
generic_shutdown_super+0x72/0x110
kill_anon_super+0x18/0x30
btrfs_kill_super+0x17/0x30 [btrfs]
deactivate_locked_super+0x3b/0xa0
deactivate_super+0x40/0x50
cleanup_mnt+0x135/0x190
__cleanup_mnt+0x12/0x20
task_work_run+0x64/0xb0
exit_to_user_mode_prepare+0x18a/0x190
syscall_exit_to_user_mode+0x4f/0x270
do_syscall_64+0x45/0x50
entry_SYSCALL_64_after_hwframe+0x44/0xa9
---[ end trace 37b7adca5c1d5c5d ]---
[CAUSE]
Commit 079ad2fb4b ("kobject: Avoid premature parent object freeing in
kobject_cleanup()") changed kobject_del() that it no longer accepts NULL
pointer.
Before that commit, kobject_del() and kobject_put() all accept NULL
pointers and just ignore such NULL pointers.
But that mentioned commit needs to access the parent node, killing the
old NULL pointer behavior.
Unfortunately btrfs is relying on that hidden feature thus we will
trigger such NULL pointer dereference.
[FIX]
Instead of just saving several lines, do proper fs_info->qgroups_kobj
check before calling kobject_del() and kobject_put().
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The `if (!ret)` check will always be false and it may result in
ret->path being dereferenced while it is a NULL pointer.
Fixes: a37f232b7b ("btrfs: backref: introduce the skeleton of btrfs_backref_iter")
CC: stable@vger.kernel.org # 5.8+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boleyn Su <boleynsu@google.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Convert the uses of fallthrough comments to fallthrough macro.
Signed-off-by: Hongxiang Lou <louhongxiang@huawei.com>
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
There's some inconsistency around SB_I_VERSION handling with mount and
remount. Since we don't really want it to be off ever just work around
this by making sure we don't get the flag cleared on remount.
There's a tiny cpu cost of setting the bit, otherwise all changes to
i_version also change some of the times (ctime/mtime) so the inode needs
to be synced. We wouldn't save anything by disabling it.
Reported-by: Eric Sandeen <sandeen@redhat.com>
CC: stable@vger.kernel.org # 5.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add perf impact analysis ]
Signed-off-by: David Sterba <dsterba@suse.com>
While logging an inode, at copy_items(), if we fail to lookup the checksums
for an extent we release the destination path, free the ins_data array and
then return immediately. However a previous iteration of the for loop may
have added checksums to the ordered_sums list, in which case we leak the
memory used by them.
So fix this by making sure we iterate the ordered_sums list and free all
its checksums before returning.
Fixes: 3650860b90 ("Btrfs: remove almost all of the BUG()'s from tree-log.c")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Chris Murphy reported a problem where rpm ostree will bind mount a bunch
of things for whatever voodoo it's doing. But when it does this
/proc/mounts shows something like
/dev/sda /mnt/test btrfs rw,relatime,subvolid=256,subvol=/foo 0 0
/dev/sda /mnt/test/baz btrfs rw,relatime,subvolid=256,subvol=/foo/bar 0 0
Despite subvolid=256 being subvol=/foo. This is because we're just
spitting out the dentry of the mount point, which in the case of bind
mounts is the source path for the mountpoint. Instead we should spit
out the path to the actual subvol. Fix this by looking up the name for
the subvolid we have mounted. With this fix the same test looks like
this
/dev/sda /mnt/test btrfs rw,relatime,subvolid=256,subvol=/foo 0 0
/dev/sda /mnt/test/baz btrfs rw,relatime,subvolid=256,subvol=/foo 0 0
Reported-by: Chris Murphy <chris@colorremedies.com>
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Reported by Forza on IRC that remounting with compression options does
not reflect the change in level, or at least it does not appear to do so
according to the messages:
mount -o compress=zstd:1 /dev/sda /mnt
mount -o remount,compress=zstd:15 /mnt
does not print the change to the level to syslog:
[ 41.366060] BTRFS info (device vda): use zstd compression, level 1
[ 41.368254] BTRFS info (device vda): disk space caching is enabled
[ 41.390429] BTRFS info (device vda): disk space caching is enabled
What really happens is that the message is lost but the level is actualy
changed.
There's another weird output, if compression is reset to 'no':
[ 45.413776] BTRFS info (device vda): use no compression, level 4
To fix that, save the previous compression level and print the message
in that case too and use separate message for 'no' compression.
CC: stable@vger.kernel.org # 4.19+
Signed-off-by: David Sterba <dsterba@suse.com>
In try_to_merge_free_space we attempt to find entries to the left and
right of the entry we are adding to see if they can be merged. We
search for an entry past our current info (saved into right_info), and
then if right_info exists and it has a rb_prev() we save the rb_prev()
into left_info.
However there's a slight problem in the case that we have a right_info,
but no entry previous to that entry. At that point we will search for
an entry just before the info we're attempting to insert. This will
simply find right_info again, and assign it to left_info, making them
both the same pointer.
Now if right_info _can_ be merged with the range we're inserting, we'll
add it to the info and free right_info. However further down we'll
access left_info, which was right_info, and thus get a use-after-free.
Fix this by only searching for the left entry if we don't find a right
entry at all.
The CVE referenced had a specially crafted file system that could
trigger this use-after-free. However with the tree checker improvements
we no longer trigger the conditions for the UAF. But the original
conditions still apply, hence this fix.
Reference: CVE-2019-19448
Fixes: 9630308170 ("Btrfs: use hybrid extents+bitmap rb tree for free space")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
There is a bug report of NULL pointer dereference caused in
compress_file_extent():
Oops: Kernel access of bad area, sig: 11 [#1]
LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
Workqueue: btrfs-delalloc btrfs_delalloc_helper [btrfs]
NIP [c008000006dd4d34] compress_file_range.constprop.41+0x75c/0x8a0 [btrfs]
LR [c008000006dd4d1c] compress_file_range.constprop.41+0x744/0x8a0 [btrfs]
Call Trace:
[c000000c69093b00] [c008000006dd4d1c] compress_file_range.constprop.41+0x744/0x8a0 [btrfs] (unreliable)
[c000000c69093bd0] [c008000006dd4ebc] async_cow_start+0x44/0xa0 [btrfs]
[c000000c69093c10] [c008000006e14824] normal_work_helper+0xdc/0x598 [btrfs]
[c000000c69093c80] [c0000000001608c0] process_one_work+0x2c0/0x5b0
[c000000c69093d10] [c000000000160c38] worker_thread+0x88/0x660
[c000000c69093db0] [c00000000016b55c] kthread+0x1ac/0x1c0
[c000000c69093e20] [c00000000000b660] ret_from_kernel_thread+0x5c/0x7c
---[ end trace f16954aa20d822f6 ]---
[CAUSE]
For the following execution route of compress_file_range(), it's
possible to hit NULL pointer dereference:
compress_file_extent()
|- pages = NULL;
|- start = async_chunk->start = 0;
|- end = async_chunk = 4095;
|- nr_pages = 1;
|- inode_need_compress() == false; <<< Possible, see later explanation
| Now, we have nr_pages = 1, pages = NULL
|- cont:
|- ret = cow_file_range_inline();
|- if (ret <= 0) {
|- for (i = 0; i < nr_pages; i++) {
|- WARN_ON(pages[i]->mapping); <<< Crash
To enter above call execution branch, we need the following race:
Thread 1 (chattr) | Thread 2 (writeback)
--------------------------+------------------------------
| btrfs_run_delalloc_range
| |- inode_need_compress = true
| |- cow_file_range_async()
btrfs_ioctl_set_flag() |
|- binode_flags |= |
BTRFS_INODE_NOCOMPRESS |
| compress_file_range()
| |- inode_need_compress = false
| |- nr_page = 1 while pages = NULL
| | Then hit the crash
[FIX]
This patch will fix it by checking @pages before doing accessing it.
This patch is only designed as a hot fix and easy to backport.
More elegant fix may make btrfs only check inode_need_compress() once to
avoid such race, but that would be another story.
Reported-by: Luciano Chavez <chavez@us.ibm.com>
Fixes: 4d3a800ebb ("btrfs: merge nr_pages input and output parameter in compress_pages")
CC: stable@vger.kernel.org # 4.14.x: cecc8d9038: btrfs: Move free_pages_out label in inline extent handling branch in compress_file_range
CC: stable@vger.kernel.org # 4.14+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
- Support for user extended attributes on NFS (RFC 8276)
- Further reduce unnecessary NFSv4 delegation recalls
Notable fixes:
- Fix recent krb5p regression
- Address a few resource leaks and a rare NULL dereference
Other:
- De-duplicate RPC/RDMA error handling and other utility functions
- Replace storage and display of kernel memory addresses by tracepoints
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAl8oBt0ACgkQM2qzM29m
f5dTFQ/9H72E6gr1onsia0/Py0CO8F9qzLgmUBl1vVYAh2/vPqUL1ypxrC5OYrAy
TOqESTsJvmGluCFc/77XUTD7NvJY3znIWim49okwDiyee4Y14ZfRhhCxyyA6Z94E
FjJQb5TbF1Mti4X3dN8Gn7O1Y/BfTjDAAXnXGlTA1xoLcxM5idWIj+G8x0bPmeDb
2fTbgsoETu6MpS2/L6mraXVh3d5ESOJH+73YvpBl0AhYPzlNASJZMLtHtd+A/JbO
IPkMP/7UA5DuJtWGeuQ4I4D5bQNpNWMfN6zhwtih4IV5bkRC7vyAOLG1R7w9+Ufq
58cxPiorMcsg1cHnXG0Z6WVtbMEdWTP/FzmJdE5RC7DEJhmmSUG/R0OmgDcsDZET
GovPARho01yp80GwTjCIctDHRRFRL4pdPfr8PjVHetSnx9+zoRUT+D70Zeg/KSy2
99gmCxqSY9BZeHoiVPEX/HbhXrkuDjUSshwl98OAzOFmv6kbwtLntgFbWlBdE6dB
mqOxBb73zEoZ5P9GA2l2ShU3GbzMzDebHBb9EyomXHZrLejoXeUNA28VJ+8vPP5S
IVHnEwOkdJrNe/7cH4jd/B0NR6f8Da/F9kmkLiG2GNPMqQ8bnVhxTUtZkcAE+fd4
f34qLxsoht70wSSfISjBs7hP5KxEM1lOAf0w0RpycPUKJNV1FB0=
=OEpF
-----END PGP SIGNATURE-----
Merge tag 'nfsd-5.9' of git://git.linux-nfs.org/projects/cel/cel-2.6
Pull NFS server updates from Chuck Lever:
"Highlights:
- Support for user extended attributes on NFS (RFC 8276)
- Further reduce unnecessary NFSv4 delegation recalls
Notable fixes:
- Fix recent krb5p regression
- Address a few resource leaks and a rare NULL dereference
Other:
- De-duplicate RPC/RDMA error handling and other utility functions
- Replace storage and display of kernel memory addresses by tracepoints"
* tag 'nfsd-5.9' of git://git.linux-nfs.org/projects/cel/cel-2.6: (38 commits)
svcrdma: CM event handler clean up
svcrdma: Remove transport reference counting
svcrdma: Fix another Receive buffer leak
SUNRPC: Refresh the show_rqstp_flags() macro
nfsd: netns.h: delete a duplicated word
SUNRPC: Fix ("SUNRPC: Add "@len" parameter to gss_unwrap()")
nfsd: avoid a NULL dereference in __cld_pipe_upcall()
nfsd4: a client's own opens needn't prevent delegations
nfsd: Use seq_putc() in two functions
svcrdma: Display chunk completion ID when posting a rw_ctxt
svcrdma: Record send_ctxt completion ID in trace_svcrdma_post_send()
svcrdma: Introduce Send completion IDs
svcrdma: Record Receive completion ID in svc_rdma_decode_rqst
svcrdma: Introduce Receive completion IDs
svcrdma: Introduce infrastructure to support completion IDs
svcrdma: Add common XDR encoders for RDMA and Read segments
svcrdma: Add common XDR decoders for RDMA and Read segments
SUNRPC: Add helpers for decoding list discriminators symbolically
svcrdma: Remove declarations for functions long removed
svcrdma: Clean up trace_svcrdma_send_failed() tracepoint
...
Pull misc vfs updates from Al Viro:
"No common topic whatsoever in those, sorry"
* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fs: define inode flags using bit numbers
iov_iter: Move unnecessary inclusion of crypto/hash.h
dlmfs: clean up dlmfs_file_{read,write}() a bit
Pull mount leak fix from Al Viro:
"Regression fix for the syscalls-for-init series - fix a leak of a 'struct path'"
* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fs: fix a struct path leak in path_umount
Make sure we also put the dentry and vfsmnt in the illegal flags
and !may_umount cases.
Fixes: 41525f56e2 ("fs: refactor ksys_umount")
Reported-by: Vikas Kumar <vikas.kumar2@arm.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull fdpick coredump update from Al Viro:
"Switches fdpic coredumps away from original aout dumping primitives to
the same kind of regset use as regular elf coredumps do"
* 'work.fdpic' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
[elf-fdpic] switch coredump to regsets
[elf-fdpic] use elf_dump_thread_status() for the dumper thread as well
[elf-fdpic] move allocation of elf_thread_status into elf_dump_thread_status()
[elf-fdpic] coredump: don't bother with cyclic list for per-thread objects
kill elf_fpxregs_t
take fdpic-related parts of elf_prstatus out
unexport linux/elfcore.h
ext4_search_dir() and ext4_generic_delete_entry() can be called both for
standard director blocks and for inline directories stored inside inode
or inline xattr space. For the second case we didn't call
ext4_check_dir_entry() with proper constraints that could result in
accepting corrupted directory entry as well as false positive filesystem
errors like:
EXT4-fs error (device dm-0): ext4_search_dir:1395: inode #28320400:
block 113246792: comm dockerd: bad entry in directory: directory entry too
close to block end - offset=0, inode=28320403, rec_len=32, name_len=8,
size=4096
Fix the arguments passed to ext4_check_dir_entry().
Fixes: 109ba779d6 ("ext4: check for directory entries too close to block end")
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20200731162135.8080-1-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
If a device is hot-removed --- for example, when a physical device is
unplugged from pcie slot or a nbd device's network is shutdown ---
this can result in a BUG_ON() crash in submit_bh_wbc(). This is
because the when the block device dies, the buffer heads will have
their Buffer_Mapped flag get cleared, leading to the crash in
submit_bh_wbc.
We had attempted to work around this problem in commit a17712c8
("ext4: check superblock mapped prior to committing"). Unfortunately,
it's still possible to hit the BUG_ON(!buffer_mapped(bh)) if the
device dies between when the work-around check in ext4_commit_super()
and when submit_bh_wbh() is finally called:
Code path:
ext4_commit_super
judge if 'buffer_mapped(sbh)' is false, return <== commit a17712c8
lock_buffer(sbh)
...
unlock_buffer(sbh)
__sync_dirty_buffer(sbh,...
lock_buffer(sbh)
judge if 'buffer_mapped(sbh))' is false, return <== added by this patch
submit_bh(...,sbh)
submit_bh_wbc(...,sbh,...)
[100722.966497] kernel BUG at fs/buffer.c:3095! <== BUG_ON(!buffer_mapped(bh))' in submit_bh_wbc()
[100722.966503] invalid opcode: 0000 [#1] SMP
[100722.966566] task: ffff8817e15a9e40 task.stack: ffffc90024744000
[100722.966574] RIP: 0010:submit_bh_wbc+0x180/0x190
[100722.966575] RSP: 0018:ffffc90024747a90 EFLAGS: 00010246
[100722.966576] RAX: 0000000000620005 RBX: ffff8818a80603a8 RCX: 0000000000000000
[100722.966576] RDX: ffff8818a80603a8 RSI: 0000000000020800 RDI: 0000000000000001
[100722.966577] RBP: ffffc90024747ac0 R08: 0000000000000000 R09: ffff88207f94170d
[100722.966578] R10: 00000000000437c8 R11: 0000000000000001 R12: 0000000000020800
[100722.966578] R13: 0000000000000001 R14: 000000000bf9a438 R15: ffff88195f333000
[100722.966580] FS: 00007fa2eee27700(0000) GS:ffff88203d840000(0000) knlGS:0000000000000000
[100722.966580] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[100722.966581] CR2: 0000000000f0b008 CR3: 000000201a622003 CR4: 00000000007606e0
[100722.966582] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[100722.966583] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[100722.966583] PKRU: 55555554
[100722.966583] Call Trace:
[100722.966588] __sync_dirty_buffer+0x6e/0xd0
[100722.966614] ext4_commit_super+0x1d8/0x290 [ext4]
[100722.966626] __ext4_std_error+0x78/0x100 [ext4]
[100722.966635] ? __ext4_journal_get_write_access+0xca/0x120 [ext4]
[100722.966646] ext4_reserve_inode_write+0x58/0xb0 [ext4]
[100722.966655] ? ext4_dirty_inode+0x48/0x70 [ext4]
[100722.966663] ext4_mark_inode_dirty+0x53/0x1e0 [ext4]
[100722.966671] ? __ext4_journal_start_sb+0x6d/0xf0 [ext4]
[100722.966679] ext4_dirty_inode+0x48/0x70 [ext4]
[100722.966682] __mark_inode_dirty+0x17f/0x350
[100722.966686] generic_update_time+0x87/0xd0
[100722.966687] touch_atime+0xa9/0xd0
[100722.966690] generic_file_read_iter+0xa09/0xcd0
[100722.966694] ? page_cache_tree_insert+0xb0/0xb0
[100722.966704] ext4_file_read_iter+0x4a/0x100 [ext4]
[100722.966707] ? __inode_security_revalidate+0x4f/0x60
[100722.966709] __vfs_read+0xec/0x160
[100722.966711] vfs_read+0x8c/0x130
[100722.966712] SyS_pread64+0x87/0xb0
[100722.966716] do_syscall_64+0x67/0x1b0
[100722.966719] entry_SYSCALL64_slow_path+0x25/0x25
To address this, add the check of 'buffer_mapped(bh)' to
__sync_dirty_buffer(). This also has the benefit of fixing this for
other file systems.
With this addition, we can drop the workaround in ext4_commit_supper().
[ Commit description rewritten by tytso. ]
Signed-off-by: Xianting Tian <xianting_tian@126.com>
Link: https://lore.kernel.org/r/1596211825-8750-1-git-send-email-xianting_tian@126.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
If xfs_sysfs_init is called with parent_kobj == NULL, UBSAN
shows the following warning:
UBSAN: null-ptr-deref in ./fs/xfs/xfs_sysfs.h:37:23
member access within null pointer of type 'struct xfs_kobj'
Call Trace:
dump_stack+0x10e/0x195
ubsan_type_mismatch_common+0x241/0x280
__ubsan_handle_type_mismatch_v1+0x32/0x40
init_xfs_fs+0x12b/0x28f
do_one_initcall+0xdd/0x1d0
do_initcall_level+0x151/0x1b6
do_initcalls+0x50/0x8f
do_basic_setup+0x29/0x2b
kernel_init_freeable+0x19f/0x20b
kernel_init+0x11/0x1e0
ret_from_fork+0x22/0x30
Fix it by checking parent_kobj before the code accesses its member.
Signed-off-by: Eiichi Tsukata <devel@etsukata.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
[darrick: minor whitespace edits]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Merge misc updates from Andrew Morton:
- a few MM hotfixes
- kthread, tools, scripts, ntfs and ocfs2
- some of MM
Subsystems affected by this patch series: kthread, tools, scripts, ntfs,
ocfs2 and mm (hofixes, pagealloc, slab-generic, slab, slub, kcsan,
debug, pagecache, gup, swap, shmem, memcg, pagemap, mremap, mincore,
sparsemem, vmalloc, kasan, pagealloc, hugetlb and vmscan).
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (162 commits)
mm: vmscan: consistent update to pgrefill
mm/vmscan.c: fix typo
khugepaged: khugepaged_test_exit() check mmget_still_valid()
khugepaged: retract_page_tables() remember to test exit
khugepaged: collapse_pte_mapped_thp() protect the pmd lock
khugepaged: collapse_pte_mapped_thp() flush the right range
mm/hugetlb: fix calculation of adjust_range_if_pmd_sharing_possible
mm: thp: replace HTTP links with HTTPS ones
mm/page_alloc: fix memalloc_nocma_{save/restore} APIs
mm/page_alloc.c: skip setting nodemask when we are in interrupt
mm/page_alloc: fallbacks at most has 3 elements
mm/page_alloc: silence a KASAN false positive
mm/page_alloc.c: remove unnecessary end_bitidx for [set|get]_pfnblock_flags_mask()
mm/page_alloc.c: simplify pageblock bitmap access
mm/page_alloc.c: extract the common part in pfn_to_bitidx()
mm/page_alloc.c: replace the definition of NR_MIGRATETYPE_BITS with PB_migratetype_bits
mm/shuffle: remove dynamic reconfiguration
mm/memory_hotplug: document why shuffle_zone() is relevant
mm/page_alloc: remove nr_free_pagecache_pages()
mm: remove vm_total_pages
...
The current split between do_mmap() and do_mmap_pgoff() was introduced in
commit 1fcfd8db7f ("mm, mpx: add "vm_flags_t vm_flags" arg to
do_mmap_pgoff()") to support MPX.
The wrapper function do_mmap_pgoff() always passed 0 as the value of the
vm_flags argument to do_mmap(). However, MPX support has subsequently
been removed from the kernel and there were no more direct callers of
do_mmap(); all calls were going via do_mmap_pgoff().
Simplify the code by removing do_mmap_pgoff() and changing all callers to
directly call do_mmap(), which now no longer takes a vm_flags argument.
Signed-off-by: Peter Collingbourne <pcc@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Link: http://lkml.kernel.org/r/20200727194109.1371462-1-pcc@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "make vm_committed_as_batch aware of vm overcommit policy", v6.
When checking a performance change for will-it-scale scalability mmap test
[1], we found very high lock contention for spinlock of percpu counter
'vm_committed_as':
94.14% 0.35% [kernel.kallsyms] [k] _raw_spin_lock_irqsave
48.21% _raw_spin_lock_irqsave;percpu_counter_add_batch;__vm_enough_memory;mmap_region;do_mmap;
45.91% _raw_spin_lock_irqsave;percpu_counter_add_batch;__do_munmap;
Actually this heavy lock contention is not always necessary. The
'vm_committed_as' needs to be very precise when the strict
OVERCOMMIT_NEVER policy is set, which requires a rather small batch number
for the percpu counter.
So keep 'batch' number unchanged for strict OVERCOMMIT_NEVER policy, and
enlarge it for not-so-strict OVERCOMMIT_ALWAYS and OVERCOMMIT_GUESS
policies.
Benchmark with the same testcase in [1] shows 53% improvement on a 8C/16T
desktop, and 2097%(20X) on a 4S/72C/144T server. And for that case,
whether it shows improvements depends on if the test mmap size is bigger
than the batch number computed.
We tested 10+ platforms in 0day (server, desktop and laptop). If we lift
it to 64X, 80%+ platforms show improvements, and for 16X lift, 1/3 of the
platforms will show improvements.
And generally it should help the mmap/unmap usage,as Michal Hocko
mentioned:
: I believe that there are non-synthetic worklaods which would benefit
: from a larger batch. E.g. large in memory databases which do large
: mmaps during startups from multiple threads.
Note: There are some style complain from checkpatch for patch 4, as sysctl
handler declaration follows the similar format of sibling functions
[1] https://lore.kernel.org/lkml/20200305062138.GI5972@shao2-debian/
This patch (of 4):
Use the existing vm_memory_committed() instead, which is also convenient
for future change.
Signed-off-by: Feng Tang <feng.tang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Qian Cai <cai@lca.pw>
Cc: Kees Cook <keescook@chromium.org>
Cc: Andi Kleen <andi.kleen@intel.com>
Cc: Tim Chen <tim.c.chen@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: kernel test robot <rong.a.chen@intel.com>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/1594389708-60781-1-git-send-email-feng.tang@intel.com
Link: http://lkml.kernel.org/r/1594389708-60781-2-git-send-email-feng.tang@intel.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "mm: cleanup usage of <asm/pgalloc.h>"
Most architectures have very similar versions of pXd_alloc_one() and
pXd_free_one() for intermediate levels of page table. These patches add
generic versions of these functions in <asm-generic/pgalloc.h> and enable
use of the generic functions where appropriate.
In addition, functions declared and defined in <asm/pgalloc.h> headers are
used mostly by core mm and early mm initialization in arch and there is no
actual reason to have the <asm/pgalloc.h> included all over the place.
The first patch in this series removes unneeded includes of
<asm/pgalloc.h>
In the end it didn't work out as neatly as I hoped and moving
pXd_alloc_track() definitions to <asm-generic/pgalloc.h> would require
unnecessary changes to arches that have custom page table allocations, so
I've decided to move lib/ioremap.c to mm/ and make pgalloc-track.h local
to mm/.
This patch (of 8):
In most cases <asm/pgalloc.h> header is required only for allocations of
page table memory. Most of the .c files that include that header do not
use symbols declared in <asm/pgalloc.h> and do not require that header.
As for the other header files that used to include <asm/pgalloc.h>, it is
possible to move that include into the .c file that actually uses symbols
from <asm/pgalloc.h> and drop the include from the header file.
The process was somewhat automated using
sed -i -E '/[<"]asm\/pgalloc\.h/d' \
$(grep -L -w -f /tmp/xx \
$(git grep -E -l '[<"]asm/pgalloc\.h'))
where /tmp/xx contains all the symbols defined in
arch/*/include/asm/pgalloc.h.
[rppt@linux.ibm.com: fix powerpc warning]
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Pekka Enberg <penberg@kernel.org>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> [m68k]
Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Satheesh Rajendran <sathnaga@linux.vnet.ibm.com>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Matthew Wilcox <willy@infradead.org>
Link: http://lkml.kernel.org/r/20200627143453.31835-1-rppt@kernel.org
Link: http://lkml.kernel.org/r/20200627143453.31835-2-rppt@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently the kernel stack is being accounted per-zone. There is no need
to do that. In addition due to being per-zone, memcg has to keep a
separate MEMCG_KERNEL_STACK_KB. Make the stat per-node and deprecate
MEMCG_KERNEL_STACK_KB as memcg_stat_item is an extension of
node_stat_item. In addition localize the kernel stack stats updates to
account_kernel_stack().
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Roman Gushchin <guro@fb.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Link: http://lkml.kernel.org/r/20200630161539.1759185-1-shakeelb@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In order to prepare for per-object slab memory accounting, convert
NR_SLAB_RECLAIMABLE and NR_SLAB_UNRECLAIMABLE vmstat items to bytes.
To make it obvious, rename them to NR_SLAB_RECLAIMABLE_B and
NR_SLAB_UNRECLAIMABLE_B (similar to NR_KERNEL_STACK_KB).
Internally global and per-node counters are stored in pages, however memcg
and lruvec counters are stored in bytes. This scheme may look weird, but
only for now. As soon as slab pages will be shared between multiple
cgroups, global and node counters will reflect the total number of slab
pages. However memcg and lruvec counters will be used for per-memcg slab
memory tracking, which will take separate kernel objects in the account.
Keeping global and node counters in pages helps to avoid additional
overhead.
The size of slab memory shouldn't exceed 4Gb on 32-bit machines, so it
will fit into atomic_long_t we use for vmstats.
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/20200623174037.3951353-4-guro@fb.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The default is still set to inode32 for backwards compatibility, but
system administrators can opt in to the new 64-bit inode numbers by
either:
1. Passing inode64 on the command line when mounting, or
2. Configuring the kernel with CONFIG_TMPFS_INODE64=y
The inode64 and inode32 names are used based on existing precedent from
XFS.
[hughd@google.com: Kconfig fixes]
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2008011928010.13320@eggly.anvils
Signed-off-by: Chris Down <chris@chrisdown.name>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/8b23758d0c66b5e2263e08baf9c4b6a7565cbd8f.1594661218.git.chris@chrisdown.name
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
As said by Linus:
A symmetric naming is only helpful if it implies symmetries in use.
Otherwise it's actively misleading.
In "kzalloc()", the z is meaningful and an important part of what the
caller wants.
In "kzfree()", the z is actively detrimental, because maybe in the
future we really _might_ want to use that "memfill(0xdeadbeef)" or
something. The "zero" part of the interface isn't even _relevant_.
The main reason that kzfree() exists is to clear sensitive information
that should not be leaked to other future users of the same memory
objects.
Rename kzfree() to kfree_sensitive() to follow the example of the recently
added kvfree_sensitive() and make the intention of the API more explicit.
In addition, memzero_explicit() is used to clear the memory to make sure
that it won't get optimized away by the compiler.
The renaming is done by using the command sequence:
git grep -w --name-only kzfree |\
xargs sed -i 's/kzfree/kfree_sensitive/'
followed by some editing of the kfree_sensitive() kerneldoc and adding
a kzfree backward compatibility macro in slab.h.
[akpm@linux-foundation.org: fs/crypto/inline_crypt.c needs linux/slab.h]
[akpm@linux-foundation.org: fix fs/crypto/inline_crypt.c some more]
Suggested-by: Joe Perches <joe@perches.com>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: David Howells <dhowells@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
Cc: James Morris <jmorris@namei.org>
Cc: "Serge E. Hallyn" <serge@hallyn.com>
Cc: Joe Perches <joe@perches.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: "Jason A . Donenfeld" <Jason@zx2c4.com>
Link: http://lkml.kernel.org/r/20200616154311.12314-3-longman@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Based on what fails, function can return with nfs_sync_rwlock either
locked or unlocked. That can not be right.
Always return with lock unlocked on error.
Fixes: 4cd9973f9f ("ocfs2: avoid inode removal while nfsd is accessing it")
Signed-off-by: Pavel Machek (CIP) <pavel@denx.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Link: http://lkml.kernel.org/r/20200724124443.GA28164@duo.ucw.cz
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Rationale:
Reduces attack surface on kernel devs opening the links for MITM
as HTTPS traffic is much harder to manipulate.
Deterministic algorithm:
For each file:
If not .svg:
For each line:
If doesn't contain `xmlns`:
For each link, `http://[^# ]*(?:\w|/)`:
If neither `gnu\.org/license`, nor `mozilla\.org/MPL`:
If both the HTTP and HTTPS versions
return 200 OK and serve the same content:
Replace HTTP with HTTPS.
Signed-off-by: Alexander A. Klimov <grandmaster@al2klimov.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Link: http://lkml.kernel.org/r/20200713174456.36596-1-grandmaster@al2klimov.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Dan Carpenter reported the following static checker warning.
fs/ocfs2/super.c:1269 ocfs2_parse_options() warn: '(-1)' 65535 can't fit into 32767 'mopt->slot'
fs/ocfs2/suballoc.c:859 ocfs2_init_inode_steal_slot() warn: '(-1)' 65535 can't fit into 32767 'osb->s_inode_steal_slot'
fs/ocfs2/suballoc.c:867 ocfs2_init_meta_steal_slot() warn: '(-1)' 65535 can't fit into 32767 'osb->s_meta_steal_slot'
That's because OCFS2_INVALID_SLOT is (u16)-1. Slot number in ocfs2 can be
never negative, so change s16 to u16.
Fixes: 9277f8334f ("ocfs2: fix value of OCFS2_INVALID_SLOT")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Gang He <ghe@suse.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Jun Piao <piaojun@huawei.com>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/20200627001259.19757-1-junxiao.bi@oracle.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Drop the repeated word "is" in a comment.
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Link: http://lkml.kernel.org/r/20200720001421.28823-1-rdunlap@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When use setfacl command to change a file's acl, the user cannot get the
latest acl information from the file via getfacl command, until remounting
the file system.
e.g.
setfacl -m u:ivan:rw /ocfs2/ivan
getfacl /ocfs2/ivan
getfacl: Removing leading '/' from absolute path names
file: ocfs2/ivan
owner: root
group: root
user::rw-
group::r--
mask::r--
other::r--
The latest acl record("u:ivan:rw") cannot be returned via getfacl
command until remounting.
Signed-off-by: Gang He <ghe@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Jun Piao <piaojun@huawei.com>
Link: http://lkml.kernel.org/r/20200717023751.9922-1-ghe@suse.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Clang's Control Flow Integrity (CFI) is a security mechanism that can help
prevent JOP chains, deployed extensively in downstream kernels used in
Android.
Its deployment is hindered by mismatches in function signatures. For this
case, we make callbacks match their intended function signature, and cast
parameters within them rather than casting the callback when passed as a
parameter.
When running `mount -t ntfs ...` we observe the following trace:
Call trace:
__cfi_check_fail+0x1c/0x24
name_to_dev_t+0x0/0x404
iget5_locked+0x594/0x5e8
ntfs_fill_super+0xbfc/0x43ec
mount_bdev+0x30c/0x3cc
ntfs_mount+0x18/0x24
mount_fs+0x1b0/0x380
vfs_kern_mount+0x90/0x398
do_mount+0x5d8/0x1a10
SyS_mount+0x108/0x144
el0_svc_naked+0x34/0x38
Signed-off-by: Luca Stefani <luca.stefani.ge1@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: freak07 <michalechner92@googlemail.com>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Reviewed-by: Nathan Chancellor <natechancellor@gmail.com>
Acked-by: Anton Altaparmakov <anton@tuxera.com>
Link: http://lkml.kernel.org/r/20200718112513.533800-1-luca.stefani.ge1@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When remounting filesystem fails late during remount handling and
block_validity mount option is also changed during the remount, we fail
to restore system zone information to a state matching the mount option.
This is mostly harmless, just the block validity checking will not match
the situation described by the mount option. Make sure these two are always
consistent.
Reported-by: Lukas Czerner <lczerner@redhat.com>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20200728130437.7804-7-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
There's one place that fails to handle error from add_system_zone() call
and thus we can fail to protect superblock and group-descriptor blocks
properly in case of ENOMEM. Fix it.
Reported-by: Lukas Czerner <lczerner@redhat.com>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20200728130437.7804-6-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
After the previous patch, ext4_data_block_valid_rcu() has a single
caller. Fold it into it.
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20200728130437.7804-5-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Currently, system zones just track ranges of block, that are "important"
fs metadata (bitmaps, group descriptors, journal blocks, etc.). This
however complicates how extent tree (or indirect blocks) can be checked
for inodes that actually track such metadata - currently the journal
inode but arguably we should be treating quota files or resize inode
similarly. We cannot run __ext4_ext_check() on such metadata inodes when
loading their extents as that would immediately trigger the validity
checks and so we just hack around that and special-case the journal
inode. This however leads to a situation that a journal inode which has
extent tree of depth at least one can have invalid extent tree that gets
unnoticed until ext4_cache_extents() crashes.
To overcome this limitation, track inode number each system zone belongs
to (0 is used for zones not belonging to any inode). We can then verify
inode number matches the expected one when verifying extent tree and
thus avoid the false errors. With this there's no need to to
special-case journal inode during extent tree checking anymore so remove
it.
Fixes: 0a944e8a6c ("ext4: don't perform block validity checks on the journal inode")
Reported-by: Wolfgang Frisch <wolfgang.frisch@suse.com>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20200728130437.7804-4-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Currently, add_system_zone() just silently merges two added system zones
that overlap. However the overlap should not happen and it generally
suggests that some unrelated metadata overlap which indicates the fs is
corrupted. We should have caught such problems earlier (e.g. in
ext4_check_descriptors()) but add this check as another line of defense.
In later patch we also use this for stricter checking of journal inode
extent tree.
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20200728130437.7804-3-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
ext4_setup_system_zone() can fail. Handle the failure in ext4_remount().
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20200728130437.7804-2-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
This numbers can be analized by system automation similar to errors_count.
In ideal world it would be nice to have separate counters for different
log-levels, but this makes this patch too intrusive.
Signed-off-by: Dmitry Monakhov <dmtrmonakhov@yandex-team.ru>
Link: https://lore.kernel.org/r/20200725123313.4467-1-dmtrmonakhov@yandex-team.ru
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Currently there is a problem with mount options that can be both set by
vfs using mount flags or by a string parsing in ext4.
i_version/iversion options gets lost after remount, for example
$ mount -o i_version /dev/pmem0 /mnt
$ grep pmem0 /proc/self/mountinfo | grep i_version
310 95 259:0 / /mnt rw,relatime shared:163 - ext4 /dev/pmem0 rw,seclabel,i_version
$ mount -o remount,ro /mnt
$ grep pmem0 /proc/self/mountinfo | grep i_version
nolazytime gets ignored by ext4 on remount, for example
$ mount -o lazytime /dev/pmem0 /mnt
$ grep pmem0 /proc/self/mountinfo | grep lazytime
310 95 259:0 / /mnt rw,relatime shared:163 - ext4 /dev/pmem0 rw,lazytime,seclabel
$ mount -o remount,nolazytime /mnt
$ grep pmem0 /proc/self/mountinfo | grep lazytime
310 95 259:0 / /mnt rw,relatime shared:163 - ext4 /dev/pmem0 rw,lazytime,seclabel
Fix it by applying the SB_LAZYTIME and SB_I_VERSION flags from *flags to
s_flags before we parse the option and use the resulting state of the
same flags in *flags at the end of successful remount.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/20200723150526.19931-1-lczerner@redhat.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
For file systems where we can afford to keep the buddy bitmaps cached,
we can speed up initial writes to large file systems by starting to
load the block allocation bitmaps as soon as the file system is
mounted. This won't work well for _super_ large file systems, or
memory constrained systems, so we only enable this when it is
requested via a mount option.
Addresses-Google-Bug: 159488342
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Modify the ext4_read_block_bitmap_load tracepoint so that it tells us
whether a block bitmap is being prefetched.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Artem Blagodarenko <artem.blagodarenko@gmail.com>
Parameter gfp_mask in jbd2_journal_try_to_free_buffers() is no longer
used after commit <536fc240e7147> ("jbd2: clean up
jbd2_journal_try_to_free_buffers()"), so just remove it.
Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Link: https://lore.kernel.org/r/20200620025427.1756360-6-yi.zhang@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
If we free a metadata buffer which has been failed to async write out
in the background, the jbd2 checkpoint procedure will not detect this
failure in jbd2_log_do_checkpoint(), so it may lead to filesystem
inconsistency after cleanup journal tail. This patch abort the journal
if free a buffer has write_io_error flag to prevent potential further
inconsistency.
Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Link: https://lore.kernel.org/r/20200620025427.1756360-5-yi.zhang@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
There is a risk of filesystem inconsistency if we failed to async write
back metadata buffer in the background. Because of current buffer's end
io procedure is handled by end_buffer_async_write() in the block layer,
and it only clear the buffer's uptodate flag and mark the write_io_error
flag, so ext4 cannot detect such failure immediately. In most cases of
getting metadata buffer (e.g. ext4_read_inode_bitmap()), although the
buffer's data is actually uptodate, it may still read data from disk
because the buffer's uptodate flag has been cleared. Finally, it may
lead to on-disk filesystem inconsistency if reading old data from the
disk successfully and write them out again.
This patch detect bdev mapping->wb_err when getting journal's write
access and mark the filesystem error if bdev's mapping->wb_err was
increased, this could prevent further writing and potential
inconsistency.
Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Suggested-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20200620025427.1756360-2-yi.zhang@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
- Fix some btree block pingponging problems when swapping extents
- Redesign the reflink copy loop so that we only run one remapping
operation per transaction. This helps us avoid running out of block
reservation on highly deduped filesystems.
- Take the MMAPLOCK around filemap_map_pages.
- Make inode reclaim fully async so that we avoid stalling processes on
flushing inodes to disk.
- Reduce inode cluster buffer RMW cycles by attaching the buffer to
dirty inodes so we won't let go of the cluster buffer when we know
we're going to need it soon.
- Add some more checks to the realtime bitmap file scrubber.
- Don't trip false lockdep warnings in fs freeze.
- Remove various redundant lines of code.
- Remove unnecessary calls to xfs_perag_{get,put}.
- Preserve I_VERSION state across remounts.
- Fix an unmount hang due to AIL going to sleep with a non-empty delwri
buffer list.
- Fix an error in the inode allocation space reservation macro that
caused regressions in generic/531.
- Fix a potential livelock when dquot flush fails because the dquot
buffer is locked.
- Fix a miscalculation when reserving inode quota that could cause users
to exceed a hardlimit.
- Refactor struct xfs_dquot to use native types for incore fields
instead of abusing the ondisk struct for this purpose. This will
eventually enable proper y2038+ support, but for now it merely cleans
up the quota function declarations.
- Actually increment the quota softlimit warning counter so that soft
failures turn into hard(er) failures when they exceed the softlimit
warning counter limits set by the administrator.
- Split incore dquot state flags into their own field and namespace, to
avoid mixing them with quota type flags.
- Create a new quota type flags namespace so that we can make it obvious
when a quota function takes a quota type (user, group, project) as an
argument.
- Rename the ondisk dquot flags field to type, as that more accurately
represents what we store in it.
- Drop our bespoke memory allocation flags in favor of GFP_*.
- Rearrange the xattr functions so that we no longer mix metadata
updates and transaction management (e.g. rolling complex transactions)
in the same functions. This work will prepare us for atomic xattr
operations (itself a prerequisite for directory backrefs) in future
release cycles.
- Support FS_DAX_FL (aka FS_XFLAG_DAX) via GETFLAGS/SETFLAGS.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAl8hmOsACgkQ+H93GTRK
tOvCbQ//ax1IR0Bdfz0G85ouNqOqjqpKmtjLJWl4tK0e99AtIFFyjyst0BmiPM61
M2ebqrQ4KtwGcnqPMVczxift4MRfsK1T2WmivuF6GpUTJjEcfo/qDjwPgFT7Gdfc
gVCKWozFnv7z8cOVmRxP3jQR+r32FMnc4Nf8ZZ4LO2gGAqfDySZKFJXjkywR5ETk
rE0BivsXKqldbSA0nibMwmxNIWn+tBE+Bv3rSDAd6ZWEKbBZkrxf5GW+GkBD/xon
HT+T8lKFG5F+9kmL+BRhtV2eZkcbAOmP6x6NTX0SZcXlX2BBT7ltBusjal0lK0uh
IizqZv5UG6S2cv0j69EbXm3gvXBs+okAGLRtIPAExfWXf/x0JZp6dNbsnwwI0ZSC
N1Uy3lAcyF1Wybuf/6w4bMu8zrVcty8wSOD5psL4GXnhvw9c8iASszwGOUnOUH84
jRZYJNE9jsdItjP/5hNANaidPBnapPxWY4nvqJN4H3FjqTQOakE4X2OSB0LREIDp
avAFISrlU2dl0AwMHxCDOlv56FkHsIZ8aJmCpGPQdYpGu8jjYWTUYKvUvcjH/NBW
aVL7pUgP5r3vIxawS9tJcy1t3t7JDZmC+w5oasmQ0Rsk1r5mTgNrP6xue6rZzaRg
wo6mWZvteoWtEZJnT+L4Glx7i1srMj++Dqu24cJ92o/omA+fmr4=
=nhPQ
-----END PGP SIGNATURE-----
Merge tag 'xfs-5.9-merge-7' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs updates from Darrick Wong:
"There are quite a few changes in this release, the most notable of
which is that we've made inode flushing fully asynchronous, and we no
longer block memory reclaim on this.
Furthermore, we have fixed a long-standing bug in the quota code where
soft limit warnings and inode limits were never tracked properly.
Moving further down the line, the reflink control loops have been
redesigned to behave more efficiently; and numerous small bugs have
been fixed (see below). The xattr and quota code have been extensively
refactored in preparation for more new features coming down the line.
Finally, the behavior of DAX between ext4 and xfs has been stabilized,
which gets us a step closer to removing the experimental tag from that
feature.
We have a few new contributors this time around. Welcome, all!
I anticipate a second pull request next week for a few small bugfixes
that have been trickling in, but this is it for big changes.
Summary:
- Fix some btree block pingponging problems when swapping extents
- Redesign the reflink copy loop so that we only run one remapping
operation per transaction. This helps us avoid running out of block
reservation on highly deduped filesystems.
- Take the MMAPLOCK around filemap_map_pages.
- Make inode reclaim fully async so that we avoid stalling processes
on flushing inodes to disk.
- Reduce inode cluster buffer RMW cycles by attaching the buffer to
dirty inodes so we won't let go of the cluster buffer when we know
we're going to need it soon.
- Add some more checks to the realtime bitmap file scrubber.
- Don't trip false lockdep warnings in fs freeze.
- Remove various redundant lines of code.
- Remove unnecessary calls to xfs_perag_{get,put}.
- Preserve I_VERSION state across remounts.
- Fix an unmount hang due to AIL going to sleep with a non-empty
delwri buffer list.
- Fix an error in the inode allocation space reservation macro that
caused regressions in generic/531.
- Fix a potential livelock when dquot flush fails because the dquot
buffer is locked.
- Fix a miscalculation when reserving inode quota that could cause
users to exceed a hardlimit.
- Refactor struct xfs_dquot to use native types for incore fields
instead of abusing the ondisk struct for this purpose. This will
eventually enable proper y2038+ support, but for now it merely
cleans up the quota function declarations.
- Actually increment the quota softlimit warning counter so that soft
failures turn into hard(er) failures when they exceed the softlimit
warning counter limits set by the administrator.
- Split incore dquot state flags into their own field and namespace,
to avoid mixing them with quota type flags.
- Create a new quota type flags namespace so that we can make it
obvious when a quota function takes a quota type (user, group,
project) as an argument.
- Rename the ondisk dquot flags field to type, as that more
accurately represents what we store in it.
- Drop our bespoke memory allocation flags in favor of GFP_*.
- Rearrange the xattr functions so that we no longer mix metadata
updates and transaction management (e.g. rolling complex
transactions) in the same functions. This work will prepare us for
atomic xattr operations (itself a prerequisite for directory
backrefs) in future release cycles.
- Support FS_DAX_FL (aka FS_XFLAG_DAX) via GETFLAGS/SETFLAGS"
* tag 'xfs-5.9-merge-7' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (117 commits)
fs/xfs: Support that ioctl(SETXFLAGS/GETXFLAGS) can set/get inode DAX on XFS.
xfs: Lift -ENOSPC handler from xfs_attr_leaf_addname
xfs: Simplify xfs_attr_node_addname
xfs: Simplify xfs_attr_leaf_addname
xfs: Add helper function xfs_attr_node_removename_rmt
xfs: Add helper function xfs_attr_node_removename_setup
xfs: Add remote block helper functions
xfs: Add helper function xfs_attr_leaf_mark_incomplete
xfs: Add helpers xfs_attr_is_shortform and xfs_attr_set_shortform
xfs: Remove xfs_trans_roll in xfs_attr_node_removename
xfs: Remove unneeded xfs_trans_roll_inode calls
xfs: Add helper function xfs_attr_node_shrink
xfs: Pull up xfs_attr_rmtval_invalidate
xfs: Refactor xfs_attr_rmtval_remove
xfs: Pull up trans roll in xfs_attr3_leaf_clearflag
xfs: Factor out xfs_attr_rmtval_invalidate
xfs: Pull up trans roll from xfs_attr3_leaf_setflag
xfs: Refactor xfs_attr_try_sf_addname
xfs: Split apart xfs_attr_leaf_addname
xfs: Pull up trans handling in xfs_attr3_leaf_flipflags
...
Pull init and set_fs() cleanups from Al Viro:
"Christoph's 'getting rid of ksys_...() uses under KERNEL_DS' series"
* 'hch.init_path' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (50 commits)
init: add an init_dup helper
init: add an init_utimes helper
init: add an init_stat helper
init: add an init_mknod helper
init: add an init_mkdir helper
init: add an init_symlink helper
init: add an init_link helper
init: add an init_eaccess helper
init: add an init_chmod helper
init: add an init_chown helper
init: add an init_chroot helper
init: add an init_chdir helper
init: add an init_rmdir helper
init: add an init_unlink helper
init: add an init_umount helper
init: add an init_mount helper
init: mark create_dev as __init
init: mark console_on_rootfs as __init
init: initialize ramdisk_execute_command at compile time
devtmpfs: refactor devtmpfsd()
...
Pull ptrace regset updates from Al Viro:
"Internal regset API changes:
- regularize copy_regset_{to,from}_user() callers
- switch to saner calling conventions for ->get()
- kill user_regset_copyout()
The ->put() side of things will have to wait for the next cycle,
unfortunately.
The balance is about -1KLoC and replacements for ->get() instances are
a lot saner"
* 'work.regset' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (41 commits)
regset: kill user_regset_copyout{,_zero}()
regset(): kill ->get_size()
regset: kill ->get()
csky: switch to ->regset_get()
xtensa: switch to ->regset_get()
parisc: switch to ->regset_get()
nds32: switch to ->regset_get()
nios2: switch to ->regset_get()
hexagon: switch to ->regset_get()
h8300: switch to ->regset_get()
openrisc: switch to ->regset_get()
riscv: switch to ->regset_get()
c6x: switch to ->regset_get()
ia64: switch to ->regset_get()
arc: switch to ->regset_get()
arm: switch to ->regset_get()
sh: convert to ->regset_get()
arm64: switch to ->regset_get()
mips: switch to ->regset_get()
sparc: switch to ->regset_get()
...
Before this patch, if function gfs2_dirty_inode got an error when
trying to lock the inode glock, it complained, but it didn't say
what glock or inode had the problem.
In this case, it almost always means that dinode_in found an error
with the dinode in the file system. So it makes sense to dump the
glock, which tells us the location of the dinode in the file system.
That will allow us to analyze the corruption from the metadata.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Before this patch, some functions started transactions then they called
gfs2_block_zero_range. However, gfs2_block_zero_range, like writes, can
start transactions, which results in a recursive transaction error.
For example:
do_shrink
trunc_start
gfs2_trans_begin <------------------------------------------------
gfs2_block_zero_range
iomap_zero_range(inode, from, length, NULL, &gfs2_iomap_ops);
iomap_apply ... iomap_zero_range_actor
iomap_begin
gfs2_iomap_begin
gfs2_iomap_begin_write
actor (iomap_zero_range_actor)
iomap_zero
iomap_write_begin
gfs2_iomap_page_prepare
gfs2_trans_begin <------------------------
This patch reorders the callers of gfs2_block_zero_range so that they
only start their transactions after the call. It also adds a BUG_ON to
ensure this doesn't happen again.
Fixes: 2257e468a6 ("gfs2: implement gfs2_block_zero_range using iomap_zero_range")
Cc: stable@vger.kernel.org # v5.5+
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
If function gfs2_trans_begin is called with another transaction active
it BUGs out, but it doesn't give any details about the duplicate.
This patch moves function gfs2_print_trans and calls it when this
situation arises for better debugging.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
The comment regarding journal flush thresholds is wrong. This patch fixes it.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
The error handling calls kfree(full_path) so we can't let it be a NULL
pointer. There used to be a NULL assignment here but we accidentally
deleted it. Add it back.
Fixes: 7efd081582 ("cifs: document and cleanup dfs mount")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
This set includes a some improvements to the dlm
networking layer: improving the ability to trace
dlm messages for debugging, and improved handling
of bad messages or disrupted connections.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJfLCPxAAoJEDgbc8f8gGmqz04P/2hvv/4rXo9AOgnnstvZV1Qy
Yo01Cy807vB1c3jhIJryM2gG61GNH22RAHc2NcfjJwy04HH/1IEr6P48Po3qYEnS
8fZ8B9msxpsujVOrRoeBuLN8elI1HftyNVWaVjH7xtD+fLCDLu9i10kv3aeS+DiB
T6f7yQQv7hgXS3xGvlMr2//aLwGD2ZdcRbkOEGo+k7yUjQbIDH/wdZWcPLh6y4yT
p20i2ulYKjEZFmXDMa17diONISeGO6iaDhee24XPDwNDp8qI1iPGJsmxltMmn8Qf
d2HPF1IDh4eM8lCwmqBtjYTnJd6rAW0v3+Ek1+wzQKVeXLFiz/MEyuOldtpsqmMO
8Og0vr6zfTCjFo8uvyj+cF7Fcj0yIPWg1yb7EauqqxreK8V9GBA1V2ZXYVd8xwea
thrAUaq8f+PYQ9uy1FsN3xaO3BFN1VpcvHu4/3gU3OudnZZt2Ae670RYHKC0bq8D
2tSsqaiDnlvniHgh4xvtNIvRANkDS1ZSbkUPZhMHL7DnRJn66oDIfCr7NMbZwvCa
AS0q6suUFyXFbAEJcY6XWxe3aQ3WuxIClT84MgzX/dAK2Qcl8ryWGGSVc0dp4Vl1
cd8MtmpnIWsnxqNRl4jn6cfolDheaxL8nouLtJ+3/dC9VkyDyfmrtnM+8aTZKHoa
3/xrBuVkEJAwkAAr8Pb8
=qgti
-----END PGP SIGNATURE-----
Merge tag 'dlm-5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm
Pull dlm updates from David Teigland:
"This set includes a some improvements to the dlm networking layer:
improving the ability to trace dlm messages for debugging, and
improved handling of bad messages or disrupted connections"
* tag 'dlm-5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm:
fs: dlm: implement tcp graceful shutdown
fs: dlm: change handling of reconnects
fs: dlm: don't close socket on invalid message
fs: dlm: set skb mark per peer socket
fs: dlm: set skb mark for listen socket
net: sock: add sock_set_mark
dlm: Fix kobject memleak
- Make sure we call ->iomap_end with a failure code if ->iomap_begin
failed in any way; some filesystems need to try to undo things.
- Don't invalidate the page cache during direct reads since we already
sync'd the cache with disk.
- Make direct writes fall back to the page cache if the pre-write
cache invalidation fails. This avoids a cache coherency problem.
- Fix some idiotic virus scanner warning bs in the previous tag.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAl8q3aMACgkQ+H93GTRK
tOvh9BAAkUF11er5pSefKdM1t2WGlSjMgPRmMHELRFwLhJBPK1zyIIs+kCz9+k1/
lGe8mAEI7cA06jiUXYCbHZW1Cgno46VYZxWVnIE3i7c3xYt8pwApqwY+ATqH6X75
7peax9L0Dn8DK7mzw6ihcO6LCIH0iyfHeBpWyKN87APBhKU6nNtVah/I/3NGnbWJ
EbH6TSf4FWqzBvYJZKUQRqrGZRJWUinrRAqLnh2fWxVcjUDLVTnbjWxAuL0StgWB
H1AY3dof9ROYK3SKFNPqtur8nXcrHNCvnOSgQmB8F++ZkfsubR1MREWpndBJTHnd
/a5zNveiQGvA8drM1+2v/QLd30yp3I+LHSlM+BY5Bc/Xl8c2ZanwAhu+x40Ha6qq
rjsh31Hdn6E4qzP+ne+eVSWyPPHNZCK0i7gBWlTBodlJyHN70N1RBCfGBnO2VVbt
fZCxn6kxLYrfKEoQVQS+9QGu3cRSh7yYsLGjWoK5iynsVJCOvMjmTZ6uPL2EWAEY
9oz/QRxyTaVit1sgk0ypsrfZ4yFafI5QIDCLM9pHpxgj0QNddO2smAyKO2WItZ90
ERz/0UYg1LJoEl4lmBwHoYAI3aU37FyO9UhjgTIJSZeLZbnK1aba9uikwgrSmS/c
XLVy0WyPWd/JMBhA0EAAaQFBa1D6gTdTskSG8Djl1saiYNu6kGs=
=rjsZ
-----END PGP SIGNATURE-----
Merge tag 'iomap-5.9-merge-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull iomap updates from Darrick Wong:
"The most notable changes are:
- iomap no longer invalidates the page cache when performing a direct
read, since doing so is unnecessary and the old directio code
doesn't do that either.
- iomap embraced the use of returning ENOTBLK from a direct write to
trigger falling back to a buffered write since ext4 already did
this and btrfs wants it for their port.
- iomap falls back to buffered writes if we're doing a direct write
and the page cache invalidation after the flush fails; this was
necessary to handle a corner case in the btrfs port.
- Remove email virus scanner detritus that was accidentally included
in yesterday's pull request. Clearly I need(ed) to update my git
branch checker scripts. :("
* tag 'iomap-5.9-merge-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
iomap: fall back to buffered writes for invalidation failures
xfs: use ENOTBLK for direct I/O to buffered I/O fallback
iomap: Only invalidate page cache pages on direct IO writes
iomap: Make sure iomap_end is called after iomap_begin
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAl8qeCkACgkQnJ2qBz9k
QNlAGQf/YVruyVLZ7kCv6EMCHauXm3K1lEGpbXsTW04HpStxGx7mtLGN/Au+EYJR
VnRkCMt6TSMQGMBkNF83dUCwXHkeL1rd6frJBLVOErkg50nUuD4kjTVw9Lzw9itx
CPhKnPPlsRkDkZPxkg3WEdqPgzJREWBZUaB38QUPjYN46q7HfPYDANTh5wI1GiGs
27+PvzlttjhkQpQ14pYU/nu4xf/nmgmmHhgfsJArQP2EzYOrKxsWKhXS5uPdtNlf
mXiZMaqW2AlyDGlw3myOEySrrSuaR77M2bzDo7mjqffI9wSVTytKEhtg0i8OMWmv
pZ38OQobznnFoqzc1GL70IE0DEU48g==
=d81d
-----END PGP SIGNATURE-----
Merge tag 'fsnotify_for_v5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull fsnotify updates from Jan Kara:
- fanotify fix for softlockups when there are many queued events
- performance improvement to reduce fsnotify overhead when not used
- Amir's implementation of fanotify events with names. With these you
can now efficiently monitor whole filesystem, eg to mirror changes to
another machine.
* tag 'fsnotify_for_v5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: (37 commits)
fanotify: compare fsid when merging name event
fsnotify: create method handle_inode_event() in fsnotify_operations
fanotify: report parent fid + child fid
fanotify: report parent fid + name + child fid
fanotify: add support for FAN_REPORT_NAME
fanotify: report events with parent dir fid to sb/mount/non-dir marks
fanotify: add basic support for FAN_REPORT_DIR_FID
fsnotify: remove check that source dentry is positive
fsnotify: send event with parent/name info to sb/mount/non-dir marks
audit: do not set FS_EVENT_ON_CHILD in audit marks mask
inotify: do not set FS_EVENT_ON_CHILD in non-dir mark mask
fsnotify: pass dir and inode arguments to fsnotify()
fsnotify: create helper fsnotify_inode()
fsnotify: send event to parent and child with single callback
inotify: report both events on parent and child with single callback
dnotify: report both events on parent and child with single callback
fanotify: no external fh buffer in fanotify_name_event
fanotify: use struct fanotify_info to parcel the variable size buffer
fsnotify: add object type "child" to object type iterator
fanotify: use FAN_EVENT_ON_CHILD as implicit flag on sb/mount/non-dir marks
...
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAl8qdtkACgkQnJ2qBz9k
QNkNbQgAiLy3zzqBT9noZ5WEI8VzStsRDUyccbzaCIbSrqv7sBbf2ey+iaE9V5gR
HCNZtTSBChMyzpGt1j9l+1/a/0ntzcypb74+kRWi6eApqGh6X8tCggjqIKloy5Bg
jAkYHpvjz1Dpv1qdOWgcCI76XkF8Q+bID4HjsbvxKr4dEVaqlTictZhwtk2oonRN
paREsiwSvjdCEZ/3r2FO4kYAtxMD+x2KhImu/UHJKG92GsQiC4IY5zJmy9aV4gw+
16Z46PtYmzvYli59m2NQgCY5j95dL2VBmjtjFoxMOsUgb76PcqVAhfNeYVo0rmYU
vfs5ngYdxDjYFBCbg45Fu+zO3ploTQ==
=zoom
-----END PGP SIGNATURE-----
Merge tag 'for_v5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull ext2, udf, reiserfs, quota cleanups and minor fixes from Jan Kara:
"A few ext2 fixups and then several (mostly comment and documentation)
cleanups in ext2, udf, reiserfs, and quota"
* tag 'for_v5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
reiserfs: delete duplicated words
udf: osta_udf.h: delete a duplicated word
reiserfs: reiserfs.h: delete a duplicated word
ext2: ext2.h: fix duplicated word + typos
udf: Replace HTTP links with HTTPS ones
quota: Fixup http links in quota doc
Replace HTTP links with HTTPS ones: DISKQUOTA
ext2: initialize quota info in ext2_xattr_set()
ext2: fix some incorrect comments in inode.c
ext2: remove nocheck option
ext2: fix missing percpu_counter_inc
ext2: ext2_find_entry() return -ENOENT if no entry found
ext2: propagate errors up to ext2_find_entry()'s callers
ext2: fix improper assignment for e_value_offs
- use HTTPS links instead of insecure HTTP ones;
- fix crossing page boundary on specific extended inodes;
- remove useless WQ_CPU_INTENSIVE flag for unbound wq;
- minor cleanup.
-----BEGIN PGP SIGNATURE-----
iIsEABYIADMWIQThPAmQN9sSA0DVxtI5NzHcH7XmBAUCXytmoBUcaHNpYW5na2Fv
QHJlZGhhdC5jb20ACgkQOTcx3B+15gSg8gEA/LwZy3e/Tnor9CP2Mc+QSMPmuhvX
ZwsxOyYqYGkVtlcBAMLKiBu96hqH+V3AOPHNfqS19N3fdjs34CEp/wbl1x8G
=I/Yp
-----END PGP SIGNATURE-----
Merge tag 'erofs-for-5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs
Pull erofs updates from Gao Xiang:
"This cycle mainly addresses an issue out of some extended inode with
designated location, which are not generated by current mkfs but need
to handled at runtime anyway. The others are quite trivial ones.
- use HTTPS links instead of insecure HTTP ones;
- fix crossing page boundary on specific extended inodes;
- remove useless WQ_CPU_INTENSIVE flag for unbound wq;
- minor cleanup"
* tag 'erofs-for-5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
erofs: remove WQ_CPU_INTENSIVE flag from unbound wq's
erofs: fold in used-once helper erofs_workgroup_unfreeze_final()
erofs: fix extended inode could cross boundary
erofs: Replace HTTP links with HTTPS ones
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAl8ojNUACgkQiiy9cAdy
T1Fkxwv/fwoI+RGQyIRYeFSr85yqnbDRG/tBFUdsiOxjvKwzzINg0ECGpmkxpuwb
npICFmjmhzApXiTnEj9Jm81Qk50W0L1sQtjgPe14ROuRdzbhooVdQM2Y7LCCUwjm
8TzVWRyNkYrwg95hhfYTlqXCIrH9aNYv6UbFlRgeoT4fTgqiWZysDOlnt9wVwXtu
O5SXy8DBn9SRehtzVvXkMXosirwgMJ1QjHwyGuyVpsiQRGEPy7jWe1s90fxhWUNA
CejrD/OURNCgVO8B2DlygCl0RqsE3yqQA8IM+0tFnEeOXkYW9GdRoEZlLHGRT2Nf
HqE/lxpJqip1ykmAtbzLH1g+nmEkrXjQB4+q7krZnuw1L132dbP5hl/x+NJR8wsU
ymjuou96dCedNVemwbqjRvNBSHA1IiqQP0B9bPJCqsBMgX6HAZgDK3gVrlDqR8gt
7M/f935YRu2I/aE810Bw4SOTYkEdLKJyrw5fQFQO+mZGDe6zbtabMvflpxW5jz5m
sRLjOY/X
=X8aQ
-----END PGP SIGNATURE-----
Merge tag '5.9-rc-smb3-fixes-part1' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs updates from Steve French:
"16 cifs/smb3 fixes, about half DFS related, two fixes for stable.
Still working on and testing an additional set of fixes (including
updates to mount, and some fallocate scenario improvements) for later
in the merge window"
* tag '5.9-rc-smb3-fixes-part1' of git://git.samba.org/sfrench/cifs-2.6:
cifs: document and cleanup dfs mount
cifs: only update prefix path of DFS links in cifs_tree_connect()
cifs: fix double free error on share and prefix
cifs: handle RESP_GET_DFS_REFERRAL.PathConsumed in reconnect
cifs: handle empty list of targets in cifs_reconnect()
cifs: rename reconn_inval_dfs_target()
cifs: reduce number of referral requests in DFS link lookups
cifs: merge __{cifs,smb2}_reconnect[_tcon]() into cifs_tree_connect()
cifs: convert to use be32_add_cpu()
cifs: delete duplicated words in header files
cifs: Remove the superfluous break
cifs: smb1: Try failing back to SetFileInfo if SetPathInfo fails
cifs`: handle ERRBaduid for SMB1
cifs: remove unused variable 'server'
smb3: warn on confusing error scenario with sec=krb5
cifs: Fix leak when handling lease break for cached root fid
During my code inspection I saw there is no implementation of a graceful
shutdown for tcp. This patch will introduce a graceful shutdown for tcp
connections. The shutdown is implemented synchronized as
dlm_lowcomms_stop() is called to end all dlm communication. After shutdown
is done, a lot of flush and closing functionality will be called. However
I don't see a problem with that.
The waitqueue for synchronize the shutdown has a timeout of 10 seconds, if
timeout a force close will be exectued.
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
This patch changes the handling of reconnects. At first we only close
the connection related to the communication failure. If we get a new
connection for an already existing connection we close the existing
connection and take the new one.
This patch improves significantly the stability of tcp connections while
running "tcpkill -9 -i $IFACE port 21064" while generating a lot of dlm
messages e.g. on a gfs2 mount with many files. My test setup shows that a
deadlock is "more" unlikely. Before this patch I wasn't able to get
not a deadlock after 5 seconds. After this patch my observation is
that it's more likely to survive after 5 seconds and more, but still a
deadlock occurs after certain time. My guess is that there are still
"segments" inside the tcp writequeue or retransmit queue which get dropped
when receiving a tcp reset [1]. Hard to reproduce because the right message
need to be inside these queues, which might even be in the 5 first seconds
with this patch.
[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/net/ipv4/tcp_input.c?h=v5.8-rc6#n4122
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
This patch doesn't close sockets when there is an invalid dlm message
received. The connection will probably reconnect anyway so. To not
close the connection will reduce the number of possible failtures.
As we don't have a different strategy to react on such scenario
just keep going the connection and ignore the message.
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
This patch adds support to set the skb mark value for the DLM tcp and
sctp socket per peer. The mark value will be offered as per comm value
of configfs. At creation time of the peer socket it will be set as
socket option.
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
This patch adds support to set the skb mark value for the DLM listen
tcp and sctp sockets. The mark value will be offered as cluster
configuration. At creation time of the listen socket it will be set as
socket option.
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
Currently the error return path from kobject_init_and_add() is not
followed by a call to kobject_put() - which means we are leaking
the kobject.
Set do_unreg = 1 before kobject_init_and_add() to ensure that
kobject_put() can be called in its error patch.
Fixes: 901195ed7f ("Kobject: change GFS2 to use kobject_init_and_add")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Wang Hai <wanghai38@huawei.com>
Signed-off-by: David Teigland <teigland@redhat.com>
The tear down path will always unaccount the memory, so ensure that we
have accounted it before hitting any of them.
Reported-by: Tomáš Chaloupka <chalucha@gmail.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If we hit an earlier error path in io_uring_create(), then we will have
accounted memory, but not set ctx->{sq,cq}_entries yet. Then when the
ring is torn down in error, we use those values to unaccount the memory.
Ensure we set the ctx entries before we're able to hit a potential error
path.
Cc: stable@vger.kernel.org
Reported-by: Tomáš Chaloupka <chalucha@gmail.com>
Tested-by: Tomáš Chaloupka <chalucha@gmail.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
cr=0 is supposed to be an optimization to save CPU cycles, but if
buddy data (in memory) is not initialized then all this makes no sense
as we have to do sync IO taking a lot of cycles. Also, at cr=0
mballoc doesn't choose any available chunk. cr=1 also skips groups
using heuristic based on avg. fragment size. It's more useful to skip
such groups and switch to cr=2 where groups will be scanned for
available chunks. However, we always read the first block group in a
flex_bg so metadata blocks will get read into the first flex_bg if
possible.
Using sparse image and dm-slow virtual device of 120TB was
simulated, then the image was formatted and filled using debugfs to
mark ~85% of available space as busy. mount process w/o the patch
couldn't complete in half an hour (according to vmstat it would take
~10-11 hours). With the patch applied mount took ~20 seconds.
Lustre-bug-id: https://jira.whamcloud.com/browse/LU-12988
Signed-off-by: Alex Zhuravlev <azhuravlev@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Artem Blagodarenko <artem.blagodarenko@gmail.com>
This should significantly improve bitmap loading, especially for flex
groups as it tries to load all bitmaps within a flex.group instead of
one by one synchronously.
Prefetching is done in 8 * flex_bg groups, so it should be 8
read-ahead reads for a single allocating thread. At the end of
allocation the thread waits for read-ahead completion and initializes
buddy information so that read-aheads are not lost in case of memory
pressure.
At cr=0 the number of prefetching IOs is limited per allocation
context to prevent a situation when mballoc loads thousands of bitmaps
looking for a perfect group and ignoring groups with good chunks.
Together with the patch "ext4: limit scanning of uninitialized groups"
the mount time (which includes few tiny allocations) of a 1PB
filesystem is reduced significantly:
0% full 50%-full unpatched patched
mount time 33s 9279s 563s
[ Restructured by tytso; removed the state flags in the allocation
context, so it can be used to lazily prefetch the allocation bitmaps
immediately after the file system is mounted. Skip prefetching
block groups which are uninitialized. Finally pass in the
REQ_RAHEAD flag to the block layer while prefetching. ]
Signed-off-by: Alex Zhuravlev <bzzz@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Don't define EXT4_IOC_* aliases to ioctls that already have a generic
FS_IOC_* name. These aliases are unnecessary, and they make it unclear
which ioctls are ext4-specific and which are generic.
Exception: leave EXT4_IOC_GETVERSION_OLD and EXT4_IOC_SETVERSION_OLD
as-is for now, since renaming them to FS_IOC_GETVERSION and
FS_IOC_SETVERSION would probably make them more likely to be confused
with EXT4_IOC_GETVERSION and EXT4_IOC_SETVERSION which also exist.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20200714230909.56349-1-ebiggers@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Define the EXT4_FL_USER_* constants by OR-ing together the appropriate
flags, rather than hard-coding a numeric value. This makes it much
easier to see which flags are listed.
No change in the actual values.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20200713031012.192440-1-ebiggers@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
A customer has reported a BUG_ON in ext4_clear_journal_err() hitting
during an LTP testing. Either this has been caused by a test setup
issue where the filesystem was being overwritten while LTP was mounting
it or the journal replay has overwritten the superblock with invalid
data. In either case it is preferable we don't take the machine down
with a BUG_ON. So handle the situation of unexpectedly missing
has_journal feature more gracefully. We issue warning and fail the mount
in the cases where the race window is narrow and the failed check is
most likely a programming error. In cases where fs corruption is more
likely, we do full ext4_error() handling before failing mount / remount.
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20200710140759.18031-1-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Since commit 378f32bab3 ("ext4: introduce direct I/O write using iomap
infrastructure") we don't properly bail out of RWF_NOWAIT direct IO
write if underlying blocks are not allocated. Also
ext4_dio_write_checks() does not honor RWF_NOWAIT when re-acquiring
i_rwsem. Fix both issues.
Fixes: 378f32bab3 ("ext4: introduce direct I/O write using iomap infrastructure")
Cc: stable@kernel.org
Reported-by: Filipe Manana <fdmanana@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/20200708153516.9507-1-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Rationale:
Reduces attack surface on kernel devs opening the links for MITM
as HTTPS traffic is much harder to manipulate.
Deterministic algorithm:
For each file:
If not .svg:
For each line:
If doesn't contain `\bxmlns\b`:
For each link, `\bhttp://[^# \t\r\n]*(?:\w|/)`:
If both the HTTP and HTTPS versions
return 200 OK and serve the same content:
Replace HTTP with HTTPS.
Signed-off-by: Alexander A. Klimov <grandmaster@al2klimov.de>
Link: https://lore.kernel.org/r/20200706190339.20709-1-grandmaster@al2klimov.de
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
If dquot_initialize() return non-zero and trace of ext4_unlink_enter/exit
enabled then the matching-pair of trace_exit will lost in log.
Signed-off-by: Yi Zhuang <zhuangyi1@huawei.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/20200629122621.129953-1-zhuangyi1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
It should call trace exit in all return path for ext4_truncate.
Signed-off-by: zhengliang <zhengliang6@huawei.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/20200701083027.45996-1-zhengliang6@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
jbd2_write_superblock() is under the buffer lock of journal superblock
before ending that superblock write, so add a missing unlock_buffer() in
in the error path before submitting buffer.
Fixes: 742b06b562 ("jbd2: check superblock mapped prior to committing")
Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Cc: stable@kernel.org
Link: https://lore.kernel.org/r/20200620061948.2049579-1-yi.zhang@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
If for any reason a directory passed to do_split() does not have enough
active entries to exceed half the size of the block, we can end up
iterating over all "count" entries without finding a split point.
In this case, count == move, and split will be zero, and we will
attempt a negative index into map[].
Guard against this by detecting this case, and falling back to
split-to-half-of-count instead; in this case we will still have
plenty of space (> half blocksize) in each split block.
Fixes: ef2b02d3e6 ("ext34: ensure do_split leaves enough free space in both blocks")
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/f53e246b-647c-64bb-16ec-135383c70ad7@redhat.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Callers of __jbd2_journal_unfile_buffer() and
__jbd2_journal_refile_buffer() assume that the b_transaction is set. In
fact if it's not, we can end up with journal_head refcounting errors
leading to crash much later that might be very hard to track down. Add
asserts to make sure that is the case.
We also make sure that b_next_transaction is NULL in
__jbd2_journal_unfile_buffer() since the callers expect that as well and
we should not get into that stage in this state anyway, leading to
problems later on if we do.
Tested with fstests.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20200617092549.6712-1-lczerner@redhat.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
The brelse() function tests whether its argument is NULL
and then returns immediately.
Thus remove the tests which are not needed around the shown calls.
This issue was detected by using the Coccinelle software.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/0d713702-072f-a89c-20ec-ca70aa83a432@web.de
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Pull networking updates from David Miller:
1) Support 6Ghz band in ath11k driver, from Rajkumar Manoharan.
2) Support UDP segmentation in code TSO code, from Eric Dumazet.
3) Allow flashing different flash images in cxgb4 driver, from Vishal
Kulkarni.
4) Add drop frames counter and flow status to tc flower offloading,
from Po Liu.
5) Support n-tuple filters in cxgb4, from Vishal Kulkarni.
6) Various new indirect call avoidance, from Eric Dumazet and Brian
Vazquez.
7) Fix BPF verifier failures on 32-bit pointer arithmetic, from
Yonghong Song.
8) Support querying and setting hardware address of a port function via
devlink, use this in mlx5, from Parav Pandit.
9) Support hw ipsec offload on bonding slaves, from Jarod Wilson.
10) Switch qca8k driver over to phylink, from Jonathan McDowell.
11) In bpftool, show list of processes holding BPF FD references to
maps, programs, links, and btf objects. From Andrii Nakryiko.
12) Several conversions over to generic power management, from Vaibhav
Gupta.
13) Add support for SO_KEEPALIVE et al. to bpf_setsockopt(), from Dmitry
Yakunin.
14) Various https url conversions, from Alexander A. Klimov.
15) Timestamping and PHC support for mscc PHY driver, from Antoine
Tenart.
16) Support bpf iterating over tcp and udp sockets, from Yonghong Song.
17) Support 5GBASE-T i40e NICs, from Aleksandr Loktionov.
18) Add kTLS RX HW offload support to mlx5e, from Tariq Toukan.
19) Fix the ->ndo_start_xmit() return type to be netdev_tx_t in several
drivers. From Luc Van Oostenryck.
20) XDP support for xen-netfront, from Denis Kirjanov.
21) Support receive buffer autotuning in MPTCP, from Florian Westphal.
22) Support EF100 chip in sfc driver, from Edward Cree.
23) Add XDP support to mvpp2 driver, from Matteo Croce.
24) Support MPTCP in sock_diag, from Paolo Abeni.
25) Commonize UDP tunnel offloading code by creating udp_tunnel_nic
infrastructure, from Jakub Kicinski.
26) Several pci_ --> dma_ API conversions, from Christophe JAILLET.
27) Add FLOW_ACTION_POLICE support to mlxsw, from Ido Schimmel.
28) Add SK_LOOKUP bpf program type, from Jakub Sitnicki.
29) Refactor a lot of networking socket option handling code in order to
avoid set_fs() calls, from Christoph Hellwig.
30) Add rfc4884 support to icmp code, from Willem de Bruijn.
31) Support TBF offload in dpaa2-eth driver, from Ioana Ciornei.
32) Support XDP_REDIRECT in qede driver, from Alexander Lobakin.
33) Support PCI relaxed ordering in mlx5 driver, from Aya Levin.
34) Support TCP syncookies in MPTCP, from Flowian Westphal.
35) Fix several tricky cases of PMTU handling wrt. briding, from Stefano
Brivio.
* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2056 commits)
net: thunderx: initialize VF's mailbox mutex before first usage
usb: hso: remove bogus check for EINPROGRESS
usb: hso: no complaint about kmalloc failure
hso: fix bailout in error case of probe
ip_tunnel_core: Fix build for archs without _HAVE_ARCH_IPV6_CSUM
selftests/net: relax cpu affinity requirement in msg_zerocopy test
mptcp: be careful on subflow creation
selftests: rtnetlink: make kci_test_encap() return sub-test result
selftests: rtnetlink: correct the final return value for the test
net: dsa: sja1105: use detected device id instead of DT one on mismatch
tipc: set ub->ifindex for local ipv6 address
ipv6: add ipv6_dev_find()
net: openvswitch: silence suspicious RCU usage warning
Revert "vxlan: fix tos value before xmit"
ptp: only allow phase values lower than 1 period
farsync: switch from 'pci_' to 'dma_' API
wan: wanxl: switch from 'pci_' to 'dma_' API
hv_netvsc: do not use VF device if link is down
dpaa2-eth: Fix passing zero to 'PTR_ERR' warning
net: macb: Properly handle phylink on at91sam9x
...
Here is the "big" set of changes to the driver core, and some drivers
using the changes, for 5.9-rc1.
"Biggest" thing in here is the device link exposure in sysfs, to help
to tame the madness that is SoC device tree representations and driver
interactions with it.
Other stuff in here that is interesting is:
- device probe log helper so that drivers can report problems in
a unified way easier.
- devres functions added
- DEVICE_ATTR_ADMIN_* macro added to make it harder to write
incorrect sysfs file permissions
- documentation cleanups
- ability for debugfs to be present in the kernel, yet not
exposed to userspace. Needed for systems that want it
enabled, but do not trust users, so they can still use some
kernel functions that were otherwise disabled.
- other minor fixes and cleanups
The patches outside of drivers/base/ all have acks from the respective
subsystem maintainers to go through this tree instead of theirs.
All of these have been in linux-next with no reported issues.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCXylhOQ8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+ylGdACeKqxm8IIDZycj0QjLUlPiEwVIROgAnjpf5jAB
mb4jMvgEGsB6/FwxypPG
=RUss
-----END PGP SIGNATURE-----
Merge tag 'driver-core-5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core updates from Greg KH:
"Here is the "big" set of changes to the driver core, and some drivers
using the changes, for 5.9-rc1.
"Biggest" thing in here is the device link exposure in sysfs, to help
to tame the madness that is SoC device tree representations and driver
interactions with it.
Other stuff in here that is interesting is:
- device probe log helper so that drivers can report problems in a
unified way easier.
- devres functions added
- DEVICE_ATTR_ADMIN_* macro added to make it harder to write
incorrect sysfs file permissions
- documentation cleanups
- ability for debugfs to be present in the kernel, yet not exposed to
userspace. Needed for systems that want it enabled, but do not
trust users, so they can still use some kernel functions that were
otherwise disabled.
- other minor fixes and cleanups
The patches outside of drivers/base/ all have acks from the respective
subsystem maintainers to go through this tree instead of theirs.
All of these have been in linux-next with no reported issues"
* tag 'driver-core-5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (39 commits)
drm/bridge: lvds-codec: simplify error handling
drm/bridge/sii8620: fix resource acquisition error handling
driver core: add deferring probe reason to devices_deferred property
driver core: add device probe log helper
driver core: Avoid binding drivers to dead devices
Revert "test_firmware: Test platform fw loading on non-EFI systems"
firmware_loader: EFI firmware loader must handle pre-allocated buffer
selftest/firmware: Add selftest timeout in settings
test_firmware: Test platform fw loading on non-EFI systems
driver core: Change delimiter in devlink device's name to "--"
debugfs: Add access restriction option
tracefs: Remove unnecessary debug_fs checks.
driver core: Fix probe_count imbalance in really_probe()
kobject: remove unused KOBJ_MAX action
driver core: Fix sleeping in invalid context during device link deletion
driver core: Add waiting_for_supplier sysfs file for devices
driver core: Add state_synced sysfs file for devices that support it
driver core: Expose device link details in sysfs
driver core: Drop mention of obsolete bus rwsem from kernel-doc
debugfs: file: Remove unnecessary cast in kfree()
...
Failing to invalid the page cache means data in incoherent, which is
a very bad state for the system. Always fall back to buffered I/O
through the page cache if we can't invalidate mappings.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Acked-by: Bob Peterson <rpeterso@redhat.com>
Acked-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Theodore Ts'o <tytso@mit.edu> # for ext4
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com> # for gfs2
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
This is what the classic fs/direct-io.c implementation and thuse other
file systems use.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The historic requirement for XFS to invalidate cached pages on
direct IO reads has been lost in the twisty pages of history - it was
inherited from Irix, which implemented page cache invalidation on
read as a method of working around problems synchronising page
cache state with uncached IO.
XFS has carried this ever since. In the initial linux ports it was
necessary to get mmap and DIO to play "ok" together and not
immediately corrupt data. This was the state of play until the linux
kernel had infrastructure to track unwritten extents and synchronise
page faults with allocations and unwritten extent conversions
(->page_mkwrite infrastructure). IOws, the page cache invalidation
on DIO read was necessary to prevent trivial data corruptions. This
didn't solve all the problems, though.
There were peformance problems if we didn't invalidate the entire
page cache over the file on read - we couldn't easily determine if
the cached pages were over the range of the IO, and invalidation
required taking a serialising lock (i_mutex) on the inode. This
serialising lock was an issue for XFS, as it was the only exclusive
lock in the direct Io read path.
Hence if there were any cached pages, we'd just invalidate the
entire file in one go so that subsequent IOs didn't need to take the
serialising lock. This was a problem that prevented ranged
invalidation from being particularly useful for avoiding the
remaining coherency issues. This was solved with the conversion of
i_mutex to i_rwsem and the conversion of the XFS inode IO lock to
use i_rwsem. Hence we could now just do ranged invalidation and the
performance problem went away.
However, page cache invalidation was still needed to serialise
sub-page/sub-block zeroing via direct IO against buffered IO because
bufferhead state attached to the cached page could get out of whack
when direct IOs were issued. We've removed bufferheads from the
XFS code, and we don't carry any extent state on the cached pages
anymore, and so this problem has gone away, too.
IOWs, it would appear that we don't have any good reason to be
invalidating the page cache on DIO reads anymore. Hence remove the
invalidation on read because it is unnecessary overhead,
not needed to maintain coherency between mmap/buffered access and
direct IO anymore, and prevents anyone from using direct IO reads
from intentionally invalidating the page cache of a file.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Delete repeated words in fs/xfs/.
{we, that, the, a, to, fork}
Change "it it" to "it is" in one location.
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
To: linux-fsdevel@vger.kernel.org
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: linux-xfs@vger.kernel.org
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Most session messages contain a feature mask, but the MDS will
routinely send a REJECT message with one that is zero-length.
Commit 0fa8263367 ("ceph: fix endianness bug when handling MDS
session feature bits") fixed the decoding of the feature mask,
but failed to account for the MDS sending a zero-length feature
mask. This causes REJECT message decoding to fail.
Skip trying to decode a feature mask if the word count is zero.
Cc: stable@vger.kernel.org
URL: https://tracker.ceph.com/issues/46823
Fixes: 0fa8263367 ("ceph: fix endianness bug when handling MDS session feature bits")
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Tested-by: Patrick Donnelly <pdonnell@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Ensure we correctly report the stateid and status in the layoutreturn on
close tracepoint.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
while to come. Changes include:
- Some new Chinese translations
- Progress on the battle against double words words and non-HTTPS URLs
- Some block-mq documentation
- More RST conversions from Mauro. At this point, that task is
essentially complete, so we shouldn't see this kind of churn again for a
while. Unless we decide to switch to asciidoc or something...:)
- Lots of typo fixes, warning fixes, and more.
-----BEGIN PGP SIGNATURE-----
iQFDBAABCAAtFiEEIw+MvkEiF49krdp9F0NaE2wMflgFAl8oVkwPHGNvcmJldEBs
d24ubmV0AAoJEBdDWhNsDH5YoW8H/jJ/xnXFn7tkgVPQAlL3k5HCnK7A5nDP9RVR
cg1pTx1cEFdjzxPlJyExU6/v+AImOvtweHXC+JDK7YcJ6XFUNYXJI3LxL5KwUXbY
BL/xRFszDSXH2C7SJF5GECcFYp01e/FWSLN3yWAh+g+XwsKiTJ8q9+CoIDkHfPGO
7oQsHKFu6s36Af0LfSgxk4sVB7EJbo8e4psuPsP5SUrl+oXRO43Put0rXkR4yJoH
9oOaB51Do5fZp8I4JVAqGXvpXoExyLMO4yw0mASm6YSZ3KyjR8Fae+HD9Cq4ZuwY
0uzb9K+9NEhqbfwtyBsi99S64/6Zo/MonwKwevZuhtsDTK4l4iU=
=JQLZ
-----END PGP SIGNATURE-----
Merge tag 'docs-5.9' of git://git.lwn.net/linux
Pull documentation updates from Jonathan Corbet:
"It's been a busy cycle for documentation - hopefully the busiest for a
while to come. Changes include:
- Some new Chinese translations
- Progress on the battle against double words words and non-HTTPS
URLs
- Some block-mq documentation
- More RST conversions from Mauro. At this point, that task is
essentially complete, so we shouldn't see this kind of churn again
for a while. Unless we decide to switch to asciidoc or
something...:)
- Lots of typo fixes, warning fixes, and more"
* tag 'docs-5.9' of git://git.lwn.net/linux: (195 commits)
scripts/kernel-doc: optionally treat warnings as errors
docs: ia64: correct typo
mailmap: add entry for <alobakin@marvell.com>
doc/zh_CN: add cpu-load Chinese version
Documentation/admin-guide: tainted-kernels: fix spelling mistake
MAINTAINERS: adjust kprobes.rst entry to new location
devices.txt: document rfkill allocation
PCI: correct flag name
docs: filesystems: vfs: correct flag name
docs: filesystems: vfs: correct sync_mode flag names
docs: path-lookup: markup fixes for emphasis
docs: path-lookup: more markup fixes
docs: path-lookup: fix HTML entity mojibake
CREDITS: Replace HTTP links with HTTPS ones
docs: process: Add an example for creating a fixes tag
doc/zh_CN: add Chinese translation prefer section
doc/zh_CN: add clearing-warn-once Chinese version
doc/zh_CN: add admin-guide index
doc:it_IT: process: coding-style.rst: Correct __maybe_unused compiler label
futex: MAINTAINERS: Re-add selftests directory
...
The NFS_CONTEXT_ERROR_WRITE flag (as well as the check of said flag) was
removed by commit 6fbda89b25. The absence of an error check allows
writes to be continually queued up for a server that may no longer be
able to handle them. Fix it by adding an error check using the generic
error reporting functions.
Fixes: 6fbda89b25 ("NFS: Replace custom error reporting mechanism with generic one")
Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Add a simple helper to grab a reference to a file and install it at
the next available fd, and switch the early init code over to it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCXygcpgAKCRCRxhvAZXjc
ogPeAQDv1ncqtNroFAC4pJ4tQhH7JSjW0OltiMk/AocY/J2SdQD9GJ15luYJ0/om
697q/Z68sndRynhdoZlMuf3oYuBlHQw=
=3ZhE
-----END PGP SIGNATURE-----
Merge tag 'close-range-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux
Pull close_range() implementation from Christian Brauner:
"This adds the close_range() syscall. It allows to efficiently close a
range of file descriptors up to all file descriptors of a calling
task.
This is coordinated with the FreeBSD folks which have copied our
version of this syscall and in the meantime have already merged it in
April 2019:
https://reviews.freebsd.org/D21627https://svnweb.freebsd.org/base?view=revision&revision=359836
The syscall originally came up in a discussion around the new mount
API and making new file descriptor types cloexec by default. During
this discussion, Al suggested the close_range() syscall.
First, it helps to close all file descriptors of an exec()ing task.
This can be done safely via (quoting Al's example from [1] verbatim):
/* that exec is sensitive */
unshare(CLONE_FILES);
/* we don't want anything past stderr here */
close_range(3, ~0U);
execve(....);
The code snippet above is one way of working around the problem that
file descriptors are not cloexec by default. This is aggravated by the
fact that we can't just switch them over without massively regressing
userspace. For a whole class of programs having an in-kernel method of
closing all file descriptors is very helpful (e.g. demons, service
managers, programming language standard libraries, container managers
etc.).
Second, it allows userspace to avoid implementing closing all file
descriptors by parsing through /proc/<pid>/fd/* and calling close() on
each file descriptor and other hacks. From looking at various
large(ish) userspace code bases this or similar patterns are very
common in service managers, container runtimes, and programming
language runtimes/standard libraries such as Python or Rust.
In addition, the syscall will also work for tasks that do not have
procfs mounted and on kernels that do not have procfs support compiled
in. In such situations the only way to make sure that all file
descriptors are closed is to call close() on each file descriptor up
to UINT_MAX or RLIMIT_NOFILE, OPEN_MAX trickery.
Based on Linus' suggestion close_range() also comes with a new flag
CLOSE_RANGE_UNSHARE to more elegantly handle file descriptor dropping
right before exec. This would usually be expressed in the sequence:
unshare(CLONE_FILES);
close_range(3, ~0U);
as pointed out by Linus it might be desirable to have this be a part
of close_range() itself under a new flag CLOSE_RANGE_UNSHARE which
gets especially handy when we're closing all file descriptors above a
certain threshold.
Test-suite as always included"
* tag 'close-range-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
tests: add CLOSE_RANGE_UNSHARE tests
close_range: add CLOSE_RANGE_UNSHARE
tests: add close_range() tests
arch: wire-up close_range()
open: add close_range()
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCXygegQAKCRCRxhvAZXjc
olWZAQCMPbhI/20LA3OYJ6s+BgBEnm89PymvlHcym6Z4AvTungD+KqZonIYuxWgi
6Ttlv/fzgFFbXgJgbuass5mwFVoN5wM=
=oK7d
-----END PGP SIGNATURE-----
Merge tag 'cap-checkpoint-restore-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux
Pull checkpoint-restore updates from Christian Brauner:
"This enables unprivileged checkpoint/restore of processes.
Given that this work has been going on for quite some time the first
sentence in this summary is hopefully more exciting than the actual
final code changes required. Unprivileged checkpoint/restore has seen
a frequent increase in interest over the last two years and has thus
been one of the main topics for the combined containers &
checkpoint/restore microconference since at least 2018 (cf. [1]).
Here are just the three most frequent use-cases that were brought forward:
- The JVM developers are integrating checkpoint/restore into a Java
VM to significantly decrease the startup time.
- In high-performance computing environment a resource manager will
typically be distributing jobs where users are always running as
non-root. Long-running and "large" processes with significant
startup times are supposed to be checkpointed and restored with
CRIU.
- Container migration as a non-root user.
In all of these scenarios it is either desirable or required to run
without CAP_SYS_ADMIN. The userspace implementation of
checkpoint/restore CRIU already has the pull request for supporting
unprivileged checkpoint/restore up (cf. [2]).
To enable unprivileged checkpoint/restore a new dedicated capability
CAP_CHECKPOINT_RESTORE is introduced. This solution has last been
discussed in 2019 in a talk by Google at Linux Plumbers (cf. [1]
"Update on Task Migration at Google Using CRIU") with Adrian and
Nicolas providing the implementation now over the last months. In
essence, this allows the CRIU binary to be installed with the
CAP_CHECKPOINT_RESTORE vfs capability set thereby enabling
unprivileged users to restore processes.
To make this possible the following permissions are altered:
- Selecting a specific PID via clone3() set_tid relaxed from userns
CAP_SYS_ADMIN to CAP_CHECKPOINT_RESTORE.
- Selecting a specific PID via /proc/sys/kernel/ns_last_pid relaxed
from userns CAP_SYS_ADMIN to CAP_CHECKPOINT_RESTORE.
- Accessing /proc/pid/map_files relaxed from init userns
CAP_SYS_ADMIN to init userns CAP_CHECKPOINT_RESTORE.
- Changing /proc/self/exe from userns CAP_SYS_ADMIN to userns
CAP_CHECKPOINT_RESTORE.
Of these four changes the /proc/self/exe change deserves a few words
because the reasoning behind even restricting /proc/self/exe changes
in the first place is just full of historical quirks and tracking this
down was a questionable version of fun that I'd like to spare others.
In short, it is trivial to change /proc/self/exe as an unprivileged
user, i.e. without userns CAP_SYS_ADMIN right now. Either via ptrace()
or by simply intercepting the elf loader in userspace during exec.
Nicolas was nice enough to even provide a POC for the latter (cf. [3])
to illustrate this fact.
The original patchset which introduced PR_SET_MM_MAP had no
permissions around changing the exe link. They too argued that it is
trivial to spoof the exe link already which is true. The argument
brought up against this was that the Tomoyo LSM uses the exe link in
tomoyo_manager() to detect whether the calling process is a policy
manager. This caused changing the exe links to be guarded by userns
CAP_SYS_ADMIN.
All in all this rather seems like a "better guard it with something
rather than nothing" argument which imho doesn't qualify as a great
security policy. Again, because spoofing the exe link is possible for
the calling process so even if this were security relevant it was
broken back then and would be broken today. So technically, dropping
all permissions around changing the exe link would probably be
possible and would send a clearer message to any userspace that relies
on /proc/self/exe for security reasons that they should stop doing
this but for now we're only relaxing the exe link permissions from
userns CAP_SYS_ADMIN to userns CAP_CHECKPOINT_RESTORE.
There's a final uapi change in here. Changing the exe link used to
accidently return EINVAL when the caller lacked the necessary
permissions instead of the more correct EPERM. This pr contains a
commit fixing this. I assume that userspace won't notice or care and
if they do I will revert this commit. But since we are changing the
permissions anyway it seems like a good opportunity to try this fix.
With these changes merged unprivileged checkpoint/restore will be
possible and has already been tested by various users"
[1] LPC 2018
1. "Task Migration at Google Using CRIU"
https://www.youtube.com/watch?v=yI_1cuhoDgA&t=12095
2. "Securely Migrating Untrusted Workloads with CRIU"
https://www.youtube.com/watch?v=yI_1cuhoDgA&t=14400
LPC 2019
1. "CRIU and the PID dance"
https://www.youtube.com/watch?v=LN2CUgp8deo&list=PLVsQ_xZBEyN30ZA3Pc9MZMFzdjwyz26dO&index=9&t=2m48s
2. "Update on Task Migration at Google Using CRIU"
https://www.youtube.com/watch?v=LN2CUgp8deo&list=PLVsQ_xZBEyN30ZA3Pc9MZMFzdjwyz26dO&index=9&t=1h2m8s
[2] https://github.com/checkpoint-restore/criu/pull/1155
[3] https://github.com/nviennot/run_as_exe
* tag 'cap-checkpoint-restore-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
selftests: add clone3() CAP_CHECKPOINT_RESTORE test
prctl: exe link permission error changed from -EINVAL to -EPERM
prctl: Allow local CAP_CHECKPOINT_RESTORE to change /proc/self/exe
proc: allow access in init userns for map_files with CAP_CHECKPOINT_RESTORE
pid_namespace: use checkpoint_restore_ns_capable() for ns_last_pid
pid: use checkpoint_restore_ns_capable() for set_tid
capabilities: Introduce CAP_CHECKPOINT_RESTORE
Pull execve updates from Eric Biederman:
"During the development of v5.7 I ran into bugs and quality of
implementation issues related to exec that could not be easily fixed
because of the way exec is implemented. So I have been diggin into
exec and cleaning up what I can.
This cycle I have been looking at different ideas and different
implementations to see what is possible to improve exec, and cleaning
the way exec interfaces with in kernel users. Only cleaning up the
interfaces of exec with rest of the kernel has managed to stabalize
and make it through review in time for v5.9-rc1 resulting in 2 sets of
changes this cycle.
- Implement kernel_execve
- Make the user mode driver code a better citizen
With kernel_execve the code size got a little larger as the copying of
parameters from userspace and copying of parameters from userspace is
now separate. The good news is kernel threads no longer need to play
games with set_fs to use exec. Which when combined with the rest of
Christophs set_fs changes should security bugs with set_fs much more
difficult"
* 'exec-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (23 commits)
exec: Implement kernel_execve
exec: Factor bprm_stack_limits out of prepare_arg_pages
exec: Factor bprm_execve out of do_execve_common
exec: Move bprm_mm_init into alloc_bprm
exec: Move initialization of bprm->filename into alloc_bprm
exec: Factor out alloc_bprm
exec: Remove unnecessary spaces from binfmts.h
umd: Stop using split_argv
umd: Remove exit_umh
bpfilter: Take advantage of the facilities of struct pid
exit: Factor thread_group_exited out of pidfd_poll
umd: Track user space drivers with struct pid
bpfilter: Move bpfilter_umh back into init data
exec: Remove do_execve_file
umh: Stop calling do_execve_file
umd: Transform fork_usermode_blob into fork_usermode_driver
umd: Rename umd_info.cmdline umd_info.driver_name
umd: For clarity rename umh_info umd_info
umh: Separate the user mode driver and the user mode helper support
umh: Remove call_usermodehelper_setup_file.
...
- Improved selftest coverage, timeouts, and reporting
- Add EPOLLHUP support for SECCOMP_RET_USER_NOTIF (Christian Brauner)
- Refactor __scm_install_fd() into __receive_fd() and fix buggy callers
- Introduce "addfd" command for SECCOMP_RET_USER_NOTIF (Sargun Dhillon)
-----BEGIN PGP SIGNATURE-----
iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAl8oZcQWHGtlZXNjb29r
QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJomDD/4x3j7eXREcXDsHOmlgEaHWGx4l
JldHFQhV5GjmD7gOkPcoZSG7NfG7F6VpwAJg7ZoR3qUkem7K8DFucxqgo1RldCot
nigleeLX6JeMS0Z+iwjAVZd+5t4xG4J/7GGDHIIMiG5qvwJ0Yf64o1bkjaB2Q/Bv
tluBg0WF32kFMG/ZwyY/V2QDbbue97CFPflybOh1o2nWbVzmUlFEEum3UUvZsxc8
smMsattJyuAV7kcEKzKrs8b010NdFZqwdbub5Np9W3XEXGBYMdIPoNsOQGmB9wby
j2ui0lzboXRG997jM7TCd1l/XZAv8aAwvPplw3FJRybzkOGs9NDyLMoz87yJpR1T
xp511vnMyMbyKIGdungkt7cIyzaictHwaYzznsmuNdCPEjTaIQJr1ctsa4GEgtqf
pnkktZ9YbMCcHU0CtZ8GlOVqA9wE+FUm0/u0zgikzJQsB+HcNItiARTTTHRyco7p
VJCqK8o4Zx4ELV7QNkSH4nhFkVgRopvrvBiPAGro/qwGOofBg8W8wM8O1+V/MDmp
zSU22v4SncT1Xb7dtmdJqDEeHfDikhaCAb4Je2hsGQWzbdAqwHGlpa7vpk9x3Q5r
L+XyP+Z+rPHlXYyypJwUvvOQhXOmP0zYxcEHxByqIBfXiwy+3dN4tDDfatWbccwl
uTlTDM8kmQn6QzSztA==
=yb55
-----END PGP SIGNATURE-----
Merge tag 'seccomp-v5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull seccomp updates from Kees Cook:
"There are a bunch of clean ups and selftest improvements along with
two major updates to the SECCOMP_RET_USER_NOTIF filter return:
EPOLLHUP support to more easily detect the death of a monitored
process, and being able to inject fds when intercepting syscalls that
expect an fd-opening side-effect (needed by both container folks and
Chrome). The latter continued the refactoring of __scm_install_fd()
started by Christoph, and in the process found and fixed a handful of
bugs in various callers.
- Improved selftest coverage, timeouts, and reporting
- Add EPOLLHUP support for SECCOMP_RET_USER_NOTIF (Christian Brauner)
- Refactor __scm_install_fd() into __receive_fd() and fix buggy
callers
- Introduce 'addfd' command for SECCOMP_RET_USER_NOTIF (Sargun
Dhillon)"
* tag 'seccomp-v5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (30 commits)
selftests/seccomp: Test SECCOMP_IOCTL_NOTIF_ADDFD
seccomp: Introduce addfd ioctl to seccomp user notifier
fs: Expand __receive_fd() to accept existing fd
pidfd: Replace open-coded receive_fd()
fs: Add receive_fd() wrapper for __receive_fd()
fs: Move __scm_install_fd() to __receive_fd()
net/scm: Regularize compat handling of scm_detach_fds()
pidfd: Add missing sock updates for pidfd_getfd()
net/compat: Add missing sock updates for SCM_RIGHTS
selftests/seccomp: Check ENOSYS under tracing
selftests/seccomp: Refactor to use fixture variants
selftests/harness: Clean up kern-doc for fixtures
seccomp: Use -1 marker for end of mode 1 syscall list
seccomp: Fix ioctl number for SECCOMP_IOCTL_NOTIF_ID_VALID
selftests/seccomp: Rename user_trap_syscall() to user_notif_syscall()
selftests/seccomp: Make kcmp() less required
seccomp: Use pr_fmt
selftests/seccomp: Improve calibration loop
selftests/seccomp: use 90s as timeout
selftests/seccomp: Expand benchmark to per-filter measurements
...
- Fix linking when crypto API disabled (Matteo Croce)
-----BEGIN PGP SIGNATURE-----
iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAl8oWy4WHGtlZXNjb29r
QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJtKnD/9k74pkcsl/xeB4KR4XRxmdKxwq
1wCfhrd7PRoUu1it7rrhGeAxFIyWfu+WTZKRqce5OwKgqx8BF4T1dpCuyOqHSMSz
O5UyRHiPl43EJHs7jvOVR7V5cjQx57SeaHxxtV/PGofNUTFqLVa0w9Pxh5Ma4nfT
R8j71qXOceHiwU/roHY+52vvwIMiixrgKmFfQb5klmoAQsUGXMiZYPkoelA7P4+S
M7OUL/GfLBFLH2IYbCEB4YBhX127PJIL74jIOpdvT/KsAFep4PCOD0a2qPH+FrF5
DVlF2BbGpbPg+uvFWu6gU6AZC/S+D7ZnV4cDVvg4ZNknJS8XEbZNIM1EgLPbKy0C
GzblUZdlA6KyEwIh9oAMiL9zjL3dianLq/mlSi8kKdiFmI2zNmwhQzW+xFgoEbRS
fgGG9wcAejM5X9YqHs2BO5TLvJ7qBLHzaQceEL2Z6ZNIm7rU4grX6HD7MD93SMOu
BA9O6tliYEDApSBNFLUKAlE8CGlwKjFdwgzbw7malm254uVQrCeICWVdo+caKFZr
JXjEgls2gYNM7oAME1MUksy5xzwqLjXmSWJXVCud+CjYaKoyAM5Po97sQr4aYXcu
F8zUdFGoy148IFi59xYZZKczLlyItCEr9OpCDA78V6MNiup0in4LhAERSPC8Ljw+
5LpiI0IIJFshcGhriA==
=gGdN
-----END PGP SIGNATURE-----
Merge tag 'pstore-v5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull pstore update from Kees Cook:
"A tiny pstore update which fixes a very corner-case build failure:
- Fix linking when crypto API disabled (Matteo Croce)"
* tag 'pstore-v5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
pstore: Fix linking when crypto API disabled
The variable ret is guaranteed to be 0 in this if (). So we can remove
this assignement.
Signed-off-by: Jing Xiangfeng <jingxiangfeng@huawei.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
When doing some tests with multiple mds, we were seeing many mds
forwarding requests between them, causing clients to resend.
If the request is a modification operation and the mode is set to
USE_AUTH_MDS, then the auth mds should be selected to handle the
request. If auth mds for frag is already set, then it should be returned
directly without further processing.
The current logic is wrong because it only returns directly if
mode is USE_AUTH_MDS, but we want to do that for all modes. If we don't,
then when the frag's mds is not equal to cap session's mds, the request
will get sent to the wrong MDS needlessly.
Drop the mode check in this condition.
Signed-off-by: Yanhu Cao <gmayyyha@gmail.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
When doing some testing recently, I hit some page allocation failures
on mount, when creating the wb_pagevec_pool for the mount. That
requires 128k (32 contiguous pages), and after thrashing the memory
during an xfstests run, sometimes that would fail.
128k for each mount seems like a lot to hold in reserve for a rainy
day, so let's change this to a global mempool that gets allocated
when the module is plugged in.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Symlink inodes should have the security context set in their xattrs on
creation. We already set the context on creation, but we don't attach
the pagelist. The effect is that symlink inodes don't get an SELinux
context set on them at creation, so they end up unlabeled instead of
inheriting the proper context. Make it do so.
Cc: stable@vger.kernel.org
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
- Make the Energy Model cover non-CPU devices (Lukasz Luba).
- Add Ice Lake server idle states table to the intel_idle driver
and eliminate a redundant static variable from it (Chen Yu,
Rafael Wysocki).
- Eliminate all W=1 build warnings from cpufreq (Lee Jones).
- Add support for Sapphire Rapids and for Power Limit 4 to the
Intel RAPL power capping driver (Sumeet Pawnikar, Zhang Rui).
- Fix function name in kerneldoc comments in the idle_inject power
capping driver (Yangtao Li).
- Fix locking issues with cpufreq governors and drop a redundant
"weak" function definition from cpufreq (Viresh Kumar).
- Rearrange cpufreq to register non-modular governors at the
core_initcall level and allow the default cpufreq governor to
be specified in the kernel command line (Quentin Perret).
- Extend, fix and clean up the intel_pstate driver (Srinivas
Pandruvada, Rafael Wysocki):
* Add a new sysfs attribute for disabling/enabling CPU
energy-efficiency optimizations in the processor.
* Make the driver avoid enabling HWP if EPP is not supported.
* Allow the driver to handle numeric EPP values in the sysfs
interface and fix the setting of EPP via sysfs in the active
mode.
* Eliminate a static checker warning and clean up a kerneldoc
comment.
- Clean up some variable declarations in the powernv cpufreq
driver (Wei Yongjun).
- Fix up the ->enter_s2idle callback definition to cover the case
when it points to the same function as ->idle correctly (Neal
Liu).
- Rearrange and clean up the PSCI cpuidle driver (Ulf Hansson).
- Make the PM core emit "changed" uevent when adding/removing the
"wakeup" sysfs attribute of devices (Abhishek Pandit-Subedi).
- Add a helper macro for declaring PM callbacks and use it in the
MMC jz4740 driver (Paul Cercueil).
- Fix white space in some places in the hibernate code and make the
system-wide PM code use "const char *" where appropriate (Xiang
Chen, Alexey Dobriyan).
- Add one more "unsafe" helper macro to the freezer to cover the NFS
use case (He Zhe).
- Change the language in the generic PM domains framework to use
parent/child terminology and clean up a typo and some comment
fromatting in that code (Kees Cook, Geert Uytterhoeven).
- Update the operating performance points OPP framework (Lukasz
Luba, Andrew-sh.Cheng, Valdis Kletnieks):
* Refactor dev_pm_opp_of_register_em() and update related drivers.
* Add a missing function export.
* Allow disabled OPPs in dev_pm_opp_get_freq().
- Update devfreq core and drivers (Chanwoo Choi, Lukasz Luba, Enric
Balletbo i Serra, Dmitry Osipenko, Kieran Bingham, Marc Zyngier):
* Add support for delayed timers to the devfreq core and make the
Samsung exynos5422-dmc driver use it.
* Unify sysfs interface to use "df-" as a prefix in instance names
consistently.
* Fix devfreq_summary debugfs node indentation.
* Add the rockchip,pmu phandle to the rk3399_dmc driver DT
bindings.
* List Dmitry Osipenko as the Tegra devfreq driver maintainer.
* Fix typos in the core devfreq code.
- Update the pm-graph utility to version 5.7 including a number of
fixes related to suspend-to-idle (Todd Brandt).
- Fix coccicheck errors and warnings in the cpupower utility (Shuah
Khan).
- Replace HTTP links with HTTPs ones in multiple places (Alexander
A. Klimov).
-----BEGIN PGP SIGNATURE-----
iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAl8oO24SHHJqd0Byand5
c29ja2kubmV0AAoJEILEb/54YlRx7ZQP/0lQ0yABnASnwomdOH6+K/m7rvc+e9FE
zx5pTDQswhU5tM7SQAIKqe0uSI+okF2UrBrT5onA16F+JUbnrbexJLazBPfVTTGF
AKpKEQ7Wh69Wz+Y6cQZjm1dTuRL+dlBJuBrzR2tLSnONPMMHuFcO3xd7lgE9UAxC
oGEf393taA6OqcUNRQIa2gqbq+k1qhKjeDucGkbOaoJ6CL0ZyWI+Tfw1WWaBBGv0
/2wBd6V513OH8WtQCW6H3YpHmhYW6OwL8w19KyGcjPRGJaeaIP4W/Ng7mkvgL5ZB
vZqg3XiufFV9uTe8W1NQaVv/NjlN256OteuK809aosTVjD0dhFkhBYg5TLu6HbQq
C/NciZ+78oLedWLT73EUfw3NyS+V0jk6X2EIlBUwNi0Qw1B1pCifGOCKzWFFe5cr
ci4xr4FG7dBkxScOxwFAU2s5TdPHLOkGkQtg4jZr0OYDrzkyLEdsnZEUjLPORo+0
6EBXGfTOSy2CBHcYswRtzJr/1pUTzj7oejhTAMCCuYW2r3VyQtnYcVjlehtp20if
6BfmGisk8nmtxlSm+/Y2FqKa4bNnSTMmr0UJQ+Rjp0tHs47QeucI0ORfZ5nPaBac
+ptvIjWmn3xejT/+oAehpH9066Iuy66vzHdnj7x5+WAsmYS8n8OFtlBFkYELmLJB
3xI5hIl7WtGo
=8cUO
-----END PGP SIGNATURE-----
Merge tag 'pm-5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management updates from Rafael Wysocki:
"The most significant change here is the extension of the Energy Model
to cover non-CPU devices (as well as CPUs) from Lukasz Luba.
There is also some new hardware support (Ice Lake server idle states
table for intel_idle, Sapphire Rapids and Power Limit 4 support in the
RAPL driver), some new functionality in the existing drivers (eg. a
new switch to disable/enable CPU energy-efficiency optimizations in
intel_pstate, delayed timers in devfreq), some assorted fixes (cpufreq
core, intel_pstate, intel_idle) and cleanups (eg. cpuidle-psci,
devfreq), including the elimination of W=1 build warnings from cpufreq
done by Lee Jones.
Specifics:
- Make the Energy Model cover non-CPU devices (Lukasz Luba).
- Add Ice Lake server idle states table to the intel_idle driver and
eliminate a redundant static variable from it (Chen Yu, Rafael
Wysocki).
- Eliminate all W=1 build warnings from cpufreq (Lee Jones).
- Add support for Sapphire Rapids and for Power Limit 4 to the Intel
RAPL power capping driver (Sumeet Pawnikar, Zhang Rui).
- Fix function name in kerneldoc comments in the idle_inject power
capping driver (Yangtao Li).
- Fix locking issues with cpufreq governors and drop a redundant
"weak" function definition from cpufreq (Viresh Kumar).
- Rearrange cpufreq to register non-modular governors at the
core_initcall level and allow the default cpufreq governor to be
specified in the kernel command line (Quentin Perret).
- Extend, fix and clean up the intel_pstate driver (Srinivas
Pandruvada, Rafael Wysocki):
* Add a new sysfs attribute for disabling/enabling CPU
energy-efficiency optimizations in the processor.
* Make the driver avoid enabling HWP if EPP is not supported.
* Allow the driver to handle numeric EPP values in the sysfs
interface and fix the setting of EPP via sysfs in the active
mode.
* Eliminate a static checker warning and clean up a kerneldoc
comment.
- Clean up some variable declarations in the powernv cpufreq driver
(Wei Yongjun).
- Fix up the ->enter_s2idle callback definition to cover the case
when it points to the same function as ->idle correctly (Neal Liu).
- Rearrange and clean up the PSCI cpuidle driver (Ulf Hansson).
- Make the PM core emit "changed" uevent when adding/removing the
"wakeup" sysfs attribute of devices (Abhishek Pandit-Subedi).
- Add a helper macro for declaring PM callbacks and use it in the MMC
jz4740 driver (Paul Cercueil).
- Fix white space in some places in the hibernate code and make the
system-wide PM code use "const char *" where appropriate (Xiang
Chen, Alexey Dobriyan).
- Add one more "unsafe" helper macro to the freezer to cover the NFS
use case (He Zhe).
- Change the language in the generic PM domains framework to use
parent/child terminology and clean up a typo and some comment
fromatting in that code (Kees Cook, Geert Uytterhoeven).
- Update the operating performance points OPP framework (Lukasz Luba,
Andrew-sh.Cheng, Valdis Kletnieks):
* Refactor dev_pm_opp_of_register_em() and update related drivers.
* Add a missing function export.
* Allow disabled OPPs in dev_pm_opp_get_freq().
- Update devfreq core and drivers (Chanwoo Choi, Lukasz Luba, Enric
Balletbo i Serra, Dmitry Osipenko, Kieran Bingham, Marc Zyngier):
* Add support for delayed timers to the devfreq core and make the
Samsung exynos5422-dmc driver use it.
* Unify sysfs interface to use "df-" as a prefix in instance
names consistently.
* Fix devfreq_summary debugfs node indentation.
* Add the rockchip,pmu phandle to the rk3399_dmc driver DT
bindings.
* List Dmitry Osipenko as the Tegra devfreq driver maintainer.
* Fix typos in the core devfreq code.
- Update the pm-graph utility to version 5.7 including a number of
fixes related to suspend-to-idle (Todd Brandt).
- Fix coccicheck errors and warnings in the cpupower utility (Shuah
Khan).
- Replace HTTP links with HTTPs ones in multiple places (Alexander A.
Klimov)"
* tag 'pm-5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (71 commits)
cpuidle: ACPI: fix 'return' with no value build warning
cpufreq: intel_pstate: Fix EPP setting via sysfs in active mode
cpufreq: intel_pstate: Rearrange the storing of new EPP values
intel_idle: Customize IceLake server support
PM / devfreq: Fix the wrong end with semicolon
PM / devfreq: Fix indentaion of devfreq_summary debugfs node
PM / devfreq: Clean up the devfreq instance name in sysfs attr
memory: samsung: exynos5422-dmc: Add module param to control IRQ mode
memory: samsung: exynos5422-dmc: Adjust polling interval and uptreshold
memory: samsung: exynos5422-dmc: Use delayed timer as default
PM / devfreq: Add support delayed timer for polling mode
dt-bindings: devfreq: rk3399_dmc: Add rockchip,pmu phandle
PM / devfreq: tegra: Add Dmitry as a maintainer
PM / devfreq: event: Fix trivial spelling
PM / devfreq: rk3399_dmc: Fix kernel oops when rockchip,pmu is absent
cpuidle: change enter_s2idle() prototype
cpuidle: psci: Prevent domain idlestates until consumers are ready
cpuidle: psci: Convert PM domain to platform driver
cpuidle: psci: Fix error path via converting to a platform driver
cpuidle: psci: Fail cpuidle registration if set OSI mode failed
...
Daniel Borkmann says:
====================
pull-request: bpf-next 2020-08-04
The following pull-request contains BPF updates for your *net-next* tree.
We've added 73 non-merge commits during the last 9 day(s) which contain
a total of 135 files changed, 4603 insertions(+), 1013 deletions(-).
The main changes are:
1) Implement bpf_link support for XDP. Also add LINK_DETACH operation for the BPF
syscall allowing processes with BPF link FD to force-detach, from Andrii Nakryiko.
2) Add BPF iterator for map elements and to iterate all BPF programs for efficient
in-kernel inspection, from Yonghong Song and Alexei Starovoitov.
3) Separate bpf_get_{stack,stackid}() helpers for perf events in BPF to avoid
unwinder errors, from Song Liu.
4) Allow cgroup local storage map to be shared between programs on the same
cgroup. Also extend BPF selftests with coverage, from YiFei Zhu.
5) Add BPF exception tables to ARM64 JIT in order to be able to JIT BPF_PROBE_MEM
load instructions, from Jean-Philippe Brucker.
6) Follow-up fixes on BPF socket lookup in combination with reuseport group
handling. Also add related BPF selftests, from Jakub Sitnicki.
7) Allow to use socket storage in BPF_PROG_TYPE_CGROUP_SOCK-typed programs for
socket create/release as well as bind functions, from Stanislav Fomichev.
8) Fix an info leak in xsk_getsockopt() when retrieving XDP stats via old struct
xdp_statistics, from Peilin Ye.
9) Fix PT_REGS_RC{,_CORE}() macros in libbpf for MIPS arch, from Jerry Crunchtime.
10) Extend BPF kernel test infra with skb->family and skb->{local,remote}_ip{4,6}
fields and allow user space to specify skb->dev via ifindex, from Dmitry Yakunin.
11) Fix a bpftool segfault due to missing program type name and make it more robust
to prevent them in future gaps, from Quentin Monnet.
12) Consolidate cgroup helper functions across selftests and fix a v6 localhost
resolver issue, from John Fastabend.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Current judgment condition of f2fs_bug_on in function update_sit_entry():
new_vblocks >> (sizeof(unsigned short) << 3) ||
new_vblocks > sbi->blocks_per_seg
which equivalents to:
new_vblocks < 0 || new_vblocks > sbi->blocks_per_seg
The latter is more intuitive.
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Reported-by: Jack Qiu <jack.qiu@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Since set/clear_inode_flag() don't need to return value to show
if flag is set, we can just call set/clear_bit() here.
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
When we use F2FS_IOC_RELEASE_COMPRESS_BLOCKS ioctl, if we can't find
any compressed blocks in the file even with large file size, the
ioctl just ends up without changing the file's status as immutable.
It makes the user, who expects that the file is immutable when it
returns successfully, confused.
Signed-off-by: Daeho Jeong <daehojeong@google.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The retry based logic here isn't easy to follow unless you're already
familiar with how io_uring does task_work based retries. Add some
comments explaining the flow a little better.
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Since we don't do exclusive waits or wakeups, we know that the bit is
always going to be set. Kill the test. Also see commit:
2a9127fcf2 ("mm: rewrite wait_on_page_bit_common() logic")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl8m7asQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgplrCD/0S17kio+k4cOJDGwl88WoJw+QiYmM5019k
decZ1JymQvV1HXRmlcZiEAu0hHDD0FoovSRrw7II3gw3GouETmYQM62f6ZTpDeMD
CED/fidnfULAkPaI6h+bj3jyI0cEuujG/R47rGSQEkIIr3RttqKZUzVkB9KN+KMw
+OBuXZtMIoFFEVJ91qwC2dm2qHLqOn1/5MlT59knso/xbPOYOXsFQpGiACJqF97x
6qSSI8uGE+HZqvL2OLWPDBbLEJhrq+dzCgxln5VlvLele4UcRhOdonUb7nUwEKCe
zwvtXzz16u1D1b8bJL4Kg5bGqyUAQUCSShsfBJJxh6vTTULiHyCX5sQaai1OEB16
4dpBL9E+nOUUix4wo9XBY0/KIYaPWg5L1CoEwkAXqkXPhFvNUucsC0u6KvmzZR3V
1OogVTjl6GhS8uEVQjTKNshkTIC9QHEMXDUOHtINDCb/sLU+ANXU5UpvsuzZ9+kt
KGc4mdyCwaKBq4YW9sVwhhq/RHLD4AUtWZiUVfOE+0cltCLJUNMbQsJ+XrcYaQnm
W4zz22Rep+SJuQNVcCW/w7N2zN3yB6gC1qeroSLvzw4b5el2TdFp+BcgVlLHK+uh
xjsGNCq++fyzNk7vvMZ5hVq4JGXYjza7AiP5HlQ8nqdiPUKUPatWCBqUm9i9Cz/B
n+0dlYbRwQ==
=2vmy
-----END PGP SIGNATURE-----
Merge tag 'for-5.9/io_uring-20200802' of git://git.kernel.dk/linux-block
Pull io_uring updates from Jens Axboe:
"Lots of cleanups in here, hardening the code and/or making it easier
to read and fixing bugs, but a core feature/change too adding support
for real async buffered reads. With the latter in place, we just need
buffered write async support and we're done relying on kthreads for
the fast path. In detail:
- Cleanup how memory accounting is done on ring setup/free (Bijan)
- sq array offset calculation fixup (Dmitry)
- Consistently handle blocking off O_DIRECT submission path (me)
- Support proper async buffered reads, instead of relying on kthread
offload for that. This uses the page waitqueue to drive retries
from task_work, like we handle poll based retry. (me)
- IO completion optimizations (me)
- Fix race with accounting and ring fd install (me)
- Support EPOLLEXCLUSIVE (Jiufei)
- Get rid of the io_kiocb unionizing, made possible by shrinking
other bits (Pavel)
- Completion side cleanups (Pavel)
- Cleanup REQ_F_ flags handling, and kill off many of them (Pavel)
- Request environment grabbing cleanups (Pavel)
- File and socket read/write cleanups (Pavel)
- Improve kiocb_set_rw_flags() (Pavel)
- Tons of fixes and cleanups (Pavel)
- IORING_SQ_NEED_WAKEUP clear fix (Xiaoguang)"
* tag 'for-5.9/io_uring-20200802' of git://git.kernel.dk/linux-block: (127 commits)
io_uring: flip if handling after io_setup_async_rw
fs: optimise kiocb_set_rw_flags()
io_uring: don't touch 'ctx' after installing file descriptor
io_uring: get rid of atomic FAA for cq_timeouts
io_uring: consolidate *_check_overflow accounting
io_uring: fix stalled deferred requests
io_uring: fix racy overflow count reporting
io_uring: deduplicate __io_complete_rw()
io_uring: de-unionise io_kiocb
io-wq: update hash bits
io_uring: fix missing io_queue_linked_timeout()
io_uring: mark ->work uninitialised after cleanup
io_uring: deduplicate io_grab_files() calls
io_uring: don't do opcode prep twice
io_uring: clear IORING_SQ_NEED_WAKEUP after executing task works
io_uring: batch put_task_struct()
tasks: add put_task_struct_many()
io_uring: return locked and pinned page accounting
io_uring: don't miscount pinned memory
io_uring: don't open-code recv kbuf managment
...
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl8m7YwQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpt+dEAC7a0HYuX2OrkyawBnsgd1QQR/soC7surec
yDDa7SMM8cOq3935bfzcYHV9FWJszEGIknchiGb9R3/T+vmSohbvDsM5zgwya9u/
FHUIuTq324I6JWXKl30k4rwjiX9wQeMt+WZ5gC8KJYCWA296i2IpJwd0A45aaKuS
x4bTjxqknE+fD4gQiMUSt+bmuOUAp81fEku3EPapCRYDPAj8f5uoY7R2arT/POwB
b+s+AtXqzBymIqx1z0sZ/XcdZKmDuhdurGCWu7BfJFIzw5kQ2Qe3W8rUmrQ3pGut
8a21YfilhUFiBv+B4wptfrzJuzU6Ps0BXHCnBsQjzvXwq5uFcZH495mM/4E4OJvh
SbjL2K4iFj+O1ngFkukG/F8tdEM1zKBYy2ZEkGoWKUpyQanbAaGI6QKKJA+DCdBi
yPEb7yRAa5KfLqMiocm1qCEO1I56HRiNHaJVMqCPOZxLmpXj19Fs71yIRplP1Trv
GGXdWZsccjuY6OljoXWdEfnxAr5zBsO3Yf2yFT95AD+egtGsU1oOzlqAaU1mtflw
ABo452pvh6FFpxGXqz6oK4VqY4Et7WgXOiljA4yIGoPpG/08L1Yle4eVc2EE01Jb
+BL49xNJVeUhGFrvUjPGl9kVMeLmubPFbmgrtipW+VRg9W8+Yirw7DPP6K+gbPAR
RzAUdZFbWw==
=abJG
-----END PGP SIGNATURE-----
Merge tag 'for-5.9/block-20200802' of git://git.kernel.dk/linux-block
Pull core block updates from Jens Axboe:
"Good amount of cleanups and tech debt removals in here, and as a
result, the diffstat shows a nice net reduction in code.
- Softirq completion cleanups (Christoph)
- Stop using ->queuedata (Christoph)
- Cleanup bd claiming (Christoph)
- Use check_events, moving away from the legacy media change
(Christoph)
- Use inode i_blkbits consistently (Christoph)
- Remove old unused writeback congestion bits (Christoph)
- Cleanup/unify submission path (Christoph)
- Use bio_uninit consistently, instead of bio_disassociate_blkg
(Christoph)
- sbitmap cleared bits handling (John)
- Request merging blktrace event addition (Jan)
- sysfs add/remove race fixes (Luis)
- blk-mq tag fixes/optimizations (Ming)
- Duplicate words in comments (Randy)
- Flush deferral cleanup (Yufen)
- IO context locking/retry fixes (John)
- struct_size() usage (Gustavo)
- blk-iocost fixes (Chengming)
- blk-cgroup IO stats fixes (Boris)
- Various little fixes"
* tag 'for-5.9/block-20200802' of git://git.kernel.dk/linux-block: (135 commits)
block: blk-timeout: delete duplicated word
block: blk-mq-sched: delete duplicated word
block: blk-mq: delete duplicated word
block: genhd: delete duplicated words
block: elevator: delete duplicated word and fix typos
block: bio: delete duplicated words
block: bfq-iosched: fix duplicated word
iocost_monitor: start from the oldest usage index
iocost: Fix check condition of iocg abs_vdebt
block: Remove callback typedefs for blk_mq_ops
block: Use non _rcu version of list functions for tag_set_list
blk-cgroup: show global disk stats in root cgroup io.stat
blk-cgroup: make iostat functions visible to stat printing
block: improve discard bio alignment in __blkdev_issue_discard()
block: change REQ_OP_ZONE_RESET and REQ_OP_ZONE_RESET_ALL to be odd numbers
block: defer flush request no matter whether we have elevator
block: make blk_timeout_init() static
block: remove retry loop in ioc_release_fn()
block: remove unnecessary ioc nested locking
block: integrate bd_start_claiming into __blkdev_get
...
Instead of waiting in a loop for the userfaultfd condition to become
true, just wait once and return VM_FAULT_RETRY.
We've already dropped the mmap lock, we know we can't really
successfully handle the fault at this point and the caller will have to
retry anyway. So there's no point in making the wait any more
complicated than it needs to be - just schedule away.
And once you don't have that complexity with explicit looping, you can
also just lose all the 'userfaultfd_signal_pending()' complexity,
because once we've set the correct process sleeping state, and don't
loop, the act of scheduling itself will be checking if there are any
pending signals before going to sleep.
We can also drop the VM_FAULT_MAJOR games, since we'll be treating all
retried faults as major soon anyway (series to regularize and share more
of fault handling across architectures in a separate series by Peter Xu,
and in the meantime we won't worry about the possible minor - I'll be
here all week, try the veal - accounting difference).
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-----BEGIN PGP SIGNATURE-----
iQJHBAABCAAxFiEES8DXskRxsqGE6vXTAA5oQRlWghUFAl8oDgYTHGpsYXl0b25A
a2VybmVsLm9yZwAKCRAADmhBGVaCFaA9D/9HzjmL8/17DdCiFFucl9fgyIUUIlqZ
mSM9RslHQuaOAM5c5RbtbifRZbh5H/pIm930at+JxFcZBN51iwB7xAc8MYEelxIy
9i3hwZJP2mmqum3GTD4QtUcoirzjmYvGffThq9Cb/XuUaXd6S/PZZPZVVk4bChIA
TDwday9Us+5Qz+NddnDPtkZbjv/edYS+gXh5NItODiV/B38yCiRVW36vazdWhZf9
UMRz7YpUT4xijjFd06rQZb6otJSAnP9BEi/4ihYAjsPuf8aot85vLfKD9CzkdLpd
+LbBkaXfoM6pb7C2QFx1PlBB4DeTkYzR7n89kp9poy/F35SyAEvj3zf12AceVG1a
4AbyVhFz6tNea5PLKBhswvGT0Kq0LfDJh6SnH03dqgcU7LQm20OMBT7ImWb3I1/3
1TMe44auGy4Ap1XgkPNq6xMNteX/XIUJIvKJ1g0sYyLppc2jLRnyH+n+aJCFyFQo
ghDKFRUYlmsYZJmzzV17rZjfnqewrlyHf6BcA1aq7C7GbdSJ8eMmxH+UaU3AgRES
Jy693Vd7XTOFPUwOGzHRKRxQ9cFQloTQxSKF6xcigBcKZE1xVZGarR8s4mRlsIU9
oqx50d37nVRVbLtC0OK2ZwD6hvtt9z4v0xM8ahF9n0XDkxnAwi7Hs3XhAvArUPnF
QLPVFaBbWDxwMQ==
=7CeF
-----END PGP SIGNATURE-----
Merge tag 'filelock-v5.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux
Pull file locking fix from Jeff Layton:
"Just a single, one-line patch to fix an inefficiency in the posix
locking code that can lead to it doing more wakeups than necessary"
* tag 'filelock-v5.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
locks: add locks_move_blocks in posix_lock_inode
If CONFIG_F2FS_FS_COMPRESSION is off, don't allow to configure or
show compression related mount option.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In f2fs_read_multi_pages(), we don't have to check cluster's type
again, since overwrite or partial truncation need page lock in
cluster which has already been held by reader, so cluster's type
is stable, let's change check condition to sanity check.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Because fsverity_descriptor_location.version is constant,
so use macro for better reading.
Signed-off-by: Jack Qiu <jack.qiu@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Function parameter mode could be TRANS_DIR_INO.
Signed-off-by: Jack Qiu <jack.qiu@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
One fix for fs/verity/ to strengthen a memory barrier which might be too
weak. This mirrors a similar fix in fs/crypto/.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQSacvsUNc7UX4ntmEPzXCl4vpKOKwUCXyezbRQcZWJpZ2dlcnNA
Z29vZ2xlLmNvbQAKCRDzXCl4vpKOK3geAQCT35f0xoQkOGLZVqHqlymI1otozKGP
N+arximQuWK2WAD/cKgth+/mJUBE2Ygcfef7hnFYD3maK2P6pzW1Q+GREAc=
=FeLN
-----END PGP SIGNATURE-----
Merge tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt
Pull fsverity update from Eric Biggers:
"One fix for fs/verity/ to strengthen a memory barrier which might be
too weak. This mirrors a similar fix in fs/crypto/"
* tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt:
fs-verity: use smp_load_acquire() for ->i_verity_info
This release, we add support for inline encryption via the blk-crypto
framework which was added in 5.8. Now when an ext4 or f2fs filesystem
is mounted with '-o inlinecrypt', the contents of encrypted files will
be encrypted/decrypted via blk-crypto, instead of directly using the
crypto API. This model allows taking advantage of the inline encryption
hardware that is integrated into the UFS or eMMC host controllers on
most mobile SoCs. Note that this is just an alternate implementation;
the ciphertext written to disk stays the same.
(This pull request does *not* include support for direct I/O on
encrypted files, which blk-crypto makes possible, since that part is
still being discussed.)
Besides the above feature update, there are also a few fixes and
cleanups, e.g. strengthening some memory barriers that may be too weak.
All these patches have been in linux-next with no reported issues. I've
also tested them with the fscrypt xfstests, as usual. It's also been
tested that the inline encryption support works with the support for
Qualcomm and Mediatek inline encryption hardware that will be in the
scsi pull request for 5.9. Also, several SoC vendors are already using
a previous, functionally equivalent version of these patches.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQSacvsUNc7UX4ntmEPzXCl4vpKOKwUCXye2EBQcZWJpZ2dlcnNA
Z29vZ2xlLmNvbQAKCRDzXCl4vpKOK0veAQCKEnwvy+M6s2/QWhC9vo01rABMtt7h
VRAAKPiFzLNH3AD/dCnZNsFUzk3x0ZyiU1YRW3FvlxFOaEO7Ea0Pt/pyyQ0=
=g9FK
-----END PGP SIGNATURE-----
Merge tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt
Pull fscrypt updates from Eric Biggers:
"This release, we add support for inline encryption via the blk-crypto
framework which was added in 5.8.
Now when an ext4 or f2fs filesystem is mounted with '-o inlinecrypt',
the contents of encrypted files will be encrypted/decrypted via
blk-crypto, instead of directly using the crypto API. This model
allows taking advantage of the inline encryption hardware that is
integrated into the UFS or eMMC host controllers on most mobile SoCs.
Note that this is just an alternate implementation; the ciphertext
written to disk stays the same.
(This pull request does *not* include support for direct I/O on
encrypted files, which blk-crypto makes possible, since that part is
still being discussed.)
Besides the above feature update, there are also a few fixes and
cleanups, e.g. strengthening some memory barriers that may be too
weak.
All these patches have been in linux-next with no reported issues.
I've also tested them with the fscrypt xfstests, as usual. It's also
been tested that the inline encryption support works with the support
for Qualcomm and Mediatek inline encryption hardware that will be in
the scsi pull request for 5.9. Also, several SoC vendors are already
using a previous, functionally equivalent version of these patches"
* tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt:
fscrypt: don't load ->i_crypt_info before it's known to be valid
fscrypt: document inline encryption support
fscrypt: use smp_load_acquire() for ->i_crypt_info
fscrypt: use smp_load_acquire() for ->s_master_keys
fscrypt: use smp_load_acquire() for fscrypt_prepared_key
fscrypt: switch fscrypt_do_sha256() to use the SHA-256 library
fscrypt: restrict IV_INO_LBLK_* to AES-256-XTS
fscrypt: rename FS_KEY_DERIVATION_NONCE_SIZE
fscrypt: add comments that describe the HKDF info strings
ext4: add inline encryption support
f2fs: add inline encryption support
fscrypt: add inline encryption support
fs: introduce SB_INLINECRYPT
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl8jAKwACgkQxWXV+ddt
WDtFvQ/7BIMM6cn+k/LoiK6cTTpq9DKTMoK64XzXJsOiY4ey6pXE0iSyVyn3rC6k
C+wAafdd7UPGPnI5z7L1lJOI7cE/X3PmADmAWB6WhARp19B2SmKfkF+jFAr+T4dE
OZw5lNqHSGv/aByBq8qegrAhWjpRR3VZtCCGW5KvN/strx7MC7t9wFZAB0zIsdKX
aK37VKYhoc+MOF1ikUDn4lRSIjqQYJetjvgC6Yt9dLfx+5oLOK8tpm1XkifN/1xs
HrRR9EpDTKlfJFDee1O+0gof6cKWTqFsbup1EFTrDbkA11zx8r6itBGY5G8P3zMh
JCsVOOJeDLecp1cz1ZWFpyBgrEAN7uHTY0hZbCZgN/dKbSKmv51iujdXB+dDOtxF
cSPywc0NxmftvBbweInwBfsA54BHI0XxCCA0U1yA8xgxPmBE15t81b7F56zmCRke
mSJxAP1dcX8gmL3mzEOUUuKkVbFJ0lIMi2YVkM1lud8Vn4xaWU9HzXlzEvkh7At0
tqlb+LHzaxxVU2m6/6W/KEuiXW1S7/q4nX87wvyMLnylHAaSlA+UtAp3t1q92rdJ
3VGzyvbgBRT2H+22DgCkrPTRlhOifeeuXT3nOwehY4AVkENYQrENb7FmqvppCEtl
v7yTBxxe4zPEjc8dm7o9RBYaVESVFXVQtpCHwz0D+p+adzIYmVM=
=HNGC
-----END PGP SIGNATURE-----
Merge tag 'for-5.9-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"We don't have any big feature updates this time, there are lots of
small enhacements or fixes. A highlight perhaps is the parallel fsync
performance improvements, numbers below.
Regarding the dio/iomap that was reverted last time, the required API
changes are likely to land in the upcoming cycle, the btrfs part will
be updated afterwards.
User visible changes:
- new mount option rescue= to group all recovery-related mount
options so we don't have many specific options, currently
introducing only aliases for existing options, future extensions
are in development to allow read-only mount with partially damaged
structures:
- usebackuproot is an alias for rescue=usebackuproot
- nologreplay is an alias for rescue=nologreplay
- start deprecation of mount option inode_cache, removal scheduled to
v5.11
- removed deprecated mount options alloc_start and subvolrootid
- device stats corruption counter gets incremented when a checksum
mismatch is found
- qgroup information exported in /sys/fs/btrfs/<UUID>/qgroups/<id>
using sysfs
- add link /sys/fs/btrfs/<UUID>/bdi pointing to the associated
backing dev info
- FS_INFO ioctl enhancements:
- add flags to request/describe newly added items
- new item: numeric checksum type and checksum size
- new item: generation
- new item: metadata_uuid
- seed device: with one new read-write device added, print the new
device information in /proc/mounts
- balance: detect cancellation by Ctrl-C in existing cancellation
points
Performance improvements:
- optimized versions of various helpers on little-endian
architectures, where we don't have to do LE/BE conversion from
on-disk format
- tree-log/fsync optimizations leading to lower max latency reported
by dbench, reduced by about 12%
- all chunk tree leaves are prefetched at mount time, can improve
mount time on large (terabyte-sized) filesystems
- speed up parallel fsync of files with reflinked/deduped extents,
with jobs 16 to 1024 the throughput gets improved roughly by 50% on
average and runtime decreased roughly by 30% on average, notable
outlier is 128 jobs with +121.2% on throughput and -54.6% runtime
- another speed up of parallel fsync, reduce number of checksum tree
lookups and contention, the improvements start to show up with 2
tasks with +20% throughput and -16% runtime up to 64 with +200%
throughput and -66% runtime
Core:
- umount-time qgroup leak checker
- qgroups
- add a way to unreserve partial range after failure, avoiding
some EDQUOT errors
- improved flushing logic when EDQUOT is hit
- possible EINTR interruption caused by failed reservations after
transaction start is better handled and documented
- transaction abort errors are unified to EROFS in case it's not the
original reason of abort or we don't have other way to determine
the reason
Fixes:
- make truncate succeed on a NOCOW file even if data space is
exhausted
- fix cancelling balance on filesystem with exhausted metadata space
- anon block device:
- preallocate anon bdev when subvolume is created to report
failure early
- shorten time the anon bdev id is allocated
- don't allocate anon bdev for internal roots
- minor memory leak in ref-verify
- refuse invalid combinations of compression and NOCOW file flags
- lockdep fixes, updating the device locks
- remove obsolete fallback logic for block group profile adjustments
when switching from 1 to more devices, causing allocation of
unwanted block groups
Other cleanups, refactoring, simplifications:
- conversions from struct inode to struct btrfs_inode in internal
functions
- removal of unused struct members"
* tag 'for-5.9-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (151 commits)
btrfs: do not set the full sync flag on the inode during page release
btrfs: release old extent maps during page release
btrfs: fix race between page release and a fast fsync
btrfs: open-code remount flag setting in btrfs_remount
btrfs: if we're restriping, use the target restripe profile
btrfs: don't adjust bg flags and use default allocation profiles
btrfs: fix lockdep splat from btrfs_dump_space_info
btrfs: move the chunk_mutex in btrfs_read_chunk_tree
btrfs: open device without device_list_mutex
btrfs: sysfs: use NOFS for device creation
btrfs: return EROFS for BTRFS_FS_STATE_ERROR cases
btrfs: document special case error codes for fs errors
btrfs: don't WARN if we abort a transaction with EROFS
btrfs: reduce contention on log trees when logging checksums
btrfs: remove done label in writepage_delalloc
btrfs: add comments for btrfs_reserve_flush_enum
btrfs: relocation: review the call sites which can be interrupted by signal
btrfs: avoid possible signal interruption of btrfs_drop_snapshot() on relocation tree
btrfs: relocation: allow signal to cancel balance
btrfs: raid56: remove out label in __raid56_parity_recover
...
It's expected that erofs_workgroup_unfreeze_final() won't
be used in other places. Let's fold it to simplify the code.
Link: https://lore.kernel.org/r/20200729180235.25443-1-hsiangkao@redhat.com
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
Each ondisk inode should be aligned with inode slot boundary
(32-byte alignment) because of nid calculation formula, so all
compact inodes (32 byte) cannot across page boundary. However,
extended inode is now 64-byte form, which can across page boundary
in principle if the location is specified on purpose, although
it's hard to be generated by mkfs due to the allocation policy
and rarely used by Android use case now mainly for > 4GiB files.
For now, only two fields `i_ctime_nsec` and `i_nlink' couldn't
be read from disk properly and cause out-of-bound memory read
with random value.
Let's fix now.
Fixes: 431339ba90 ("staging: erofs: add inode operations")
Cc: <stable@vger.kernel.org> # 4.19+
Link: https://lore.kernel.org/r/20200729175801.GA23973@xiangao.remote.csb
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
Rationale:
Reduces attack surface on kernel devs opening the links for MITM
as HTTPS traffic is much harder to manipulate.
Deterministic algorithm:
For each file:
If not .svg:
For each line:
If doesn't contain `\bxmlns\b`:
For each link, `\bhttp://[^# \t\r\n]*(?:\w|/)`:
If neither `\bgnu\.org/license`, nor `\bmozilla\.org/MPL\b`:
If both the HTTP and HTTPS versions
return 200 OK and serve the same content:
Replace HTTP with HTTPS.
Reviewed-by: Gao Xiang <hsiangkao@redhat.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Alexander A. Klimov <grandmaster@al2klimov.de>
Link: https://lore.kernel.org/r/20200713130944.34419-1-grandmaster@al2klimov.de
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
In gfs2_glock_poke, make sure gfs2_holder_uninit is called on the local
glock holder. Without that, we're leaking a glock and a pid reference.
Fixes: 9e8990dea9 ("gfs2: Smarter iopen glock waiting")
Cc: stable@vger.kernel.org # v5.8+
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Pass a pointer to the existing glock holder from
gfs2_file_{read,write}_iter to gfs2_file_direct_{read,write}
to save some stack space.
Suggested-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Before this patch, three flags were not represented in the glock output.
This patch adds them in:
c - GLF_INODE_CREATING
P - GLF_PENDING_DELETE
x - GLF_FREEING (both f and F are already used)
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
* pm-sleep:
PM: sleep: spread "const char *" correctness
PM: hibernate: fix white space in a few places
freezer: Add unsafe version of freezable_schedule_timeout_interruptible() for NFS
PM: sleep: core: Emit changed uevent on wakeup_sysfs_add/remove
* pm-domains:
PM: domains: Restore comment indentation for generic_pm_domain.child_links
PM: domains: Fix up terminology with parent/child
* powercap:
powercap: Add Power Limit4 support
powercap: idle_inject: Replace play_idle() with play_idle_precise() in comments
powercap: intel_rapl: add support for Sapphire Rapids
* pm-tools:
pm-graph v5.7 - important s2idle fixes
cpupower: Replace HTTP links with HTTPS ones
cpupower: Fix NULL but dereferenced coccicheck errors
cpupower: Fix comparing pointer to 0 coccicheck warns
The variable mds is being initialized with a value that is never read
and it is being updated later with a new value. The initialization is
redundant and can be removed.
Addresses-Coverity: ("Unused value")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
If the ceph_mdsc_init() fails, it will free the mdsc already.
Reported-by: syzbot+b57f46d8d6ea51960b8c@syzkaller.appspotmail.com
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Fix build warnings:
fs/ceph/mdsmap.c: In function ‘ceph_mdsmap_decode’:
fs/ceph/mdsmap.c:192:7: warning: variable ‘info_cv’ set but not used [-Wunused-but-set-variable]
fs/ceph/mdsmap.c:177:7: warning: variable ‘state_seq’ set but not used [-Wunused-but-set-variable]
fs/ceph/mdsmap.c:123:15: warning: variable ‘mdsmap_cv’ set but not used [-Wunused-but-set-variable]
Note that p is increased in ceph_decode_*.
Signed-off-by: Jia Yang <jiayang5@huawei.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Drop duplicated words "down" and "the" in fs/ceph/.
[ idryomov: merge into a single patch ]
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Send metric flags to the MDS, indicating what metrics the client
supports. Currently that consists of cap statistics, and read, write and
metadata latencies.
URL: https://tracker.ceph.com/issues/43435
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
This will send the caps/read/write/metadata metrics to any available MDS
once per second, which will be the same as the userland client. It will
skip the MDS sessions which don't support the metric collection, as the
MDSs will close socket connections when they get an unknown type
message.
We can disable the metric sending via the disable_send_metrics module
parameter.
[ jlayton: fix up endianness bug in ceph_mdsc_send_metrics() ]
URL: https://tracker.ceph.com/issues/43215
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
If the session is already in closed state, we should skip it.
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Rationale:
Reduces attack surface on kernel devs opening the links for MITM
as HTTPS traffic is much harder to manipulate.
Deterministic algorithm:
For each file:
If not .svg:
For each line:
If doesn't contain `\bxmlns\b`:
For each link, `\bhttp://[^# \t\r\n]*(?:\w|/)`:
If neither `\bgnu\.org/license`, nor `\bmozilla\.org/MPL\b`:
If both the HTTP and HTTPS versions
return 200 OK and serve the same content:
Replace HTTP with HTTPS.
[ idryomov: Do the same for the CRUSH paper and replace
ceph.newdream.net with ceph.io. ]
Signed-off-by: Alexander A. Klimov <grandmaster@al2klimov.de>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Remove unnecassary casts in the argument to kfree.
Signed-off-by: Xu Wang <vulab@iscas.ac.cn>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
In aio case, if the completion comes very fast just before the
ceph_read_iter() returns to fs/aio.c, the kiocb will be freed in
the completion callback, then if ceph_read_iter() access again
we will potentially hit the use-after-free bug.
[ jlayton: initialize direct_lock early, and use it everywhere ]
URL: https://tracker.ceph.com/issues/45649
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Make this loop look a bit more sane. Also optimize away the spinlock
release/reacquire if we can't get an inode reference.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Make sure the delayed work stopped before releasing the resources.
cancel_delayed_work_sync() will only guarantee that the work finishes
executing if the work is already in the ->worklist. That means after
the cancel_delayed_work_sync() returns, it will leave the work requeued
if it was rearmed at the end. That can lead to a use after free once the
work struct is freed.
Fix it by flushing the delayed work instead of trying to cancel it, and
ensure that the work doesn't rearm if the mdsc is stopping.
URL: https://tracker.ceph.com/issues/46293
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
...and let the errnos bubble up to the callers.
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
This will help to reduce using the global mdsc->mutex lock in many
places.
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
And remove the unsed mdsc parameter to simplify the code.
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
cifs_mount() for DFS mounts is for a long time way too complex to
follow, mostly because it lacks some documentation, does a lot of
operations like resolving DFS roots and links, checking for path
components, perform failover, crap code, etc.
Besides adding some documentation to it, do some cleanup and ensure
that the following is implemented and supported:
* non-DFS mounts
* DFS failover
* DFS root mounts
- tcon and cifs_sb must contain DFS path (NOT including prefix)
- if prefix path, then save it in cifs_sb and it must not be
changed
* DFS link mounts
- tcon and cifs_sb must contain DFS path (including prefix)
- if prefix path, then save it in cifs_sb and it may be changed
* prevent recursion on broken link referrals (MAX_NESTED_LINKS)
* check every path component of the currently resolved
target (including prefix), and chase them accordingly
* make sure that DFS referrals go through newly resolved root
servers
Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
For DFS root mounts that contain a prefix path, do not change them
after failover.
E.g., if the user mounts
//srvA/root/dir1
and then lost connection to srvA, it will reconnect to
//srvB/root/dir1
In case of DFS links, which may resolve to different prefix paths
depending on their list of targets, the following must be supported:
- mount //srvA/root/link/bar
- connect to //srvA/share
- set prefix path to "bar"
- lost connection to srvA
- reconnect to next target: //srvB/share/foo
- set new prefix path to "foo/bar"
In cifs_tree_connect(), check the server_type field of the cached DFS
referral to determine whether or not prefix path should be updated.
Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Currently if the call dfs_cache_get_tgt_share fails we cannot
fully guarantee that share and prefix are set to NULL and the
next iteration of the loop can end up potentially double freeing
these pointers. Since the semantics of dfs_cache_get_tgt_share
are ambiguous for failure cases with the setting of share and
prefix (currently now and the possibly the future), it seems
prudent to set the pointers to NULL when the objects are
free'd to avoid any double frees.
Addresses-Coverity: ("Double free")
Fixes: 96296c946a2a ("cifs: handle RESP_GET_DFS_REFERRAL.PathConsumed in reconnect")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Use PathConsumed field when parsing prefixes of referral paths that
either match a cache entry or are a complete prefix path of an
existing entry.
Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
In case there were no cached DFS referrals in
reconn_setup_dfs_targets(), set cifs_sb to NULL prior to calling
reconn_set_next_dfs_target() so it would not try to access an empty
tgt_list.
Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
This function has nothing to do with *invalidation* but setting up the
next target server from a cached referral.
Rename it to reconn_set_next_dfs_target(). While at it, get rid of
some meaningless checks.
Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
When looking up the DFS cache with a referral path that has more than
two path components, and is a complete prefix of an existing cache
entry, do not request another referral and just return the matched
entry as specified in MS-DFSC 3.2.5.5 Receiving a Root Referral
Request or Link Referral Request.
Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
They were identical execpt to CIFSTCon() vs. SMB2_tcon().
These are also available via ops->tree_connect().
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Convert cpu_to_be32(be32_to_cpu(E1) + E2) to use be32_add_cpu().
Signed-off-by: Qinglang Miao <miaoqinglang@huawei.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Drop repeated words in multiple comments.
(be, use, the, See)
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Steve French <sfrench@samba.org>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Signed-off-by: Steve French <stfrench@microsoft.com>
Remove the superfuous break, as there is a 'return' before it.
Signed-off-by: Liao Pingfang <liao.pingfang@zte.com.cn>
Signed-off-by: Yi Wang <wang.yi59@zte.com.cn>
Signed-off-by: Steve French <stfrench@microsoft.com>
RHBZ 1145308
Some very old server may not support SetPathInfo to adjust the timestamps
of directories. For these servers, try to open the directory and use SetFileInfo.
Minor correction to patch included that was
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Tested-by: Kenneth D'souza <kdsouza@redhat.com>
If server returns ERRBaduid but does not reset transport connection,
we'll keep sending command with a non-valid UID for the server as long
as transport is healthy, without actually recovering. This have been
observed on the field.
This patch adds ERRBaduid handling so that we set CifsNeedReconnect.
map_and_check_smb_error() can be modified to extend use cases.
Signed-off-by: Roberto Bergantinos Corpas <rbergant@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Fix build warning by removing unused variable 'server':
fs/cifs/inode.c:1089:26: warning:
variable server set but not used [-Wunused-but-set-variable]
1089 | struct TCP_Server_Info *server;
| ^~~~~~
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
When mounting with Kerberos, users have been confused about the
default error returned in scenarios in which either keyutils is
not installed or the user did not properly acquire a krb5 ticket.
Log a warning message in the case that "ENOKEY" is returned
from the get_spnego_key upcall so that users can better understand
why mount failed in those two cases.
CC: Stable <stable@vger.kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
The log of UAF problem is listed below.
BUG: KASAN: use-after-free in jffs2_rmdir+0xa4/0x1cc [jffs2] at addr c1f165fc
Read of size 4 by task rm/8283
=============================================================================
BUG kmalloc-32 (Tainted: P B O ): kasan: bad access detected
-----------------------------------------------------------------------------
INFO: Allocated in 0xbbbbbbbb age=3054364 cpu=0 pid=0
0xb0bba6ef
jffs2_write_dirent+0x11c/0x9c8 [jffs2]
__slab_alloc.isra.21.constprop.25+0x2c/0x44
__kmalloc+0x1dc/0x370
jffs2_write_dirent+0x11c/0x9c8 [jffs2]
jffs2_do_unlink+0x328/0x5fc [jffs2]
jffs2_rmdir+0x110/0x1cc [jffs2]
vfs_rmdir+0x180/0x268
do_rmdir+0x2cc/0x300
ret_from_syscall+0x0/0x3c
INFO: Freed in 0x205b age=3054364 cpu=0 pid=0
0x2e9173
jffs2_add_fd_to_list+0x138/0x1dc [jffs2]
jffs2_add_fd_to_list+0x138/0x1dc [jffs2]
jffs2_garbage_collect_dirent.isra.3+0x21c/0x288 [jffs2]
jffs2_garbage_collect_live+0x16bc/0x1800 [jffs2]
jffs2_garbage_collect_pass+0x678/0x11d4 [jffs2]
jffs2_garbage_collect_thread+0x1e8/0x3b0 [jffs2]
kthread+0x1a8/0x1b0
ret_from_kernel_thread+0x5c/0x64
Call Trace:
[c17ddd20] [c02452d4] kasan_report.part.0+0x298/0x72c (unreliable)
[c17ddda0] [d2509680] jffs2_rmdir+0xa4/0x1cc [jffs2]
[c17dddd0] [c026da04] vfs_rmdir+0x180/0x268
[c17dde00] [c026f4e4] do_rmdir+0x2cc/0x300
[c17ddf40] [c001a658] ret_from_syscall+0x0/0x3c
The root cause is that we don't get "jffs2_inode_info.sem" before
we scan list "jffs2_inode_info.dents" in function jffs2_rmdir.
This patch add codes to get "jffs2_inode_info.sem" before we scan
"jffs2_inode_info.dents" to slove the UAF problem.
Signed-off-by: Zhe Li <lizhe67@huawei.com>
Reviewed-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
Thanks for the advice mentioned in the email.
This is my v3 patch for this problem.
Mounting jffs2 on nand flash will get message "failed: I/O error"
with the steps listed below.
1.umount jffs2
2.erase nand flash
3.mount jffs2 on it (this mounting operation will be successful)
4.do chown or chmod to the mount point directory
5.umount jffs2
6.mount jffs2 on nand flash
After step 6, we will get message "mount ... failed: I/O error".
Typical image of this problem is like:
Empty space found from 0x00000000 to 0x008a0000
Inode node at xx, totlen 0x00000044, #ino 1, version 1, isize 0...
The reason for this mounting failure is that at the end of function
jffs2_scan_medium(), jffs2 will check the used_size and some info
of nr_blocks.If conditions are met, it will return -EIO.
The detail is that, in the steps listed above, step 4 will write
jffs2_raw_inode into flash without jffs2_raw_dirent, which will
cause that there are some jffs2_raw_inode but no jffs2_raw_dirent
on flash. This will meet the condition at the end of function
jffs2_scan_medium() and return -EIO if we umount jffs2 and mount it
again.
We notice that jffs2 add the value of c->unchecked_size if we find
an inode node while mounting. And jffs2 will never add the value of
c->unchecked_size in other situations. So this patch add one more
condition about c->unchecked_size of the judgement to fix this problem.
Signed-off-by: Zhe Li <lizhe67@huawei.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
There a wrong orphan node deleting in error handling path in
ubifs_jnl_update() and ubifs_jnl_rename(), which may cause
following error msg:
UBIFS error (ubi0:0 pid 1522): ubifs_delete_orphan [ubifs]:
missing orphan ino 65
Fix this by checking whether the node has been operated for
adding to orphan list before being deleted,
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Fixes: 823838a486 ("ubifs: Add hashes to the tree node cache")
Signed-off-by: Richard Weinberger <richard@nod.at>
Drop the repeated word "as" in a comment.
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: linux-mtd@lists.infradead.org
Signed-off-by: Richard Weinberger <richard@nod.at>
Instead of creating ubifs file systems with UBIFS_FORMAT_VERSION
by default, add a module parameter ubifs.default_version to allow
the user to specify the desired version. Valid values are 4 to
UBIFS_FORMAT_VERSION (currently 5).
This way, one can for example create a file system with version 4
on kernel 4.19 which can still be mounted rw when downgrading to
kernel 4.9.
Signed-off-by: Martin Kaistra <martin.kaistra@linutronix.de>
Signed-off-by: Richard Weinberger <richard@nod.at>
nfs_wb_all() calls filemap_write_and_wait(), which uses
filemap_check_errors() to determine the error to return.
filemap_check_errors() only looks at the mapping->flags and will
therefore only return either -ENOSPC or -EIO. To ensure that the
correct error is returned on close(), nfs{,4}_file_flush() should call
filemap_check_wb_err() which looks at the errseq value in
mapping->wb_err without consuming it.
Fixes: 6fbda89b25 ("NFS: Replace custom error reporting mechanism with
generic one")
Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
As recently done with with send/recv, flip the if after
rw_verify_aread() in io_{read,write}() and tabulise left bits left.
This removes mispredicted by a compiler jump on the success/fast path.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
As soon as we install the file descriptor, we have to assume that it
can get arbitrarily closed. We currently account memory (and note that
we did) after installing the ring fd, which means that it could be a
potential use-after-free condition if the fd is closed right after
being installed, but before we fiddle with the ctx.
In fact, syzbot reported this exact scenario:
BUG: KASAN: use-after-free in io_account_mem fs/io_uring.c:7397 [inline]
BUG: KASAN: use-after-free in io_uring_create fs/io_uring.c:8369 [inline]
BUG: KASAN: use-after-free in io_uring_setup+0x2797/0x2910 fs/io_uring.c:8400
Read of size 1 at addr ffff888087a41044 by task syz-executor.5/18145
CPU: 0 PID: 18145 Comm: syz-executor.5 Not tainted 5.8.0-rc7-next-20200729-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x18f/0x20d lib/dump_stack.c:118
print_address_description.constprop.0.cold+0xae/0x497 mm/kasan/report.c:383
__kasan_report mm/kasan/report.c:513 [inline]
kasan_report.cold+0x1f/0x37 mm/kasan/report.c:530
io_account_mem fs/io_uring.c:7397 [inline]
io_uring_create fs/io_uring.c:8369 [inline]
io_uring_setup+0x2797/0x2910 fs/io_uring.c:8400
do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x45c429
Code: 8d b6 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 5b b6 fb ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007f8f121d0c78 EFLAGS: 00000246 ORIG_RAX: 00000000000001a9
RAX: ffffffffffffffda RBX: 0000000000008540 RCX: 000000000045c429
RDX: 0000000000000000 RSI: 0000000020000040 RDI: 0000000000000196
RBP: 000000000078bf38 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 000000000078bf0c
R13: 00007fff86698cff R14: 00007f8f121d19c0 R15: 000000000078bf0c
Move the accounting of the ring used locked memory before we get and
install the ring file descriptor.
Cc: stable@vger.kernel.org
Reported-by: syzbot+9d46305e76057f30c74e@syzkaller.appspotmail.com
Fixes: 309758254e ("io_uring: report pinned memory usage")
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add a simple helper to set timestamps with a kernel space file name and
switch the early init code over to it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Add a simple helper to mknod with a kernel space file name and switch
the early init code over to it. Remove the now unused ksys_mknod.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Add a simple helper to mkdir with a kernel space file name and switch
the early init code over to it. Remove the now unused ksys_mkdir.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Add a simple helper to symlink with a kernel space file name and switch
the early init code over to it. Remove the now unused ksys_symlink.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Add a simple helper to link with a kernel space file name and switch
the early init code over to it. Remove the now unused ksys_link.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Add a simple helper to check if a file exists based on kernel space file
name and switch the early init code over to it. Note that this
theoretically changes behavior as it always is based on the effective
permissions. But during early init that doesn't make a difference.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Add a simple helper to chroot with a kernel space file name and switch
the early init code over to it. Remove the now unused ksys_chroot.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Add a simple helper to chdir with a kernel space file name and switch
the early init code over to it. Remove the now unused ksys_chdir.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Add a simple helper to rmdir with a kernel space file name and switch
the early init code over to it. Remove the now unused ksys_rmdir.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Add a simple helper to unlink with a kernel space file name and switch
the early init code over to it. Remove the now unused ksys_unlink.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Like ksys_umount, but takes a kernel pointer for the destination path.
Switch over the umount in the init code, which just happen to work due to
the implicit set_fs(KERNEL_DS) during early init right now.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Like do_mount, but takes a kernel pointer for the destination path.
Switch over the mounts in the init code and devtmpfs to it, which
just happen to work due to the implicit set_fs(KERNEL_DS) during early
init right now.
Signed-off-by: Christoph Hellwig <hch@lst.de>
This mirrors do_unlinkat and will make life a little easier for
the init code to reuse the whole function with a kernel filename.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Factor out a path_umount helper that takes a struct path * instead of the
actual file name. This will allow to convert the init and devtmpfs code
to properly mount based on a kernel pointer instead of relying on the
implicit set_fs(KERNEL_DS) during early init.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Factor out a path_mount helper that takes a struct path * instead of the
actual file name. This will allow to convert the init and devtmpfs code
to properly mount based on a kernel pointer instead of relying on the
implicit set_fs(KERNEL_DS) during early init.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Rename utimes_common to vfs_utimes and make it available outside of
utimes.c. This will be used by the initramfs unpacking code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Consolidate the validation of the timespec from the two callers into
utimes_common. That means it is done a little later (e.g. after the
path lookup), but I can't find anything that requires a specific
order of processing the errors.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Split out one helper each for path vs fd based operations.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove kmem_cache_alloc return value cast.
Coccinelle emits the following warning:
./fs/9p/vfs_inode.c:226:12-29: WARNING: casting value returned by memory allocation function to (struct v9fs_inode *) is useless.
Link: http://lkml.kernel.org/r/1596013140-49744-1-git-send-email-liheng40@huawei.com
Signed-off-by: Li Heng <liheng40@huawei.com>
[Dominique: commit message wording]
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
The argument passed to xas_set_err() to indicate an error should be negative.
Otherwise, xas_error() will return 0, and grab_mapping_entry() will return the
found entry instead of 'SIGBUS' when the entry is not in fact valid.
This would result in problems in subsequent code paths.
Link: https://lore.kernel.org/r/20200729034436.24267-1-lihao2018.fnst@cn.fujitsu.com
Reviewed-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Signed-off-by: Hao Li <lihao2018.fnst@cn.fujitsu.com>
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
In fscrypt_set_bio_crypt_ctx(), ->i_crypt_info isn't known to be
non-NULL until we check fscrypt_inode_uses_inline_crypto(). So, load
->i_crypt_info after the check rather than before. This makes no
difference currently, but it prevents people from introducing bugs where
the pointer is dereferenced when it may be NULL.
Suggested-by: Dave Chinner <david@fromorbit.com>
Cc: Satya Tangirala <satyat@google.com>
Link: https://lore.kernel.org/r/20200727174158.121456-1-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Caches should be small enough to discard them inline, so do that
instead of using a work queue.
Signed-off-by: Frank van der Linden <fllinden@amazon.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
If ->cq_timeouts modifications are done under ->completion_lock, we
don't really nee any fetch-and-add and other complex atomics. Replace it
with non-atomic FAA, that saves an implicit full memory barrier.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add a helper to mark ctx->{cq,sq}_check_overflow to get rid of
duplicates, and it's clearer to check cq_overflow_list directly anyway.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Always do io_commit_cqring() after completing a request, even if it was
accounted as overflowed on the CQ side. Failing to do that may lead to
not to pushing deferred requests when needed, and so stalling the whole
ring.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
All ->cq_overflow modifications should be under completion_lock,
otherwise it can report a wrong number to the userspace. Fix it in
io_uring_cancel_files().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Call __io_complete_rw() in io_iopoll_queue() instead of hand coding it.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
As io_kiocb have enough space, move ->work out of a union. It's safer
this way and removes ->work memcpy bouncing.
By the way make tabulation in struct io_kiocb consistent.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There is no good reason to mess with file descriptors from in-kernel
code, switch the initrd loading to struct file based read and writes
instead.
Also Pass an explicit offset instead of ->f_pos, and to make that easier,
use file scope file structs and offsets everywhere except for
identify_ramdisk_image instead of the current strange mix.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
A sequence counter write side critical section must be protected by some
form of locking to serialize writers. A plain seqcount_t does not
contain the information of which lock must be held when entering a write
side critical section.
Use the new seqcount_spinlock_t data type, which allows to associate a
spinlock with the sequence counter. This enables lockdep to verify that
the spinlock used for writer serialization is held when the write side
critical section is entered.
If lockdep is disabled this lock association is compiled out and has
neither storage size nor runtime overhead.
Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200720155530.1173732-23-a.darwish@linutronix.de
A sequence counter write side critical section must be protected by some
form of locking to serialize writers. A plain seqcount_t does not
contain the information of which lock must be held when entering a write
side critical section.
Use the new seqcount_spinlock_t data type, which allows to associate a
spinlock with the sequence counter. This enables lockdep to verify that
the spinlock used for writer serialization is held when the write side
critical section is entered.
If lockdep is disabled this lock association is compiled out and has
neither storage size nor runtime overhead.
Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200720155530.1173732-22-a.darwish@linutronix.de
A sequence counter write side critical section must be protected by some
form of locking to serialize writers. A plain seqcount_t does not
contain the information of which lock must be held when entering a write
side critical section.
Use the new seqcount_spinlock_t data type, which allows to associate a
spinlock with the sequence counter. This enables lockdep to verify that
the spinlock used for writer serialization is held when the write side
critical section is entered.
If lockdep is disabled this lock association is compiled out and has
neither storage size nor runtime overhead.
Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200720155530.1173732-19-a.darwish@linutronix.de
1) FS_DAX_FL has been introduced by commit b383a73f2b.
2) In future, chattr/lsattr command from e2fsprogs can set/get
inode DAX on XFS by calling ioctl(SETXFLAGS/GETXFLAGS).
Signed-off-by: Xiao Yang <yangx.jy@cn.fujitsu.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Lift -ENOSPC handler from xfs_attr_leaf_addname. This will help to
reorganize transitions between the attr forms later.
Signed-off-by: Allison Collins <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Acked-by: Dave Chinner <dchinner@redhat.com>
Invert the rename logic in xfs_attr_node_addname to simplify the
delayed attr logic later.
Signed-off-by: Allison Collins <allison.henderson@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Acked-by: Dave Chinner <dchinner@redhat.com>
Invert the rename logic in xfs_attr_leaf_addname to simplify the
delayed attr logic later.
Signed-off-by: Allison Collins <allison.henderson@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Acked-by: Dave Chinner <dchinner@redhat.com>
This patch adds another new helper function
xfs_attr_node_removename_rmt. This will also help modularize
xfs_attr_node_removename when we add delay ready attributes later.
Signed-off-by: Allison Collins <allison.henderson@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Chandan Rajendra <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Acked-by: Dave Chinner <dchinner@redhat.com>
This patch adds a new helper function xfs_attr_node_removename_setup.
This will help modularize xfs_attr_node_removename when we add delay
ready attributes later.
Signed-off-by: Allison Collins <allison.henderson@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Chandan Rajendra <chandanrlinux@gmail.com>
[darrick: fix unused variable complaints by 0day robot]
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Acked-by: Dave Chinner <dchinner@redhat.com>
Reported-by: kernel test robot <lkp@intel.com>
This patch adds two new helper functions xfs_attr_store_rmt_blk and
xfs_attr_restore_rmt_blk. These two helpers assist to remove redundant
code associated with storing and retrieving remote blocks during the
attr set operations.
Signed-off-by: Allison Collins <allison.henderson@oracle.com>
Reviewed-by: Chandan Rajendra <chandanrlinux@gmail.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Acked-by: Dave Chinner <dchinner@redhat.com>
This patch helps to simplify xfs_attr_node_removename by modularizing
the code around the transactions into helper functions. This will make
the function easier to follow when we introduce delayed attributes.
Signed-off-by: Allison Collins <allison.henderson@oracle.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Chandan Rajendra <chandanrlinux@gmail.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Acked-by: Dave Chinner <dchinner@redhat.com>
In this patch, we hoist code from xfs_attr_set_args into two new helpers
xfs_attr_is_shortform and xfs_attr_set_shortform. These two will help
to simplify xfs_attr_set_args when we get into delayed attrs later.
Signed-off-by: Allison Collins <allison.henderson@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Chandan Rajendra <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Acked-by: Dave Chinner <dchinner@redhat.com>
A transaction roll is not necessary immediately after setting the
INCOMPLETE flag when removing a node xattr entry with remote value
blocks. The remote block invalidation that immediately follows setting
the flag is an in-core only change. The next step after that is to start
unmapping the remote blocks from the attr fork, but the xattr remove
transaction reservation includes reservation for full tree splits of the
dabtree and bmap tree. The remote block unmap code will roll the
transaction as extents are unmapped and freed.
Signed-off-by: Allison Collins <allison.henderson@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Acked-by: Dave Chinner <dchinner@redhat.com>
Some calls to xfs_trans_roll_inode and xfs_defer_finish routines are not
needed. If they are the last operations executed in these functions, and
no further changes are made, then higher level routines will roll or
commit the transactions.
Signed-off-by: Allison Collins <allison.henderson@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Acked-by: Dave Chinner <dchinner@redhat.com>
This patch adds a new helper function xfs_attr_node_shrink used to
shrink an attr name into an inode if it is small enough. This helps to
modularize the greater calling function xfs_attr_node_removename.
Signed-off-by: Allison Collins <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Acked-by: Dave Chinner <dchinner@redhat.com>
This patch pulls xfs_attr_rmtval_invalidate out of
xfs_attr_rmtval_remove and into the calling functions. Eventually
__xfs_attr_rmtval_remove will replace xfs_attr_rmtval_remove when we
introduce delayed attributes. These functions are exepcted to return
-EAGAIN when they need a new transaction. Because the invalidate does
not need a new transaction, we need to separate it from the rest of the
function that does. This will enable __xfs_attr_rmtval_remove to
smoothly replace xfs_attr_rmtval_remove later.
Signed-off-by: Allison Collins <allison.henderson@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Acked-by: Dave Chinner <dchinner@redhat.com>
Refactor xfs_attr_rmtval_remove to add helper function
__xfs_attr_rmtval_remove. We will use this later when we introduce
delayed attributes. This function will eventually replace
xfs_attr_rmtval_remove
Signed-off-by: Allison Collins <allison.henderson@oracle.com>
Reviewed-by: Chandan Rajendra <chandanrlinux@gmail.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Acked-by: Dave Chinner <dchinner@redhat.com>
New delayed allocation routines cannot be handling transactions so
pull them out into the calling functions
Signed-off-by: Allison Collins <allison.henderson@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Chandan Rajendra <chandanrlinux@gmail.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Acked-by: Dave Chinner <dchinner@redhat.com>
Because new delayed attribute routines cannot roll transactions, we
carve off the parts of xfs_attr_rmtval_remove that we can use. This
will help to reduce repetitive code later when we introduce delayed
attributes.
Signed-off-by: Allison Collins <allison.henderson@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Chandan Rajendra <chandanrlinux@gmail.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Acked-by: Dave Chinner <dchinner@redhat.com>
New delayed allocation routines cannot be handling transactions so
pull them up into the calling functions
Signed-off-by: Allison Collins <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Chandan Rajendra <chandanrlinux@gmail.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Acked-by: Dave Chinner <dchinner@redhat.com>
To help pre-simplify xfs_attr_set_args, we need to hoist transaction
handling up, while modularizing the adjacent code down into helpers. In
this patch, hoist the commit in xfs_attr_try_sf_addname up into the
calling function, and also pull the attr list creation down.
Signed-off-by: Allison Collins <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Chandan Rajendra <chandanrlinux@gmail.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Acked-by: Dave Chinner <dchinner@redhat.com>
Split out new helper function xfs_attr_leaf_try_add from
xfs_attr_leaf_addname. Because new delayed attribute routines cannot
roll transactions, we split off the parts of xfs_attr_leaf_addname that
we can use, and move the commit into the calling function.
Signed-off-by: Allison Collins <allison.henderson@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Chandan Rajendra <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Acked-by: Dave Chinner <dchinner@redhat.com>
Since delayed operations cannot roll transactions, pull up the
transaction handling into the calling function
Signed-off-by: Allison Collins <allison.henderson@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Chandan Rajendra <chandanrlinux@gmail.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Acked-by: Dave Chinner <dchinner@redhat.com>
Break xfs_attr_rmtval_set into two helper functions
xfs_attr_rmt_find_hole and xfs_attr_rmtval_set_value.
xfs_attr_rmtval_set rolls the transaction between the helpers, but
delayed operations cannot. We will use the helpers later when
constructing new delayed attribute routines.
Signed-off-by: Allison Collins <allison.henderson@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Chandan Rajendra <chandanrlinux@gmail.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Acked-by: Dave Chinner <dchinner@redhat.com>
Delayed operations cannot return error codes. So we must check for
these conditions first before starting set or remove operations
Signed-off-by: Allison Collins <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Acked-by: Dave Chinner <dchinner@redhat.com>
This patch adds a new functions to check for the existence of an
attribute. Subroutines are also added to handle the cases of leaf
blocks, nodes or shortform. Common code that appears in existing attr
add and remove functions have been factored out to help reduce the
appearance of duplicated code. We will need these routines later for
delayed attributes since delayed operations cannot return error codes.
Signed-off-by: Allison Collins <allison.henderson@oracle.com>
Reviewed-by: Chandan Rajendra <chandanrlinux@gmail.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
[darrick: fix a leak-on-error bug reported by Dan Carpenter]
[darrick: fix unused variable warning reported by 0day]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Acked-by: Dave Chinner <dchinner@redhat.com>
Reported-by: dan.carpenter@oracle.com
Reported-by: kernel test robot <lkp@intel.com>
Every call to xfs_da_state_alloc() also requires setting up state->args
and state->mp
Change xfs_da_state_alloc() to receive an xfs_da_args_t as argument and
return a xfs_da_state_t with both args and mp already set.
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
[darrick: reduce struct typedef usage]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
All their users have been converted to use MM API directly, no need to
keep them around anymore.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
xlog_ticket_alloc() is always called under NOFS context, except from
unmount path, which eitherway is holding many FS locks, so, there is no
need for its callers to keep passing allocation flags into it.
change xlog_ticket_alloc() to use default kmem_cache_zalloc(), remove
its alloc_flags argument, and always use GFP_NOFS | __GFP_NOFAIL flags.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Use kmem_cache_zalloc() directly.
With the exception of xlog_ticket_alloc() which will be dealt on the
next patch for readability.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Use kmem_cache_alloc() directly.
All kmem_zone_alloc() users pass 0 as flags, which are translated into:
GFP_KERNEL | __GFP_NOWARN, and kmem_zone_alloc() loops forever until the
allocation succeeds.
We can use __GFP_NOFAIL to tell the allocator to loop forever rather
than doing it ourself, and because the allocation will never fail, we do
not need to use __GFP_NOWARN anymore. Hence, all callers can be
converted to use GFP_KERNEL | __GFP_NOFAIL
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
[darrick: add a comment back in about nofail]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Drop the repeated words "with" and "be" in comments.
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: linux-xfs@vger.kernel.org
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The ondisk dquot stores the quota record type in the flags field.
Rename this field to d_type to make the _type relationship between the
ondisk and incore dquot more obvious.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Create an XFS_DQTYPE_ANY mask for ondisk dquots flags, and use that to
ensure that we never accept any garbage flags when we're loading dquots.
While we're at it, restructure the quota type flag checking to use the
proper masking.
Note that I plan to add y2038 support soon, which will require a new
xfs_dqtype_t flag for extended timestamp support, hence all the work to
make the type masking work correctly.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Create a new type (xfs_dqtype_t) to represent the type of an incore
dquot (user, group, project, or none). Rename the incore dquot's
dq_flags field to q_type.
This allows us to replace all the "uint type" arguments to the quota
functions with "xfs_dqtype_t type", to make it obvious when we're
passing a quota type argument into a function.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Fix a few places where we open-coded this mask constant.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
When XFS' quota functions take a parameter for the quota type, they only
care about the three quota record types (user, group, project).
Internal state flags and whatnot should never be passed by callers and
are an error. Now that we've moved responsibility for filtering out
internal state to the callers, we can drop the masking everywhere else.
In other words, if you call a quota function, you must only pass in
one of XFS_DQTYPE_{USER,GROUP,PROJ}.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Always use the xfs_dquot_type helper to extract the quota type from an
incore dquot. This moves responsibility for filtering internal state
information and whatnot to anybody passing around a struct xfs_dquot.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Certain functions can only act upon one quota type, so refactor those
functions to use switch statements, in keeping with all the other high
level xfs quota api calls.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Remove these macros and use xfs_dquot_type() for everything.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Create a small helper to test if enforcement is enabled for a
given incore dquot and replace the open-code logic testing.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
We're going to split up the incore dquot state flags from the ondisk
dquot flags (eventually renaming this "type") so start by renaming the
three flags and the bitmask that are going to participate in this.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
xfs_qm_reset_dqcounts (aka quotacheck) is the only xfs_dqblk_verify
caller that actually knows the specific quota type that it's looking
for. Since everything else just pass in type==0 (including the buffer
verifier), drop the parameter and open-code the check like
xfs_dquot_from_disk already does.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Add all the xfs_dquot fields to the tracepoint for that type; add a new
tracepoint type for the qtrx structure (dquot transaction deltas); and
use our new tracepoints. This makes it easier for the author to trace
changes to dquot counters for debugging.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Allison Collins <allison.henderson@oracle.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Currently, xfs quotas have the ability to send netlink warnings when a
user exceeds the limits. They also have all the support code necessary
to convert softlimit warnings into failures if the number of warnings
exceeds a limit set by the administrator. Unfortunately, we never
actually increase the warning counter, so this never actually happens.
Make it so we actually do something useful with the warning counts.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
We always initialize the default quota limits to something nowadays, so
we don't need to check that the defaults are set to something before
using them.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Hoist the code that adjusts the incore quota reservation count
adjustments into a separate function, both to reduce the level of
indentation and also to reduce the amount of open-coded logic.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Allison Collins <allison.henderson@oracle.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Now that we've refactored the resource usage and limits into
per-resource structures, we can refactor some of the open-coded
reservation limit checking in xfs_trans_dqresv.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Allison Collins <allison.henderson@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Now that we can pass around quota resource and limit structures, clean
up the open-coded field setting in xfs_qm_scall_setqlim.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Allison Collins <allison.henderson@oracle.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Refactor the open-coded test for whether or not we're over quota.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
struct xfs_dquot already has a pointer to the xfs mount, so remove the
redundant parameter from xfs_qm_adjust_dq*.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Now that we've split up the dquot resource fields into separate structs,
do the same for the default limits to enable further refactoring.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Now that we've stopped using qcore entirely, drop it from the incore
dquot.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Add timers fields to the incore dquot, and use that instead of the ones
in qcore. This eliminates a bunch of endian conversions and will
eventually allow us to remove qcore entirely.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Allison Collins <allison.henderson@oracle.com>
Add warning counter fields to the incore dquot, and use that instead of
the ones in qcore. This eliminates a bunch of endian conversions and
will eventually allow us to remove qcore entirely.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Allison Collins <allison.henderson@oracle.com>
Add counter fields to the incore dquot, and use that instead of the ones
in qcore. This eliminates a bunch of endian conversions and will
eventually allow us to remove qcore entirely.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Allison Collins <allison.henderson@oracle.com>
Add limits fields in the incore dquot, and use that instead of the ones
in qcore. This eliminates a bunch of endian conversions and will
eventually allow us to remove qcore entirely.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Allison Collins <allison.henderson@oracle.com>
Introduce a new struct xfs_dquot_res that we'll use to track all the
incore data for a particular resource type (block, inode, rt block).
This will help us (once we've eliminated q_core) to declutter quota
functions that currently open-code field access or pass around fields
around explicitly.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Allison Collins <allison.henderson@oracle.com>
Add a dquot id field to the incore dquot, and use that instead of the
one in qcore. This eliminates a bunch of endian conversions and will
eventually allow us to remove qcore entirely.
We also rearrange the start of xfs_dquot to remove padding holes, saving
8 bytes.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Allison Collins <allison.henderson@oracle.com>
Use the incore dq_flags to figure out the dquot type. This is the first
step towards removing xfs_disk_dquot from the incore dquot.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Move the dquot cluster size #define to xfs_format.h. It is an important
part of the ondisk format because the ondisk dquot record size is not an
even power of two, which means that the buffer size we use is
significant here because the kernel leaves slack space at the end of the
buffer to avoid having to deal with a dquot record crossing a block
boundary.
This is also an excuse to fix one of the longstanding discrepancies
between kernel and userspace libxfs headers.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Rename the existing incore dquot "dq_flags" field to "q_flags" to match
everything else in the structure, then move the two actual dquot state
flags to the XFS_DQFLAG_ namespace from XFS_DQ_.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
We only use the XFS_QMOPT flags in quotacheck to signal the quota type,
so rip out all the flags handling and just pass the type all the way
through.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Since xfs_qm_scall_trunc_qfiles can take a bitset of quota types that we
want to truncate, change the flags argument to take XFS_QMOPT_[UGP}QUOTA
so that the next patch can start to deprecate XFS_DQ_*.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
While loading dquot records off disk, make sure that the quota type
flags are the same between the incore dquot and the ondisk dquot.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
xfs_trans_dqresv is the function that we use to make reservations
against resource quotas. Each resource contains two counters: the
q_core counter, which tracks resources allocated on disk; and the dquot
reservation counter, which tracks how much of that resource has either
been allocated or reserved by threads that are working on metadata
updates.
For disk blocks, we compare the proposed reservation counter against the
hard and soft limits to decide if we're going to fail the operation.
However, for inodes we inexplicably compare against the q_core counter,
not the incore reservation count.
Since the q_core counter is always lower than the reservation count and
we unlock the dquot between reservation and transaction commit, this
means that multiple threads can reserve the last inode count before we
hit the hard limit, and when they commit, we'll be well over the hard
limit.
Fix this by checking against the incore inode reservation counter, since
we would appear to maintain that correctly (and that's what we report in
GETQUOTA).
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Allison Collins <allison.henderson@oracle.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
In commit 8d3d7e2b35, we changed xfs_qm_dqpurge to bail out if we
can't lock the dquot buf to flush the dquot. This prevents the AIL from
blocking on the dquot, but it also forgets to clear the FREEING flag on
its way out. A subsequent purge attempt will see the FREEING flag is
set and bail out, which leads to dqpurge_all failing to purge all the
dquots.
(copy-pasting from Dave Chinner's identical patch)
This was found by inspection after having xfs/305 hang 1 in ~50
iterations in a quotaoff operation:
[ 8872.301115] xfs_quota D13888 92262 91813 0x00004002
[ 8872.302538] Call Trace:
[ 8872.303193] __schedule+0x2d2/0x780
[ 8872.304108] ? do_raw_spin_unlock+0x57/0xd0
[ 8872.305198] schedule+0x6e/0xe0
[ 8872.306021] schedule_timeout+0x14d/0x300
[ 8872.307060] ? __next_timer_interrupt+0xe0/0xe0
[ 8872.308231] ? xfs_qm_dqusage_adjust+0x200/0x200
[ 8872.309422] schedule_timeout_uninterruptible+0x2a/0x30
[ 8872.310759] xfs_qm_dquot_walk.isra.0+0x15a/0x1b0
[ 8872.311971] xfs_qm_dqpurge_all+0x7f/0x90
[ 8872.313022] xfs_qm_scall_quotaoff+0x18d/0x2b0
[ 8872.314163] xfs_quota_disable+0x3a/0x60
[ 8872.315179] kernel_quotactl+0x7e2/0x8d0
[ 8872.316196] ? __do_sys_newstat+0x51/0x80
[ 8872.317238] __x64_sys_quotactl+0x1e/0x30
[ 8872.318266] do_syscall_64+0x46/0x90
[ 8872.319193] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 8872.320490] RIP: 0033:0x7f46b5490f2a
[ 8872.321414] Code: Bad RIP value.
Returning -EAGAIN from xfs_qm_dqpurge() without clearing the
XFS_DQ_FREEING flag means the xfs_qm_dqpurge_all() code can never
free the dquot, and we loop forever waiting for the XFS_DQ_FREEING
flag to go away on the dquot that leaked it via -EAGAIN.
Fixes: 8d3d7e2b35 ("xfs: trylock underlying buffer on dquot flush")
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Allison Collins <allison.henderson@oracle.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
The block reservation calculation for inode allocation is supposed
to consist of the blocks required for the inode chunk plus
(maxlevels-1) of the inode btree multiplied by the number of inode
btrees in the fs (2 when finobt is enabled, 1 otherwise).
Instead, the macro returns (ialloc_blocks + 2) due to a precedence
error in the calculation logic. This leads to block reservation
overruns via generic/531 on small block filesystems with finobt
enabled. Add braces to fix the calculation and reserve the
appropriate number of blocks.
Fixes: 9d43b180af ("xfs: update inode allocation/free transaction reservations for finobt")
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
xfsaild is racy with respect to transaction abort and shutdown in
that the task can idle or exit with an empty AIL but buffers still
on the delwri queue. This was partly addressed by cancelling the
delwri queue before the task exits to prevent memory leaks, but it's
also possible for xfsaild to empty and idle with buffers on the
delwri queue. For example, a transaction that pins a buffer that
also happens to sit on the AIL delwri queue will explicitly remove
the associated log item from the AIL if the transaction aborts. The
side effect of this is an unmount hang in xfs_wait_buftarg() as the
associated buffers remain held by the delwri queue indefinitely.
This is reproduced on repeated runs of generic/531 with an fs format
(-mrmapbt=1 -bsize=1k) that happens to also reproduce transaction
aborts.
Update xfsaild to not idle until both the AIL and associated delwri
queue are empty and update the push code to continue delwri queue
submission attempts even when the AIL is empty. This allows the AIL
to eventually release aborted buffers stranded on the delwri queue
when they are unlocked by the associated transaction. This should
have no significant effect on normal runtime behavior because the
xfsaild currently idles only when the AIL is empty and in practice
the AIL is rarely empty with a populated delwri queue. The items
must be AIL resident to land in the queue in the first place and
generally aren't removed until writeback completes.
Note that the pre-existing delwri queue cancel logic in the exit
path is retained because task stop is external, could technically
come at any point, and xfsaild is still responsible to release its
buffer references before it exits.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Passing size to copy_user_dax implies it can copy variable sizes of data
when in fact it calls copy_user_page() which is exactly a page.
We are safe because the only caller uses PAGE_SIZE anyway so just remove
the variable for clarity.
While we are at it change copy_user_dax() to copy_cow_page_dax() to make
it clear it is a singleton helper for this one case not implementing
what dax_iomap_actor() does.
Link: https://lore.kernel.org/r/20200717072056.73134-11-ira.weiny@intel.com
Reviewed-by: Ben Widawsky <ben.widawsky@intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
Al Viro pointed out that I broke some acl functionality...
* ACLs could not be fully removed
* posix_acl_chmod would be called while the old ACL was still cached
* new mode propagated to orangefs server before ACL.
... when I tried to make sure that modes that got changed as a
result of ACL-sets would be sent back to the orangefs server.
Not wanting to try and change the code without having some cases to
test it with, I began to hunt for setfacl examples that were expressible
in pure mode. Along the way I found examples like the following
which confused me:
user A had a file (/home/A/asdf) with mode 740
user B was in user A's group
user C was not in user A's group
setfacl -m u:C:rwx /home/A/asdf
The above setfacl caused ls -l /home/A/asdf to show a mode of 770,
making it appear that all users in user A's group now had full access
to /home/A/asdf, however, user B still only had read acces. Madness.
Anywho, I finally found that the above (whacky as it is) appears to
be "posixly on purpose" and explained in acl(5):
If the ACL has an ACL_MASK entry, the group permissions correspond
to the permissions of the ACL_MASK entry.
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
The variable result is being initialized with a value that is never read
and it is being updated later with a new value. The initialization is
redundant and can be removed.
Addresses-Coverity: ("Unused value")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
When merging name events, fsids of the two involved events have to
match. Otherwise we could merge events from two different filesystems
and thus effectively loose the second event.
Backporting note: Although the commit cacfb956d4 introducing this bug
was merged for 5.7, the relevant code didn't get used in the end until
7e8283af6e ("fanotify: report parent fid + name + child fid") which
will be merged with this patch. So there's no need for backporting this.
Fixes: cacfb956d4 ("fanotify: record name info for FAN_DIR_MODIFY event")
Reported-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
The method handle_event() grew a lot of complexity due to the design of
fanotify and merging of ignore masks.
Most backends do not care about this complex functionality, so we can hide
this complexity from them.
Introduce a method handle_inode_event() that serves those backends and
passes a single inode mark and less arguments.
This change converts all backends except fanotify and inotify to use the
simplified handle_inode_event() method. In pricipal, inotify could have
also used the new method, but that would require passing more arguments
on the simple helper (data, data_type, cookie), so we leave it with the
handle_event() method.
Link: https://lore.kernel.org/r/20200722125849.17418-9-amir73il@gmail.com
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Add support for FAN_REPORT_FID | FAN_REPORT_DIR_FID.
Internally, it is implemented as a private case of reporting both
parent and child fids and name, the parent and child fids are recorded
in a variable length fanotify_name_event, but there is no name.
It should be noted that directory modification events are recorded
in fixed size fanotify_fid_event when not reporting name, just like
with group flags FAN_REPORT_FID.
Link: https://lore.kernel.org/r/20200716084230.30611-23-amir73il@gmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
For a group with fanotify_init() flag FAN_REPORT_DFID_NAME, the parent
fid and name are reported for events on non-directory objects with an
info record of type FAN_EVENT_INFO_TYPE_DFID_NAME.
If the group also has the init flag FAN_REPORT_FID, the child fid
is also reported with another info record that follows the first info
record. The second info record is the same info record that would have
been reported to a group with only FAN_REPORT_FID flag.
When the child fid needs to be recorded, the variable size struct
fanotify_name_event is preallocated with enough space to store the
child fh between the dir fh and the name.
Link: https://lore.kernel.org/r/20200716084230.30611-22-amir73il@gmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Introduce a new fanotify_init() flag FAN_REPORT_NAME. It requires the
flag FAN_REPORT_DIR_FID and there is a constant for setting both flags
named FAN_REPORT_DFID_NAME.
For a group with flag FAN_REPORT_NAME, the parent fid and name are
reported for directory entry modification events (create/detete/move)
and for events on non-directory objects.
Events on directories themselves are reported with their own fid and
"." as the name.
The parent fid and name are reported with an info record of type
FAN_EVENT_INFO_TYPE_DFID_NAME, similar to the way that parent fid is
reported with into type FAN_EVENT_INFO_TYPE_DFID, but with an appended
null terminated name string.
Link: https://lore.kernel.org/r/20200716084230.30611-21-amir73il@gmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
In a group with flag FAN_REPORT_DIR_FID, when adding an inode mark with
FAN_EVENT_ON_CHILD, events on non-directory children are reported with
the fid of the parent.
When adding a filesystem or mount mark or mark on a non-dir inode, we
want to report events that are "possible on child" (e.g. open/close)
also with fid of the parent, as if the victim inode's parent is
interested in events "on child".
Some events, currently only FAN_MOVE_SELF, should be reported to a
sb/mount/non-dir mark with parent fid even though they are not
reported to a watching parent.
To get the desired behavior we set the flag FAN_EVENT_ON_CHILD on
all the sb/mount/non-dir mark masks in a group with FAN_REPORT_DIR_FID.
Link: https://lore.kernel.org/r/20200716084230.30611-20-amir73il@gmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
For now, the flag is mutually exclusive with FAN_REPORT_FID.
Events include a single info record of type FAN_EVENT_INFO_TYPE_DFID
with a directory file handle.
For now, events are only reported for:
- Directory modification events
- Events on children of a watching directory
- Events on directory objects
Soon, we will add support for reporting the parent directory fid
for events on non-directories with filesystem/mount mark and
support for reporting both parent directory fid and child fid.
Link: https://lore.kernel.org/r/20200716084230.30611-19-amir73il@gmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Similar to events "on child" to watching directory, send event
with parent/name info if sb/mount/non-dir marks are interested in
parent/name info.
The FS_EVENT_ON_CHILD flag can be set on sb/mount/non-dir marks to specify
interest in parent/name info for events on non-directory inodes.
Events on "orphan" children (disconnected dentries) are sent without
parent/name info.
Events on directories are sent with parent/name info only if the parent
directory is watching.
After this change, even groups that do not subscribe to events on
children could get an event with mark iterator type TYPE_CHILD and
without mark iterator type TYPE_INODE if fanotify has marks on the same
objects.
dnotify and inotify event handlers can already cope with that situation.
audit does not subscribe to events that are possible on child, so won't
get to this situation. nfsd does not access the marks iterator from its
event handler at the moment, so it is not affected.
This is a bit too fragile, so we should prepare all groups to cope with
mark type TYPE_CHILD preferably using a generic helper.
Link: https://lore.kernel.org/r/20200716084230.30611-16-amir73il@gmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
FS_EVENT_ON_CHILD has currently no meaning for non-dir inode marks. In
the following patches we want to use that bit to mean that mark's
notification group cares about parent and name information. So stop
setting FS_EVENT_ON_CHILD for non-dir marks.
Link: https://lore.kernel.org/r/20200722125849.17418-3-amir73il@gmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
The arguments of fsnotify() are overloaded and mean different things
for different event types.
Replace the to_tell argument with separate arguments @dir and @inode,
because we may be sending to both dir and child. Using the @data
argument to pass the child is not enough, because dirent events pass
this argument (for audit), but we do not report to child.
Document the new fsnotify() function argumenets.
Link: https://lore.kernel.org/r/20200722125849.17418-7-amir73il@gmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Instead of calling fsnotify() twice, once with parent inode and once
with child inode, if event should be sent to parent inode, send it
with both parent and child inodes marks in object type iterator and call
the backend handle_event() callback only once.
The parent inode is assigned to the standard "inode" iterator type and
the child inode is assigned to the special "child" iterator type.
In that case, the bit FS_EVENT_ON_CHILD will be set in the event mask,
the dir argument to handle_event will be the parent inode, the file_name
argument to handle_event is non NULL and refers to the name of the child
and the child inode can be accessed with fsnotify_data_inode().
This will allow fanotify to make decisions based on child or parent's
ignored mask. For example, when a parent is interested in a specific
event on its children, but a specific child wishes to ignore this event,
the event will not be reported. This is not what happens with current
code, but according to man page, it is the expected behavior.
Link: https://lore.kernel.org/r/20200716084230.30611-15-amir73il@gmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
fsnotify usually calls inotify_handle_event() once for watching parent
to report event with child's name and once for watching child to report
event without child's name.
Do the same thing with a single callback instead of two callbacks when
marks iterator contains both inode and child entries.
Link: https://lore.kernel.org/r/20200716084230.30611-13-amir73il@gmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
For some events (e.g. DN_ATTRIB on sub-directory) fsnotify may call
dnotify_handle_event() once for watching parent and once again for
the watching sub-directory.
Do the same thing with a single callback instead of two callbacks when
marks iterator contains both inode and child entries.
Link: https://lore.kernel.org/r/20200716084230.30611-12-amir73il@gmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
The fanotify_fh struct has an inline buffer of size 12 which is enough
to store the most common local filesystem file handles (e.g. ext4, xfs).
For file handles that do not fit in the inline buffer (e.g. btrfs), an
external buffer is allocated to store the file handle.
When allocating a variable size fanotify_name_event, there is no point
in allocating also an external fh buffer when file handle does not fit
in the inline buffer.
Check required size for encoding fh, preallocate an event buffer
sufficient to contain both file handle and name and store the name after
the file handle.
At this time, when not reporting name in event, we still allocate
the fixed size fanotify_fid_event and an external buffer for large
file handles, but fanotify_alloc_name_event() has already been prepared
to accept a NULL file_name.
Link: https://lore.kernel.org/r/20200716084230.30611-11-amir73il@gmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
An fanotify event name is always recorded relative to a dir fh.
Encapsulate the name_len member of fanotify_name_event in a new struct
fanotify_info, which describes the parceling of the variable size
buffer of an fanotify_name_event.
The dir_fh member of fanotify_name_event is renamed to _dir_fh and is not
accessed directly, but via the fanotify_info_dir_fh() accessor.
Although the dir_fh len information is already available in struct
fanotify_fh, we store it also in dif_fh_totlen member of fanotify_info,
including the size of fanotify_fh header, so we know the offset of the
name in the buffer without looking inside the dir_fh.
We also add a file_fh_totlen member to allow packing another file handle
in the variable size buffer after the dir_fh and before the name.
We are going to use that space to store the child fid.
Link: https://lore.kernel.org/r/20200716084230.30611-10-amir73il@gmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Up to now, fanotify allowed to set the FAN_EVENT_ON_CHILD flag on
sb/mount marks and non-directory inode mask, but the flag was ignored.
Mask out the flag if it is provided by user on sb/mount/non-dir marks
and define it as an implicit flag that cannot be removed by user.
This flag is going to be used internally to request for events with
parent and name info.
Link: https://lore.kernel.org/r/20200716084230.30611-8-amir73il@gmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
So far, all flags that can be set in an fanotify mark mask can be set
explicitly by a call to fanotify_mark(2).
Prepare for defining implicit event flags that cannot be set by user with
fanotify_mark(2), similar to how inotify/dnotify implicitly set the
FS_EVENT_ON_CHILD flag.
Implicit event flags cannot be removed by user and mark gets destroyed
when only implicit event flags remain in the mask.
Link: https://lore.kernel.org/r/20200716084230.30611-7-amir73il@gmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
The special event flags (FAN_ONDIR, FAN_EVENT_ON_CHILD) never had
any meaning in ignored mask. Mask them out explicitly.
Link: https://lore.kernel.org/r/20200716084230.30611-6-amir73il@gmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
As preparation for new flags that report fids, define a bit set
of flags for a group reporting fids, currently containing the
only bit FAN_REPORT_FID.
Link: https://lore.kernel.org/r/20200716084230.30611-5-amir73il@gmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
In fanotify_encode_fh(), both cases of NULL inode and failure to encode
ended up with fh type FILEID_INVALID.
Distiguish the case of NULL inode, by setting fh type to FILEID_ROOT.
This is just a semantic difference at this point.
Remove stale comment and unneeded check from fid event compare helpers.
Link: https://lore.kernel.org/r/20200716084230.30611-4-amir73il@gmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
An event on directory should never be merged with an event on
non-directory regardless of the event struct type.
This change has no visible effect, because currently, with struct
fanotify_path_event, the relevant events will not be merged because
event path of dir will be different than event path of non-dir.
Link: https://lore.kernel.org/r/20200716084230.30611-3-amir73il@gmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
In fanotify_group_event_mask() there is logic in place to make sure we
are not going to handle an event with no type and just FAN_ONDIR flag.
Generalize this logic to any FANOTIFY_EVENT_FLAGS.
There is only one more flag in this group at the moment -
FAN_EVENT_ON_CHILD. We never report it to user, but we do pass it in to
fanotify_alloc_event() when group is reporting fid as indication that
event happened on child. We will have use for this indication later on.
Link: https://lore.kernel.org/r/20200716084230.30611-2-amir73il@gmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
It was never enabled in uapi and its functionality is about to be
superseded by events FAN_CREATE, FAN_DELETE, FAN_MOVE with group
flag FAN_REPORT_NAME.
Keep a place holder variable name_event instead of removing the
name recording code since it will be used by the new events.
Link: https://lore.kernel.org/r/20200708111156.24659-17-amir73il@gmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
similar to how elf coredump is working on architectures that
have regsets, and all architectures with elf-fdpic support *do*
have that.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
the only reason to have it open-coded for the first (dumper) thread is
that coredump has a couple of process-wide notes stuck right after the
first (NT_PRSTATUS) note of the first thread. But we don't need to
make the data collection side irregular for the first thread to handle
that - it's only the logics ordering the calls of writenote() that
needs to take care of that.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
all uses are conditional upon ELF_CORE_COPY_XFPREGS, which has not
been defined on any architecture since 2010
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The only architecture where we might end up using both is arm,
and there we definitely don't want fdpic-related fields in
elf_prstatus - coredump layout of ELF binaries should not
depend upon having the kernel built with the support of ELF_FDPIC
ones. Just move the fdpic-modified variant into binfmt_elf_fdpic.c
(and call it elf_prstatus_fdpic there)
[name stolen from nico]
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Two new helpers: given a process and regset, dump into a buffer.
regset_get() takes a buffer and size, regset_get_alloc() takes size
and allocates a buffer.
Return value in both cases is the amount of data actually dumped in
case of success or -E... on error.
In both cases the size is capped by regset->n * regset->size, so
->get() is called with offset 0 and size no more than what regset
expects.
binfmt_elf.c callers of ->get() are switched to using those; the other
caller (copy_regset_to_user()) will need some preparations to switch.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The 'inode' argument to handle_event(), sometimes referred to as
'to_tell' is somewhat obsolete.
It is a remnant from the times when a group could only have an inode mark
associated with an event.
We now pass an iter_info array to the callback, with all marks associated
with an event.
Most backends ignore this argument, with two exceptions:
1. dnotify uses it for sanity check that event is on directory
2. fanotify uses it to report fid of directory on directory entry
modification events
Remove the 'inode' argument and add a 'dir' argument.
The callback function signature is deliberately changed, because
the meaning of the argument has changed and the arguments have
been documented.
The 'dir' argument is set to when 'file_name' is specified and it is
referring to the directory that the 'file_name' entry belongs to.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
When removing an extent map at try_release_extent_mapping(), called through
the page release callback (btrfs_releasepage()), we always set the full
sync flag on the inode, which forces the next fsync to use a slower code
path.
This hurts performance for workloads that dirty an amount of data that
exceeds or is very close to the system's RAM memory and do frequent fsync
operations (like database servers can for example). In particular if there
are concurrent fsyncs against different files, by falling back to a full
fsync we do a lot more checksum lookups in the checksums btree, as we do
it for all the extents created in the current transaction, instead of only
the new ones since the last fsync. These checksums lookups not only take
some time but, more importantly, they also cause contention on the
checksums btree locks due to the concurrency with checksum insertions in
the btree by ordered extents from other inodes.
We actually don't need to set the full sync flag on the inode, because we
only remove extent maps that are in the list of modified extents if they
were created in a past transaction, in which case an fsync skips them as
it's pointless to log them. So stop setting the full fsync flag on the
inode whenever we remove an extent map.
This patch is part of a patchset that consists of 3 patches, which have
the following subjects:
1/3 btrfs: fix race between page release and a fast fsync
2/3 btrfs: release old extent maps during page release
3/3 btrfs: do not set the full sync flag on the inode during page release
Performance tests were ran against a branch (misc-next) containing the
whole patchset. The test exercises a workload where there are multiple
processes writing to files and fsyncing them (each writing and fsyncing
its own file), and in total the amount of data dirtied ranges from 2x to
4x the system's RAM memory (16GiB), so that the page release callback is
invoked frequently.
The following script, using fio, was used to perform the tests:
$ cat test-fsync.sh
#!/bin/bash
DEV=/dev/sdk
MNT=/mnt/sdk
MOUNT_OPTIONS="-o ssd"
MKFS_OPTIONS="-d single -m single"
if [ $# -ne 3 ]; then
echo "Use $0 NUM_JOBS FILE_SIZE FSYNC_FREQ"
exit 1
fi
NUM_JOBS=$1
FILE_SIZE=$2
FSYNC_FREQ=$3
cat <<EOF > /tmp/fio-job.ini
[writers]
rw=write
fsync=$FSYNC_FREQ
fallocate=none
group_reporting=1
direct=0
bs=64k
ioengine=sync
size=$FILE_SIZE
directory=$MNT
numjobs=$NUM_JOBS
thread
EOF
echo "Using config:"
echo
cat /tmp/fio-job.ini
echo
mkfs.btrfs -f $MKFS_OPTIONS $DEV &> /dev/null
mount $MOUNT_OPTIONS $DEV $MNT
fio /tmp/fio-job.ini
umount $MNT
The tests were performed for different numbers of jobs, file sizes and
fsync frequency. A qemu VM using kvm was used, with 8 cores (the host has
12 cores, with cpu governance set to performance mode on all cores), 16GiB
of ram (the host has 64GiB) and using a NVMe device directly (without an
intermediary filesystem in the host). While running the tests, the host
was not used for anything else, to avoid disturbing the tests.
The obtained results were the following, and the last line printed by
fio is pasted (includes aggregated throughput and test run time).
*****************************************************
**** 1 job, 32GiB file, fsync frequency 1 ****
*****************************************************
Before patchset:
WRITE: bw=29.1MiB/s (30.5MB/s), 29.1MiB/s-29.1MiB/s (30.5MB/s-30.5MB/s), io=32.0GiB (34.4GB), run=1127557-1127557msec
After patchset:
WRITE: bw=29.3MiB/s (30.7MB/s), 29.3MiB/s-29.3MiB/s (30.7MB/s-30.7MB/s), io=32.0GiB (34.4GB), run=1119042-1119042msec
(+0.7% throughput, -0.8% run time)
*****************************************************
**** 2 jobs, 16GiB files, fsync frequency 1 ****
*****************************************************
Before patchset:
WRITE: bw=33.5MiB/s (35.1MB/s), 33.5MiB/s-33.5MiB/s (35.1MB/s-35.1MB/s), io=32.0GiB (34.4GB), run=979000-979000msec
After patchset:
WRITE: bw=39.9MiB/s (41.8MB/s), 39.9MiB/s-39.9MiB/s (41.8MB/s-41.8MB/s), io=32.0GiB (34.4GB), run=821283-821283msec
(+19.1% throughput, -16.1% runtime)
*****************************************************
**** 4 jobs, 8GiB files, fsync frequency 1 ****
*****************************************************
Before patchset:
WRITE: bw=52.1MiB/s (54.6MB/s), 52.1MiB/s-52.1MiB/s (54.6MB/s-54.6MB/s), io=32.0GiB (34.4GB), run=629130-629130msec
After patchset:
WRITE: bw=71.8MiB/s (75.3MB/s), 71.8MiB/s-71.8MiB/s (75.3MB/s-75.3MB/s), io=32.0GiB (34.4GB), run=456357-456357msec
(+37.8% throughput, -27.5% runtime)
*****************************************************
**** 8 jobs, 4GiB files, fsync frequency 1 ****
*****************************************************
Before patchset:
WRITE: bw=76.1MiB/s (79.8MB/s), 76.1MiB/s-76.1MiB/s (79.8MB/s-79.8MB/s), io=32.0GiB (34.4GB), run=430708-430708msec
After patchset:
WRITE: bw=133MiB/s (140MB/s), 133MiB/s-133MiB/s (140MB/s-140MB/s), io=32.0GiB (34.4GB), run=245458-245458msec
(+74.7% throughput, -43.0% run time)
*****************************************************
**** 16 jobs, 2GiB files, fsync frequency 1 ****
*****************************************************
Before patchset:
WRITE: bw=74.7MiB/s (78.3MB/s), 74.7MiB/s-74.7MiB/s (78.3MB/s-78.3MB/s), io=32.0GiB (34.4GB), run=438625-438625msec
After patchset:
WRITE: bw=184MiB/s (193MB/s), 184MiB/s-184MiB/s (193MB/s-193MB/s), io=32.0GiB (34.4GB), run=177864-177864msec
(+146.3% throughput, -59.5% run time)
*****************************************************
**** 32 jobs, 2GiB files, fsync frequency 1 ****
*****************************************************
Before patchset:
WRITE: bw=72.6MiB/s (76.1MB/s), 72.6MiB/s-72.6MiB/s (76.1MB/s-76.1MB/s), io=64.0GiB (68.7GB), run=902615-902615msec
After patchset:
WRITE: bw=227MiB/s (238MB/s), 227MiB/s-227MiB/s (238MB/s-238MB/s), io=64.0GiB (68.7GB), run=288936-288936msec
(+212.7% throughput, -68.0% run time)
*****************************************************
**** 64 jobs, 1GiB files, fsync frequency 1 ****
*****************************************************
Before patchset:
WRITE: bw=98.8MiB/s (104MB/s), 98.8MiB/s-98.8MiB/s (104MB/s-104MB/s), io=64.0GiB (68.7GB), run=663126-663126msec
After patchset:
WRITE: bw=294MiB/s (308MB/s), 294MiB/s-294MiB/s (308MB/s-308MB/s), io=64.0GiB (68.7GB), run=222940-222940msec
(+197.6% throughput, -66.4% run time)
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When removing an extent map at try_release_extent_mapping(), called through
the page release callback (btrfs_releasepage()), we never release an extent
map that is in the list of modified extents. This is to prevent races with
a concurrent fsync using the fast path, which could lead to not logging an
extent created in the current transaction.
However we can safely remove an extent map created in a past transaction
that is still in the list of modified extents (because no one fsynced yet
the inode after that transaction got commited), because such extents are
skipped during an fsync as it is pointless to log them. This change does
that.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When releasing an extent map, done through the page release callback, we
can race with an ongoing fast fsync and cause the fsync to miss a new
extent and not log it. The steps for this to happen are the following:
1) A page is dirtied for some inode I;
2) Writeback for that page is triggered by a path other than fsync, for
example by the system due to memory pressure;
3) When the ordered extent for the extent (a single 4K page) finishes,
we unpin the corresponding extent map and set its generation to N,
the current transaction's generation;
4) The btrfs_releasepage() callback is invoked by the system due to
memory pressure for that no longer dirty page of inode I;
5) At the same time, some task calls fsync on inode I, joins transaction
N, and at btrfs_log_inode() it sees that the inode does not have the
full sync flag set, so we proceed with a fast fsync. But before we get
into btrfs_log_changed_extents() and lock the inode's extent map tree:
6) Through btrfs_releasepage() we end up at try_release_extent_mapping()
and we remove the extent map for the new 4Kb extent, because it is
neither pinned anymore nor locked. By calling remove_extent_mapping(),
we remove the extent map from the list of modified extents, since the
extent map does not have the logging flag set. We unlock the inode's
extent map tree;
7) The task doing the fast fsync now enters btrfs_log_changed_extents(),
locks the inode's extent map tree and iterates its list of modified
extents, which no longer has the 4Kb extent in it, so it does not log
the extent;
8) The fsync finishes;
9) Before transaction N is committed, a power failure happens. After
replaying the log, the 4K extent of inode I will be missing, since
it was not logged due to the race with try_release_extent_mapping().
So fix this by teaching try_release_extent_mapping() to not remove an
extent map if it's still in the list of modified extents.
Fixes: ff44c6e36d ("Btrfs: do not hold the write_lock on the extent tree while logging")
CC: stable@vger.kernel.org # 5.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When we're (re)mounting a btrfs filesystem we set the
BTRFS_FS_STATE_REMOUNTING state in fs_info to serialize against async
reclaim or defrags.
This flag is set in btrfs_remount_prepare() called by btrfs_remount().
As btrfs_remount_prepare() does nothing but setting this flag and
doesn't have a second caller, we can just open-code the flag setting in
btrfs_remount().
Similarly do for so clearing of the flag by moving it out of
btrfs_remount_cleanup() into btrfs_remount() to be symmetrical.
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Previously we depended on some weird behavior in our chunk allocator to
force the allocation of new stripes, so by the time we got to doing the
reduce we would usually already have a chunk with the proper target.
However that behavior causes other problems and needs to be removed.
First however we need to remove this check to only restripe if we
already have those available profiles, because if we're allocating our
first chunk it obviously will not be available. Simply use the target
as specified, and if that fails it'll be because we're out of space.
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs/061 has been failing consistently for me recently with a
transaction abort. We run out of space in the system chunk array, which
means we've allocated way too many system chunks than we need.
Chris added this a long time ago for balance as a poor mans restriping.
If you had a single disk and then added another disk and then did a
balance, update_block_group_flags would then figure out which RAID level
you needed.
Fast forward to today and we have restriping behavior, so we can
explicitly tell the fs that we're trying to change the raid level. This
is accomplished through the normal get_alloc_profile path.
Furthermore this code actually causes btrfs/061 to fail, because we do
things like mkfs -m dup -d single with multiple devices. This trips
this check
alloc_flags = update_block_group_flags(fs_info, cache->flags);
if (alloc_flags != cache->flags) {
ret = btrfs_chunk_alloc(trans, alloc_flags, CHUNK_ALLOC_FORCE);
in btrfs_inc_block_group_ro. Because we're balancing and scrubbing, but
not actually restriping, we keep forcing chunk allocation of RAID1
chunks. This eventually causes us to run out of system space and the
file system aborts and flips read only.
We don't need this poor mans restriping any more, simply use the normal
get_alloc_profile helper, which will get the correct alloc_flags and
thus make the right decision for chunk allocation. This keeps us from
allocating a billion system chunks and falling over.
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We are currently getting this lockdep splat in btrfs/161:
======================================================
WARNING: possible circular locking dependency detected
5.8.0-rc5+ #20 Tainted: G E
------------------------------------------------------
mount/678048 is trying to acquire lock:
ffff9b769f15b6e0 (&fs_devs->device_list_mutex){+.+.}-{3:3}, at: clone_fs_devices+0x4d/0x170 [btrfs]
but task is already holding lock:
ffff9b76abdb08d0 (&fs_info->chunk_mutex){+.+.}-{3:3}, at: btrfs_read_chunk_tree+0x6a/0x800 [btrfs]
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (&fs_info->chunk_mutex){+.+.}-{3:3}:
__mutex_lock+0x8b/0x8f0
btrfs_init_new_device+0x2d2/0x1240 [btrfs]
btrfs_ioctl+0x1de/0x2d20 [btrfs]
ksys_ioctl+0x87/0xc0
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x52/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #0 (&fs_devs->device_list_mutex){+.+.}-{3:3}:
__lock_acquire+0x1240/0x2460
lock_acquire+0xab/0x360
__mutex_lock+0x8b/0x8f0
clone_fs_devices+0x4d/0x170 [btrfs]
btrfs_read_chunk_tree+0x330/0x800 [btrfs]
open_ctree+0xb7c/0x18ce [btrfs]
btrfs_mount_root.cold+0x13/0xfa [btrfs]
legacy_get_tree+0x30/0x50
vfs_get_tree+0x28/0xc0
fc_mount+0xe/0x40
vfs_kern_mount.part.0+0x71/0x90
btrfs_mount+0x13b/0x3e0 [btrfs]
legacy_get_tree+0x30/0x50
vfs_get_tree+0x28/0xc0
do_mount+0x7de/0xb30
__x64_sys_mount+0x8e/0xd0
do_syscall_64+0x52/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&fs_info->chunk_mutex);
lock(&fs_devs->device_list_mutex);
lock(&fs_info->chunk_mutex);
lock(&fs_devs->device_list_mutex);
*** DEADLOCK ***
3 locks held by mount/678048:
#0: ffff9b75ff5fb0e0 (&type->s_umount_key#63/1){+.+.}-{3:3}, at: alloc_super+0xb5/0x380
#1: ffffffffc0c2fbc8 (uuid_mutex){+.+.}-{3:3}, at: btrfs_read_chunk_tree+0x54/0x800 [btrfs]
#2: ffff9b76abdb08d0 (&fs_info->chunk_mutex){+.+.}-{3:3}, at: btrfs_read_chunk_tree+0x6a/0x800 [btrfs]
stack backtrace:
CPU: 2 PID: 678048 Comm: mount Tainted: G E 5.8.0-rc5+ #20
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./890FX Deluxe5, BIOS P1.40 05/03/2011
Call Trace:
dump_stack+0x96/0xd0
check_noncircular+0x162/0x180
__lock_acquire+0x1240/0x2460
? asm_sysvec_apic_timer_interrupt+0x12/0x20
lock_acquire+0xab/0x360
? clone_fs_devices+0x4d/0x170 [btrfs]
__mutex_lock+0x8b/0x8f0
? clone_fs_devices+0x4d/0x170 [btrfs]
? rcu_read_lock_sched_held+0x52/0x60
? cpumask_next+0x16/0x20
? module_assert_mutex_or_preempt+0x14/0x40
? __module_address+0x28/0xf0
? clone_fs_devices+0x4d/0x170 [btrfs]
? static_obj+0x4f/0x60
? lockdep_init_map_waits+0x43/0x200
? clone_fs_devices+0x4d/0x170 [btrfs]
clone_fs_devices+0x4d/0x170 [btrfs]
btrfs_read_chunk_tree+0x330/0x800 [btrfs]
open_ctree+0xb7c/0x18ce [btrfs]
? super_setup_bdi_name+0x79/0xd0
btrfs_mount_root.cold+0x13/0xfa [btrfs]
? vfs_parse_fs_string+0x84/0xb0
? rcu_read_lock_sched_held+0x52/0x60
? kfree+0x2b5/0x310
legacy_get_tree+0x30/0x50
vfs_get_tree+0x28/0xc0
fc_mount+0xe/0x40
vfs_kern_mount.part.0+0x71/0x90
btrfs_mount+0x13b/0x3e0 [btrfs]
? cred_has_capability+0x7c/0x120
? rcu_read_lock_sched_held+0x52/0x60
? legacy_get_tree+0x30/0x50
legacy_get_tree+0x30/0x50
vfs_get_tree+0x28/0xc0
do_mount+0x7de/0xb30
? memdup_user+0x4e/0x90
__x64_sys_mount+0x8e/0xd0
do_syscall_64+0x52/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
This is because btrfs_read_chunk_tree() can come upon DEV_EXTENT's and
then read the device, which takes the device_list_mutex. The
device_list_mutex needs to be taken before the chunk_mutex, so this is a
problem. We only really need the chunk mutex around adding the chunk,
so move the mutex around read_one_chunk.
An argument could be made that we don't even need the chunk_mutex here
as it's during mount, and we are protected by various other locks.
However we already have special rules for ->device_list_mutex, and I'd
rather not have another special case for ->chunk_mutex.
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's long existed a lockdep splat because we open our bdev's under
the ->device_list_mutex at mount time, which acquires the bd_mutex.
Usually this goes unnoticed, but if you do loopback devices at all
suddenly the bd_mutex comes with a whole host of other dependencies,
which results in the splat when you mount a btrfs file system.
======================================================
WARNING: possible circular locking dependency detected
5.8.0-0.rc3.1.fc33.x86_64+debug #1 Not tainted
------------------------------------------------------
systemd-journal/509 is trying to acquire lock:
ffff970831f84db0 (&fs_info->reloc_mutex){+.+.}-{3:3}, at: btrfs_record_root_in_trans+0x44/0x70 [btrfs]
but task is already holding lock:
ffff97083144d598 (sb_pagefaults){.+.+}-{0:0}, at: btrfs_page_mkwrite+0x59/0x560 [btrfs]
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #6 (sb_pagefaults){.+.+}-{0:0}:
__sb_start_write+0x13e/0x220
btrfs_page_mkwrite+0x59/0x560 [btrfs]
do_page_mkwrite+0x4f/0x130
do_wp_page+0x3b0/0x4f0
handle_mm_fault+0xf47/0x1850
do_user_addr_fault+0x1fc/0x4b0
exc_page_fault+0x88/0x300
asm_exc_page_fault+0x1e/0x30
-> #5 (&mm->mmap_lock#2){++++}-{3:3}:
__might_fault+0x60/0x80
_copy_from_user+0x20/0xb0
get_sg_io_hdr+0x9a/0xb0
scsi_cmd_ioctl+0x1ea/0x2f0
cdrom_ioctl+0x3c/0x12b4
sr_block_ioctl+0xa4/0xd0
block_ioctl+0x3f/0x50
ksys_ioctl+0x82/0xc0
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x52/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #4 (&cd->lock){+.+.}-{3:3}:
__mutex_lock+0x7b/0x820
sr_block_open+0xa2/0x180
__blkdev_get+0xdd/0x550
blkdev_get+0x38/0x150
do_dentry_open+0x16b/0x3e0
path_openat+0x3c9/0xa00
do_filp_open+0x75/0x100
do_sys_openat2+0x8a/0x140
__x64_sys_openat+0x46/0x70
do_syscall_64+0x52/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #3 (&bdev->bd_mutex){+.+.}-{3:3}:
__mutex_lock+0x7b/0x820
__blkdev_get+0x6a/0x550
blkdev_get+0x85/0x150
blkdev_get_by_path+0x2c/0x70
btrfs_get_bdev_and_sb+0x1b/0xb0 [btrfs]
open_fs_devices+0x88/0x240 [btrfs]
btrfs_open_devices+0x92/0xa0 [btrfs]
btrfs_mount_root+0x250/0x490 [btrfs]
legacy_get_tree+0x30/0x50
vfs_get_tree+0x28/0xc0
vfs_kern_mount.part.0+0x71/0xb0
btrfs_mount+0x119/0x380 [btrfs]
legacy_get_tree+0x30/0x50
vfs_get_tree+0x28/0xc0
do_mount+0x8c6/0xca0
__x64_sys_mount+0x8e/0xd0
do_syscall_64+0x52/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #2 (&fs_devs->device_list_mutex){+.+.}-{3:3}:
__mutex_lock+0x7b/0x820
btrfs_run_dev_stats+0x36/0x420 [btrfs]
commit_cowonly_roots+0x91/0x2d0 [btrfs]
btrfs_commit_transaction+0x4e6/0x9f0 [btrfs]
btrfs_sync_file+0x38a/0x480 [btrfs]
__x64_sys_fdatasync+0x47/0x80
do_syscall_64+0x52/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #1 (&fs_info->tree_log_mutex){+.+.}-{3:3}:
__mutex_lock+0x7b/0x820
btrfs_commit_transaction+0x48e/0x9f0 [btrfs]
btrfs_sync_file+0x38a/0x480 [btrfs]
__x64_sys_fdatasync+0x47/0x80
do_syscall_64+0x52/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #0 (&fs_info->reloc_mutex){+.+.}-{3:3}:
__lock_acquire+0x1241/0x20c0
lock_acquire+0xb0/0x400
__mutex_lock+0x7b/0x820
btrfs_record_root_in_trans+0x44/0x70 [btrfs]
start_transaction+0xd2/0x500 [btrfs]
btrfs_dirty_inode+0x44/0xd0 [btrfs]
file_update_time+0xc6/0x120
btrfs_page_mkwrite+0xda/0x560 [btrfs]
do_page_mkwrite+0x4f/0x130
do_wp_page+0x3b0/0x4f0
handle_mm_fault+0xf47/0x1850
do_user_addr_fault+0x1fc/0x4b0
exc_page_fault+0x88/0x300
asm_exc_page_fault+0x1e/0x30
other info that might help us debug this:
Chain exists of:
&fs_info->reloc_mutex --> &mm->mmap_lock#2 --> sb_pagefaults
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(sb_pagefaults);
lock(&mm->mmap_lock#2);
lock(sb_pagefaults);
lock(&fs_info->reloc_mutex);
*** DEADLOCK ***
3 locks held by systemd-journal/509:
#0: ffff97083bdec8b8 (&mm->mmap_lock#2){++++}-{3:3}, at: do_user_addr_fault+0x12e/0x4b0
#1: ffff97083144d598 (sb_pagefaults){.+.+}-{0:0}, at: btrfs_page_mkwrite+0x59/0x560 [btrfs]
#2: ffff97083144d6a8 (sb_internal){.+.+}-{0:0}, at: start_transaction+0x3f8/0x500 [btrfs]
stack backtrace:
CPU: 0 PID: 509 Comm: systemd-journal Not tainted 5.8.0-0.rc3.1.fc33.x86_64+debug #1
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
Call Trace:
dump_stack+0x92/0xc8
check_noncircular+0x134/0x150
__lock_acquire+0x1241/0x20c0
lock_acquire+0xb0/0x400
? btrfs_record_root_in_trans+0x44/0x70 [btrfs]
? lock_acquire+0xb0/0x400
? btrfs_record_root_in_trans+0x44/0x70 [btrfs]
__mutex_lock+0x7b/0x820
? btrfs_record_root_in_trans+0x44/0x70 [btrfs]
? kvm_sched_clock_read+0x14/0x30
? sched_clock+0x5/0x10
? sched_clock_cpu+0xc/0xb0
btrfs_record_root_in_trans+0x44/0x70 [btrfs]
start_transaction+0xd2/0x500 [btrfs]
btrfs_dirty_inode+0x44/0xd0 [btrfs]
file_update_time+0xc6/0x120
btrfs_page_mkwrite+0xda/0x560 [btrfs]
? sched_clock+0x5/0x10
do_page_mkwrite+0x4f/0x130
do_wp_page+0x3b0/0x4f0
handle_mm_fault+0xf47/0x1850
do_user_addr_fault+0x1fc/0x4b0
exc_page_fault+0x88/0x300
? asm_exc_page_fault+0x8/0x30
asm_exc_page_fault+0x1e/0x30
RIP: 0033:0x7fa3972fdbfe
Code: Bad RIP value.
Fix this by not holding the ->device_list_mutex at this point. The
device_list_mutex exists to protect us from modifying the device list
while the file system is running.
However it can also be modified by doing a scan on a device. But this
action is specifically protected by the uuid_mutex, which we are holding
here. We cannot race with opening at this point because we have the
->s_mount lock held during the mount. Not having the
->device_list_mutex here is perfectly safe as we're not going to change
the devices at this point.
CC: stable@vger.kernel.org # 4.19+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add some comments ]
Signed-off-by: David Sterba <dsterba@suse.com>
Eric reported seeing this message while running generic/475
BTRFS: error (device dm-3) in btrfs_sync_log:3084: errno=-117 Filesystem corrupted
Full stack trace:
BTRFS: error (device dm-0) in btrfs_commit_transaction:2323: errno=-5 IO failure (Error while writing out transaction)
BTRFS info (device dm-0): forced readonly
BTRFS warning (device dm-0): Skipping commit of aborted transaction.
------------[ cut here ]------------
BTRFS: error (device dm-0) in cleanup_transaction:1894: errno=-5 IO failure
BTRFS: Transaction aborted (error -117)
BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c6480 len 4096 err no 10
BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c6488 len 4096 err no 10
BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c6490 len 4096 err no 10
BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c6498 len 4096 err no 10
BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c64a0 len 4096 err no 10
BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c64a8 len 4096 err no 10
BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c64b0 len 4096 err no 10
BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c64b8 len 4096 err no 10
BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c64c0 len 4096 err no 10
BTRFS warning (device dm-0): direct IO failed ino 3572 rw 0,0 sector 0x1b85e8 len 4096 err no 10
BTRFS warning (device dm-0): direct IO failed ino 3572 rw 0,0 sector 0x1b85f0 len 4096 err no 10
WARNING: CPU: 3 PID: 23985 at fs/btrfs/tree-log.c:3084 btrfs_sync_log+0xbc8/0xd60 [btrfs]
BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d4288 len 4096 err no 10
BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d4290 len 4096 err no 10
BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d4298 len 4096 err no 10
BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42a0 len 4096 err no 10
BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42a8 len 4096 err no 10
BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42b0 len 4096 err no 10
BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42b8 len 4096 err no 10
BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42c0 len 4096 err no 10
BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42c8 len 4096 err no 10
BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42d0 len 4096 err no 10
CPU: 3 PID: 23985 Comm: fsstress Tainted: G W L 5.8.0-rc4-default+ #1181
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
RIP: 0010:btrfs_sync_log+0xbc8/0xd60 [btrfs]
RSP: 0018:ffff909a44d17bd0 EFLAGS: 00010286
RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000001
RDX: ffff8f3be41cb940 RSI: ffffffffb0108d2b RDI: ffffffffb0108ff7
RBP: ffff909a44d17e70 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000037988 R12: ffff8f3bd20e4000
R13: ffff8f3bd20e4428 R14: 00000000ffffff8b R15: ffff909a44d17c70
FS: 00007f6a6ed3fb80(0000) GS:ffff8f3c3dc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f6a6ed3e000 CR3: 00000000525c0003 CR4: 0000000000160ee0
Call Trace:
? finish_wait+0x90/0x90
? __mutex_unlock_slowpath+0x45/0x2a0
? lock_acquire+0xa3/0x440
? lockref_put_or_lock+0x9/0x30
? dput+0x20/0x4a0
? dput+0x20/0x4a0
? do_raw_spin_unlock+0x4b/0xc0
? _raw_spin_unlock+0x1f/0x30
btrfs_sync_file+0x335/0x490 [btrfs]
do_fsync+0x38/0x70
__x64_sys_fsync+0x10/0x20
do_syscall_64+0x50/0xe0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f6a6ef1b6e3
Code: Bad RIP value.
RSP: 002b:00007ffd01e20038 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
RAX: ffffffffffffffda RBX: 000000000007a120 RCX: 00007f6a6ef1b6e3
RDX: 00007ffd01e1ffa0 RSI: 00007ffd01e1ffa0 RDI: 0000000000000003
RBP: 0000000000000003 R08: 0000000000000001 R09: 00007ffd01e2004c
R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000009f
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
irq event stamp: 0
hardirqs last enabled at (0): [<0000000000000000>] 0x0
hardirqs last disabled at (0): [<ffffffffb007fe0b>] copy_process+0x67b/0x1b00
softirqs last enabled at (0): [<ffffffffb007fe0b>] copy_process+0x67b/0x1b00
softirqs last disabled at (0): [<0000000000000000>] 0x0
---[ end trace af146e0e38433456 ]---
BTRFS: error (device dm-0) in btrfs_sync_log:3084: errno=-117 Filesystem corrupted
This ret came from btrfs_write_marked_extents(). If we get an aborted
transaction via EIO before, we'll see it in btree_write_cache_pages()
and return EUCLEAN, which gets printed as "Filesystem corrupted".
Except we shouldn't be returning EUCLEAN here, we need to be returning
EROFS because EUCLEAN is reserved for actual corruption, not IO errors.
We are inconsistent about our handling of BTRFS_FS_STATE_ERROR
elsewhere, but we want to use EROFS for this particular case. The
original transaction abort has the real error code for why we ended up
with an aborted transaction, all subsequent actions just need to return
EROFS because they may not have a trans handle and have no idea about
the original cause of the abort.
After patch "btrfs: don't WARN if we abort a transaction with EROFS" the
stacktrace will not be dumped either.
Reported-by: Eric Sandeen <esandeen@redhat.com>
CC: stable@vger.kernel.org # 5.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add full test stacktrace ]
Signed-off-by: David Sterba <dsterba@suse.com>
We've had some discussions about what to do in certain scenarios for
error codes, specifically EUCLEAN and EROFS. Document these near the
error handling code so its clear what their intentions are.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we got some sort of corruption via a read and call
btrfs_handle_fs_error() we'll set BTRFS_FS_STATE_ERROR on the fs and
complain. If a subsequent trans handle trips over this it'll get EROFS
and then abort. However at that point we're not aborting for the
original reason, we're aborting because we've been flipped read only.
We do not need to WARN_ON() here.
CC: stable@vger.kernel.org # 5.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The possibility of extents being shared (through clone and deduplication
operations) requires special care when logging data checksums, to avoid
having a log tree with different checksum items that cover ranges which
overlap (which resulted in missing checksums after replaying a log tree).
Such problems were fixed in the past by the following commits:
commit 40e046acbd ("Btrfs: fix missing data checksums after replaying a
log tree")
commit e289f03ea7 ("btrfs: fix corrupt log due to concurrent fsync of
inodes with shared extents")
Test case generic/588 exercises the scenario solved by the first commit
(purely sequential and deterministic) while test case generic/457 often
triggered the case fixed by the second commit (not deterministic, requires
specific timings under concurrency).
The problems were addressed by deleting, from the log tree, any existing
checksums before logging the new ones. And also by doing the deletion and
logging of the cheksums while locking the checksum range in an extent io
tree (root->log_csum_range), to deal with the case where we have concurrent
fsyncs against files with shared extents.
That however causes more contention on the leaves of a log tree where we
store checksums (and all the nodes in the paths leading to them), even
when we do not have shared extents, or all the shared extents were created
by past transactions. It also adds a bit of contention on the spin lock of
the log_csums_range extent io tree of the log root.
This change adds a 'last_reflink_trans' field to the inode to keep track
of the last transaction where a new extent was shared between inodes
(through clone and deduplication operations). It is updated for both the
source and destination inodes of reflink operations whenever a new extent
(created in the current transaction) becomes shared by the inodes. This
field is kept in memory only, not persisted in the inode item, similar
to other existing fields (last_unlink_trans, logged_trans).
When logging checksums for an extent, if the value of 'last_reflink_trans'
is smaller then the current transaction's generation/id, we skip locking
the extent range and deletion of checksums from the log tree, since we
know we do not have new shared extents. This reduces contention on the
log tree's leaves where checksums are stored.
The following script, which uses fio, was used to measure the impact of
this change:
$ cat test-fsync.sh
#!/bin/bash
DEV=/dev/sdk
MNT=/mnt/sdk
MOUNT_OPTIONS="-o ssd"
MKFS_OPTIONS="-d single -m single"
if [ $# -ne 3 ]; then
echo "Use $0 NUM_JOBS FILE_SIZE FSYNC_FREQ"
exit 1
fi
NUM_JOBS=$1
FILE_SIZE=$2
FSYNC_FREQ=$3
cat <<EOF > /tmp/fio-job.ini
[writers]
rw=write
fsync=$FSYNC_FREQ
fallocate=none
group_reporting=1
direct=0
bs=64k
ioengine=sync
size=$FILE_SIZE
directory=$MNT
numjobs=$NUM_JOBS
EOF
echo "Using config:"
echo
cat /tmp/fio-job.ini
echo
mkfs.btrfs -f $MKFS_OPTIONS $DEV
mount $MOUNT_OPTIONS $DEV $MNT
fio /tmp/fio-job.ini
umount $MNT
The tests were performed for different numbers of jobs, file sizes and
fsync frequency. A qemu VM using kvm was used, with 8 cores (the host has
12 cores, with cpu governance set to performance mode on all cores), 16GiB
of ram (the host has 64GiB) and using a NVMe device directly (without an
intermediary filesystem in the host). While running the tests, the host
was not used for anything else, to avoid disturbing the tests.
The obtained results were the following (the last line of fio's output was
pasted). Starting with 16 jobs is where a significant difference is
observable in this particular setup and hardware (differences highlighted
below). The very small differences for tests with less than 16 jobs are
possibly just noise and random.
**** 1 job, file size 1G, fsync frequency 1 ****
before this change:
WRITE: bw=23.8MiB/s (24.9MB/s), 23.8MiB/s-23.8MiB/s (24.9MB/s-24.9MB/s), io=1024MiB (1074MB), run=43075-43075msec
after this change:
WRITE: bw=24.4MiB/s (25.6MB/s), 24.4MiB/s-24.4MiB/s (25.6MB/s-25.6MB/s), io=1024MiB (1074MB), run=41938-41938msec
**** 2 jobs, file size 1G, fsync frequency 1 ****
before this change:
WRITE: bw=37.7MiB/s (39.5MB/s), 37.7MiB/s-37.7MiB/s (39.5MB/s-39.5MB/s), io=2048MiB (2147MB), run=54351-54351msec
after this change:
WRITE: bw=37.7MiB/s (39.5MB/s), 37.6MiB/s-37.6MiB/s (39.5MB/s-39.5MB/s), io=2048MiB (2147MB), run=54428-54428msec
**** 4 jobs, file size 1G, fsync frequency 1 ****
before this change:
WRITE: bw=67.5MiB/s (70.8MB/s), 67.5MiB/s-67.5MiB/s (70.8MB/s-70.8MB/s), io=4096MiB (4295MB), run=60669-60669msec
after this change:
WRITE: bw=68.6MiB/s (71.0MB/s), 68.6MiB/s-68.6MiB/s (71.0MB/s-71.0MB/s), io=4096MiB (4295MB), run=59678-59678msec
**** 8 jobs, file size 1G, fsync frequency 1 ****
before this change:
WRITE: bw=128MiB/s (134MB/s), 128MiB/s-128MiB/s (134MB/s-134MB/s), io=8192MiB (8590MB), run=64048-64048msec
after this change:
WRITE: bw=129MiB/s (135MB/s), 129MiB/s-129MiB/s (135MB/s-135MB/s), io=8192MiB (8590MB), run=63405-63405msec
**** 16 jobs, file size 1G, fsync frequency 1 ****
before this change:
WRITE: bw=78.5MiB/s (82.3MB/s), 78.5MiB/s-78.5MiB/s (82.3MB/s-82.3MB/s), io=16.0GiB (17.2GB), run=208676-208676msec
after this change:
WRITE: bw=110MiB/s (115MB/s), 110MiB/s-110MiB/s (115MB/s-115MB/s), io=16.0GiB (17.2GB), run=149295-149295msec
(+40.1% throughput, -28.5% runtime)
**** 32 jobs, file size 1G, fsync frequency 1 ****
before this change:
WRITE: bw=58.8MiB/s (61.7MB/s), 58.8MiB/s-58.8MiB/s (61.7MB/s-61.7MB/s), io=32.0GiB (34.4GB), run=557134-557134msec
after this change:
WRITE: bw=76.1MiB/s (79.8MB/s), 76.1MiB/s-76.1MiB/s (79.8MB/s-79.8MB/s), io=32.0GiB (34.4GB), run=430550-430550msec
(+29.4% throughput, -22.7% runtime)
**** 64 jobs, file size 512M, fsync frequency 1 ****
before this change:
WRITE: bw=65.8MiB/s (68.0MB/s), 65.8MiB/s-65.8MiB/s (68.0MB/s-68.0MB/s), io=32.0GiB (34.4GB), run=498055-498055msec
after this change:
WRITE: bw=85.1MiB/s (89.2MB/s), 85.1MiB/s-85.1MiB/s (89.2MB/s-89.2MB/s), io=32.0GiB (34.4GB), run=385116-385116msec
(+29.3% throughput, -22.7% runtime)
**** 128 jobs, file size 256M, fsync frequency 1 ****
before this change:
WRITE: bw=54.7MiB/s (57.3MB/s), 54.7MiB/s-54.7MiB/s (57.3MB/s-57.3MB/s), io=32.0GiB (34.4GB), run=599373-599373msec
after this change:
WRITE: bw=121MiB/s (126MB/s), 121MiB/s-121MiB/s (126MB/s-126MB/s), io=32.0GiB (34.4GB), run=271907-271907msec
(+121.2% throughput, -54.6% runtime)
**** 256 jobs, file size 256M, fsync frequency 1 ****
before this change:
WRITE: bw=69.2MiB/s (72.5MB/s), 69.2MiB/s-69.2MiB/s (72.5MB/s-72.5MB/s), io=64.0GiB (68.7GB), run=947536-947536msec
after this change:
WRITE: bw=121MiB/s (127MB/s), 121MiB/s-121MiB/s (127MB/s-127MB/s), io=64.0GiB (68.7GB), run=541916-541916msec
(+74.9% throughput, -42.8% runtime)
**** 512 jobs, file size 128M, fsync frequency 1 ****
before this change:
WRITE: bw=85.4MiB/s (89.5MB/s), 85.4MiB/s-85.4MiB/s (89.5MB/s-89.5MB/s), io=64.0GiB (68.7GB), run=767734-767734msec
after this change:
WRITE: bw=141MiB/s (147MB/s), 141MiB/s-141MiB/s (147MB/s-147MB/s), io=64.0GiB (68.7GB), run=466022-466022msec
(+65.1% throughput, -39.3% runtime)
**** 1024 jobs, file size 128M, fsync frequency 1 ****
before this change:
WRITE: bw=115MiB/s (120MB/s), 115MiB/s-115MiB/s (120MB/s-120MB/s), io=128GiB (137GB), run=1143775-1143775msec
after this change:
WRITE: bw=171MiB/s (180MB/s), 171MiB/s-171MiB/s (180MB/s-180MB/s), io=128GiB (137GB), run=764843-764843msec
(+48.7% throughput, -33.1% runtime)
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since there is not common cleanup run after the label it makes it
somewhat redundant.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This enum is the interface exposed to developers.
Although we have a detailed comment explaining the whole idea of space
flushing at the beginning of space-info.c, the exposed enum interface
doesn't have any comment.
Some corner cases, like BTRFS_RESERVE_FLUSH_ALL and
BTRFS_RESERVE_FLUSH_ALL_STEAL can be interrupted by fatal signals, are
not explained at all.
So add some simple comments for these enums as a quick reference.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since most metadata reservation calls can return -EINTR when get
interrupted by fatal signal, we need to review the all the metadata
reservation call sites.
In relocation code, the metadata reservation happens in the following
sites:
- btrfs_block_rsv_refill() in merge_reloc_root()
merge_reloc_root() is a pretty critical section, we don't want to be
interrupted by signal, so change the flush status to
BTRFS_RESERVE_FLUSH_LIMIT, so it won't get interrupted by signal.
Since such change can be ENPSPC-prone, also shrink the amount of
metadata to reserve least amount avoid deadly ENOSPC there.
- btrfs_block_rsv_refill() in reserve_metadata_space()
It calls with BTRFS_RESERVE_FLUSH_LIMIT, which won't get interrupted
by signal.
- btrfs_block_rsv_refill() in prepare_to_relocate()
- btrfs_block_rsv_add() in prepare_to_relocate()
- btrfs_block_rsv_refill() in relocate_block_group()
- btrfs_delalloc_reserve_metadata() in relocate_file_extent_cluster()
- btrfs_start_transaction() in relocate_block_group()
- btrfs_start_transaction() in create_reloc_inode()
Can be interrupted by fatal signal and we can handle it easily.
For these call sites, just catch the -EINTR value in btrfs_balance()
and count them as canceled.
CC: stable@vger.kernel.org # 5.4+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
There is a bug report about bad signal timing could lead to read-only
fs during balance:
BTRFS info (device xvdb): balance: start -d -m -s
BTRFS info (device xvdb): relocating block group 73001861120 flags metadata
BTRFS info (device xvdb): found 12236 extents, stage: move data extents
BTRFS info (device xvdb): relocating block group 71928119296 flags data
BTRFS info (device xvdb): found 3 extents, stage: move data extents
BTRFS info (device xvdb): found 3 extents, stage: update data pointers
BTRFS info (device xvdb): relocating block group 60922265600 flags metadata
BTRFS: error (device xvdb) in btrfs_drop_snapshot:5505: errno=-4 unknown
BTRFS info (device xvdb): forced readonly
BTRFS info (device xvdb): balance: ended with status: -4
[CAUSE]
The direct cause is the -EINTR from the following call chain when a
fatal signal is pending:
relocate_block_group()
|- clean_dirty_subvols()
|- btrfs_drop_snapshot()
|- btrfs_start_transaction()
|- btrfs_delayed_refs_rsv_refill()
|- btrfs_reserve_metadata_bytes()
|- __reserve_metadata_bytes()
|- wait_reserve_ticket()
|- prepare_to_wait_event();
|- ticket->error = -EINTR;
Normally this behavior is fine for most btrfs_start_transaction()
callers, as they need to catch any other error, same for the signal, and
exit ASAP.
However for balance, especially for the clean_dirty_subvols() case, we're
already doing cleanup works, getting -EINTR from btrfs_drop_snapshot()
could cause a lot of unexpected problems.
From the mentioned forced read-only report, to later balance error due
to half dropped reloc trees.
[FIX]
Fix this problem by using btrfs_join_transaction() if
btrfs_drop_snapshot() is called from relocation context.
Since btrfs_join_transaction() won't get interrupted by signal, we can
continue the cleanup.
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>3
Signed-off-by: David Sterba <dsterba@suse.com>
Although btrfs balance can be canceled with "btrfs balance cancel"
command, it's still almost muscle memory to press Ctrl-C to cancel a
long running btrfs balance.
So allow btrfs balance to check signal to determine if it should exit.
The cancellation points are in known location and we're only adding one
more reason, so this should be safe.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's no cleanup that occurs so we can simply return 0 directly.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
User Forza reported on IRC that some invalid combinations of file
attributes are accepted by chattr.
The NODATACOW and compression file flags/attributes are mutually
exclusive, but they could be set by 'chattr +c +C' on an empty file. The
nodatacow will be in effect because it's checked first in
btrfs_run_delalloc_range.
Extend the flag validation to catch the following cases:
- input flags are conflicting
- old and new flags are conflicting
- initialize the local variable with inode flags after inode ls locked
Inode attributes take precedence over mount options and are an
independent setting.
Nocompress would be a no-op with nodatacow, but we don't want to mix
any compression-related options with nodatacow.
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: David Sterba <dsterba@suse.com>
->show_devname currently shows the lowest devid in the list. As the seed
devices have the lowest devid in the sprouted filesystem, the userland
tool such as findmnt end up seeing seed device instead of the device from
the read-writable sprouted filesystem. As shown below.
mount /dev/sda /btrfs
mount: /btrfs: WARNING: device write-protected, mounted read-only.
findmnt --output SOURCE,TARGET,UUID /btrfs
SOURCE TARGET UUID
/dev/sda /btrfs 899f7027-3e46-4626-93e7-7d4c9ad19111
btrfs dev add -f /dev/sdb /btrfs
umount /btrfs
mount /dev/sdb /btrfs
findmnt --output SOURCE,TARGET,UUID /btrfs
SOURCE TARGET UUID
/dev/sda /btrfs 899f7027-3e46-4626-93e7-7d4c9ad19111
All sprouts from a single seed will show the same seed device and the
same fsid. That's confusing.
This is causing problems in our prototype as there isn't any reference
to the sprout file-system(s) which is being used for actual read and
write.
This was added in the patch which implemented the show_devname in btrfs
commit 9c5085c147 ("Btrfs: implement ->show_devname").
I tried to look for any particular reason that we need to show the seed
device, there isn't any.
So instead, do not traverse through the seed devices, just show the
lowest devid in the sprouted fsid.
After the patch:
mount /dev/sda /btrfs
mount: /btrfs: WARNING: device write-protected, mounted read-only.
findmnt --output SOURCE,TARGET,UUID /btrfs
SOURCE TARGET UUID
/dev/sda /btrfs 899f7027-3e46-4626-93e7-7d4c9ad19111
btrfs dev add -f /dev/sdb /btrfs
mount -o rw,remount /dev/sdb /btrfs
findmnt --output SOURCE,TARGET,UUID /btrfs
SOURCE TARGET UUID
/dev/sdb /btrfs 595ca0e6-b82e-46b5-b9e2-c72a6928be48
mount /dev/sda /btrfs1
mount: /btrfs1: WARNING: device write-protected, mounted read-only.
btrfs dev add -f /dev/sdc /btrfs1
findmnt --output SOURCE,TARGET,UUID /btrfs1
SOURCE TARGET UUID
/dev/sdc /btrfs1 ca1dbb7a-8446-4f95-853c-a20f3f82bdbb
cat /proc/self/mounts | grep btrfs
/dev/sdb /btrfs btrfs rw,relatime,noacl,space_cache,subvolid=5,subvol=/ 0 0
/dev/sdc /btrfs1 btrfs ro,relatime,noacl,space_cache,subvolid=5,subvol=/ 0 0
Reported-by: Martin K. Petersen <martin.petersen@oracle.com>
CC: stable@vger.kernel.org # 4.19+
Tested-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
Sometime fsstress could lead to qgroup warning for case like
generic/013:
BTRFS warning (device dm-3): qgroup 0/259 has unreleased space, type 1 rsv 81920
------------[ cut here ]------------
WARNING: CPU: 9 PID: 24535 at fs/btrfs/disk-io.c:4142 close_ctree+0x1dc/0x323 [btrfs]
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
RIP: 0010:close_ctree+0x1dc/0x323 [btrfs]
Call Trace:
btrfs_put_super+0x15/0x17 [btrfs]
generic_shutdown_super+0x72/0x110
kill_anon_super+0x18/0x30
btrfs_kill_super+0x17/0x30 [btrfs]
deactivate_locked_super+0x3b/0xa0
deactivate_super+0x40/0x50
cleanup_mnt+0x135/0x190
__cleanup_mnt+0x12/0x20
task_work_run+0x64/0xb0
__prepare_exit_to_usermode+0x1bc/0x1c0
__syscall_return_slowpath+0x47/0x230
do_syscall_64+0x64/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
---[ end trace 6c341cdf9b6cc3c1 ]---
BTRFS error (device dm-3): qgroup reserved space leaked
While that subvolume 259 is no longer in that filesystem.
[CAUSE]
Normally per-trans qgroup reserved space is freed when a transaction is
committed, in commit_fs_roots().
However for completely dropped subvolume, that subvolume is completely
gone, thus is no longer in the fs_roots_radix, and its per-trans
reserved qgroup will never be freed.
Since the subvolume is already gone, leaked per-trans space won't cause
any trouble for end users.
[FIX]
Just call btrfs_qgroup_free_meta_all_pertrans() before a subvolume is
completely dropped.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
clang static analysis flags this error
fs/btrfs/ref-verify.c:290:3: warning: Potential leak of memory pointed to by 're' [unix.Malloc]
kfree(be);
^~~~~
The problem is in this block of code:
if (root_objectid) {
struct root_entry *exist_re;
exist_re = insert_root_entry(&exist->roots, re);
if (exist_re)
kfree(re);
}
There is no 'else' block freeing when root_objectid is 0. Add the
missing kfree to the else branch.
Fixes: fd708b81d9 ("Btrfs: add a extent ref verify tool")
CC: stable@vger.kernel.org # 4.19+
Signed-off-by: Tom Rix <trix@redhat.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The whole chunk tree is read at mount time so we can utilize readahead
to get the tree blocks to memory before we read the items. The idea is
from Robbie, but instead of updating search slot readahead, this patch
implements the chunk tree readahead manually from nodes on level 1.
We've decided to do specific readahead optimizations and then unify them
under a common API so we don't break everything by changing the search
slot readahead logic.
Higher chunk trees grow on large filesystems (many terabytes), and
prefetching just level 1 seems to be sufficient. Provided example was
from a 200TiB filesystem with chunk tree level 2.
CC: Robbie Ko <robbieko@synology.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add retrieval of the filesystem's metadata UUID to the fsinfo ioctl.
This is driven by setting the BTRFS_FS_INFO_FLAG_METADATA_UUID flag in
btrfs_ioctl_fs_info_args::flags.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add retrieval of the filesystem's generation to the fsinfo ioctl. This is
driven by setting the BTRFS_FS_INFO_FLAG_GENERATION flag in
btrfs_ioctl_fs_info_args::flags.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
With the recent addition of filesystem checksum types other than CRC32c,
it is not anymore hard-coded which checksum type a btrfs filesystem uses.
Up to now there is no good way to read the filesystem checksum, apart from
reading the filesystem UUID and then query sysfs for the checksum type.
Add a new csum_type and csum_size fields to the BTRFS_IOC_FS_INFO ioctl
command which usually is used to query filesystem features. Also add a
flags member indicating that the kernel responded with a set csum_type and
csum_size field.
For compatibility reasons, only return the csum_type and csum_size if
the BTRFS_FS_INFO_FLAG_CSUM_INFO flag was passed to the kernel. Also
clear any unknown flags so we don't pass false positives to user-space
newer than the kernel.
To simplify further additions to the ioctl, also switch the padding to a
u8 array. Pahole was used to verify the result of this switch:
The csum members are added before flags, which might look odd, but this
is to keep the alignment requirements and not to introduce holes in the
structure.
$ pahole -C btrfs_ioctl_fs_info_args fs/btrfs/btrfs.ko
struct btrfs_ioctl_fs_info_args {
__u64 max_id; /* 0 8 */
__u64 num_devices; /* 8 8 */
__u8 fsid[16]; /* 16 16 */
__u32 nodesize; /* 32 4 */
__u32 sectorsize; /* 36 4 */
__u32 clone_alignment; /* 40 4 */
__u16 csum_type; /* 44 2 */
__u16 csum_size; /* 46 2 */
__u64 flags; /* 48 8 */
__u8 reserved[968]; /* 56 968 */
/* size: 1024, cachelines: 16, members: 10 */
};
Fixes: 3951e7f050 ("btrfs: add xxhash64 to checksumming algorithms")
Fixes: 3831bf0094 ("btrfs: add sha256 to checksumming algorithm")
CC: stable@vger.kernel.org # 5.5+
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
commit a514d63882 ("btrfs: qgroup: Commit transaction in advance to
reduce early EDQUOT") tries to reduce the early EDQUOT problems by
checking the qgroup free against threshold and tries to wake up commit
kthread to free some space.
The problem of that mechanism is, it can only free qgroup per-trans
metadata space, can't do anything to data, nor prealloc qgroup space.
Now since we have the ability to flush qgroup space, and implemented
retry-after-EDQUOT behavior, such mechanism can be completely replaced.
So this patch will cleanup such mechanism in favor of
retry-after-EDQUOT.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[PROBLEM]
There are known problem related to how btrfs handles qgroup reserved
space. One of the most obvious case is the the test case btrfs/153,
which do fallocate, then write into the preallocated range.
btrfs/153 1s ... - output mismatch (see xfstests-dev/results//btrfs/153.out.bad)
--- tests/btrfs/153.out 2019-10-22 15:18:14.068965341 +0800
+++ xfstests-dev/results//btrfs/153.out.bad 2020-07-01 20:24:40.730000089 +0800
@@ -1,2 +1,5 @@
QA output created by 153
+pwrite: Disk quota exceeded
+/mnt/scratch/testfile2: Disk quota exceeded
+/mnt/scratch/testfile2: Disk quota exceeded
Silence is golden
...
(Run 'diff -u xfstests-dev/tests/btrfs/153.out xfstests-dev/results//btrfs/153.out.bad' to see the entire diff)
[CAUSE]
Since commit c6887cd111 ("Btrfs: don't do nocow check unless we have to"),
we always reserve space no matter if it's COW or not.
Such behavior change is mostly for performance, and reverting it is not
a good idea anyway.
For preallcoated extent, we reserve qgroup data space for it already,
and since we also reserve data space for qgroup at buffered write time,
it needs twice the space for us to write into preallocated space.
This leads to the -EDQUOT in buffered write routine.
And we can't follow the same solution, unlike data/meta space check,
qgroup reserved space is shared between data/metadata.
The EDQUOT can happen at the metadata reservation, so doing NODATACOW
check after qgroup reservation failure is not a solution.
[FIX]
To solve the problem, we don't return -EDQUOT directly, but every time
we got a -EDQUOT, we try to flush qgroup space:
- Flush all inodes of the root
NODATACOW writes will free the qgroup reserved at run_dealloc_range().
However we don't have the infrastructure to only flush NODATACOW
inodes, here we flush all inodes anyway.
- Wait for ordered extents
This would convert the preallocated metadata space into per-trans
metadata, which can be freed in later transaction commit.
- Commit transaction
This will free all per-trans metadata space.
Also we don't want to trigger flush multiple times, so here we introduce
a per-root wait list and a new root status, to ensure only one thread
starts the flushing.
Fixes: c6887cd111 ("Btrfs: don't do nocow check unless we have to")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[PROBLEM]
Before this patch, when btrfs_qgroup_reserve_data() fails, we free all
reserved space of the changeset.
For example:
ret = btrfs_qgroup_reserve_data(inode, changeset, 0, SZ_1M);
ret = btrfs_qgroup_reserve_data(inode, changeset, SZ_1M, SZ_1M);
ret = btrfs_qgroup_reserve_data(inode, changeset, SZ_2M, SZ_1M);
If the last btrfs_qgroup_reserve_data() failed, it will release the
entire [0, 3M) range.
This behavior is kind of OK for now, as when we hit -EDQUOT, we normally
go error handling and need to release all reserved ranges anyway.
But this also means the following call is not possible:
ret = btrfs_qgroup_reserve_data();
if (ret == -EDQUOT) {
/* Do something to free some qgroup space */
ret = btrfs_qgroup_reserve_data();
}
As if the first btrfs_qgroup_reserve_data() fails, it will free all
reserved qgroup space.
[CAUSE]
This is because we release all reserved ranges when
btrfs_qgroup_reserve_data() fails.
[FIX]
This patch will implement a new function, qgroup_unreserve_range(), to
iterate through the ulist nodes, to find any nodes in the failure range,
and remove the EXTENT_QGROUP_RESERVED bits from the io_tree, and
decrease the extent_changeset::bytes_changed, so that we can revert to
previous state.
This allows later patches to retry btrfs_qgroup_reserve_data() if EDQUOT
happens.
Suggested-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have refcount_t now with the associated library to handle refcounts,
which gives us extra debugging around reference count mistakes that may
be made. For example it'll warn on any transition from 0->1 or 0->-1,
which is handy for noticing cases where we've messed up reference
counting. Convert the block group ref counting from an atomic_t to
refcount_t and use the appropriate helpers.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Multi-statement macros should be enclosed in do/while(0) block to make
their use safe in single statement if conditions. All current uses of
the macros are safe, so this change is for future protection.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
While at it use the opportunity to simplify find_logical_bio_stripe by
reducing the scope of 'stripe_start' variable and squash the
sector-to-bytes conversion on one line.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Unify the style in the file such that return value of bio_list_pop is
assigned directly in the while loop. This is in line with the rest of
the kernel.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The merging logic is always executed if the current stripe's device
is not missing. So there's no point in duplicating the check. Simply
remove it, while at it reduce the scope of the 'last_end' variable.
If the current stripe's device is missing we fail the stripe early on.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since btrfs_bio always contains the extra space for the tgtdev_map and
raid_maps it's pointless to make the assignment iff specific conditions
are met.
Instead, always assign the pointers to their correct value at allocation
time. To accommodate this change also move code a bit in
__btrfs_map_block so that btrfs_bio::stripes array is always initialized
before the raid_map, subsequently move the call to sort_parity_stripes
in the 'if' building the raid_map, retaining the old behavior.
To better understand the change, there are 2 aspects to this:
1. The original code is harder to grasp because the calculations for
initializing raid_map/tgtdev ponters are apart from the initial
allocation of memory. Having them predicated on 2 separate checks
doesn't help that either... So by moving the initialisation in
alloc_btrfs_bio puts everything together.
2. tgtdev/raid_maps are now always initialized despite sometimes they
might be equal i.e __btrfs_map_block_for_discard calls
alloc_btrfs_bio with tgtdev = 0 but their usage should be predicated
on external checks i.e. just because those pointers are non-null
doesn't mean they are valid per-se. And actually while taking another
look at __btrfs_map_block I saw a discrepancy:
Original code initialised tgtdev_map if the following check is true:
if (dev_replace_is_ongoing && dev_replace->tgtdev != NULL)
However, further down tgtdev_map is only used if the following check
is true:
if (dev_replace_is_ongoing && dev_replace->tgtdev != NULL && need_full_stripe(op))
e.g. the additional need_full_stripe(op) predicate is there.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ copy more details from mail discussion ]
Signed-off-by: David Sterba <dsterba@suse.com>
Since BTRFS uses a private bdi it makes sense to create a link to this
bdi under /sys/fs/btrfs/<UUID>/bdi. This allows size of read ahead to
be controlled. Without this patch it's not possible to uniquely identify
which bdi pertains to which btrfs filesystem in the case of multiple
btrfs filesystems.
It's fine to simply call sysfs_remove_link without checking if the
link indeed has been created. The call path
sysfs_remove_link
kernfs_remove_by_name
kernfs_remove_by_name_ns
will simply return -ENOENT in case it doesn't exist.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If a compressed read fails due to checksum error only a line is printed
to dmesg, device corrupt counter is not modified.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
compressed_bio::orig_bio is always set in btrfs_submit_compressed_read
before any bio submission is performed. Since that function is always
called with a valid bio it renders the ASSERT unnecessary.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that btrfs_io_bio have access to btrfs_device we can safely
increment the device corruption counter on error. There is one notable
exception - repair bios for raid. Since those don't go through the
normal submit_stripe_bio callpath but through raid56_parity_recover thus
repair bios won't have their device set.
Scrub increments the corruption counter for checksum mismatch as well
but does not call this function.
Link: https://lore.kernel.org/linux-btrfs/4857863.FCrPRfMyHP@liv/
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_map_bio ensures that all submitted bios to devices have valid
btrfs_device::bdev so this check can be removed from btrfs_end_bio. This
check was added in june 2012 597a60fade ("Btrfs: don't count I/O
statistic read errors for missing devices") but then in October of the
same year another commit de1ee92ac3 ("Btrfs: recheck bio against
block device when we map the bio") started checking for the presence of
btrfs_device::bdev before actually issuing the bio.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of recording stripe_index and using that to access correct
btrfs_device from btrfs_bio::stripes record the btrfs_device in
btrfs_io_bio. This will enable endio handlers to increment device
error counters on checksum errors.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Make the function directly return a pointer to a failure record and
adjust callers to handle it. Also refactor the logic inside so that
the case which allocates the failure record for the first time is not
handled in an 'if' arm, saving us a level of indentation. Finally make
the function static as it's not used outside of extent_io.c .
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Only failure that get_state_failrec can get is if there is no failure
for the given address. There is no reason why the function should return
a status code and use a separate parameter for returning the actual
failure rec (if one is found). Simplify it by making the return type
a pointer and return ERR_PTR value in case of errors.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The option subvolrootid used to be a workaround for mounting subvolumes
and ineffective since 5e2a4b25da ("btrfs: deprecate subvolrootid mount
option"). We have subvol= that works and we don't need to keep the
cruft, let's remove it.
Signed-off-by: David Sterba <dsterba@suse.com>
The mount option alloc_start has no effect since 0d0c71b317 ("btrfs:
obsolete and remove mount option alloc_start") which has details why
it's been deprecated. We can remove it.
Signed-off-by: David Sterba <dsterba@suse.com>
When syncing the log, we used to update the log root tree without holding
neither the log_mutex of the subvolume root nor the log_mutex of log root
tree.
We used to have two critical sections delimited by the log_mutex of the
log root tree, so in the first one we incremented the log_writers of the
log root tree and on the second one we decremented it and waited for the
log_writers counter to go down to zero. This was because the update of
the log root tree happened between the two critical sections.
The use of two critical sections allowed a little bit more of parallelism
and required the use of the log_writers counter, necessary to make sure
we didn't miss any log root tree update when we have multiple tasks trying
to sync the log in parallel.
However after commit 06989c799f ("Btrfs: fix race updating log root
item during fsync") the log root tree update was moved into a critical
section delimited by the subvolume's log_mutex. Later another commit
moved the log tree update from that critical section into the second
critical section delimited by the log_mutex of the log root tree. Both
commits addressed different bugs.
The end result is that the first critical section delimited by the
log_mutex of the log root tree became pointless, since there's nothing
done between it and the second critical section, we just have an unlock
of the log_mutex followed by a lock operation. This means we can merge
both critical sections, as the first one does almost nothing now, and we
can stop using the log_writers counter of the log root tree, which was
incremented in the first critical section and decremented in the second
criticial section, used to make sure no one in the second critical section
started writeback of the log root tree before some other task updated it.
So just remove the mutex_unlock() followed by mutex_lock() of the log root
tree, as well as the use of the log_writers counter for the log root tree.
This patch is part of a series that has the following patches:
1/4 btrfs: only commit the delayed inode when doing a full fsync
2/4 btrfs: only commit delayed items at fsync if we are logging a directory
3/4 btrfs: stop incremening log_batch for the log root tree when syncing log
4/4 btrfs: remove no longer needed use of log_writers for the log root tree
After the entire patchset applied I saw about 12% decrease on max latency
reported by dbench. The test was done on a qemu vm, with 8 cores, 16Gb of
ram, using kvm and using a raw NVMe device directly (no intermediary fs on
the host). The test was invoked like the following:
mkfs.btrfs -f /dev/sdk
mount -o ssd -o nospace_cache /dev/sdk /mnt/sdk
dbench -D /mnt/sdk -t 300 8
umount /mnt/dsk
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We are incrementing the log_batch atomic counter of the root log tree but
we never use that counter, it's used only for the log trees of subvolume
roots. We started doing it when we moved the log_batch and log_write
counters from the global, per fs, btrfs_fs_info structure, into the
btrfs_root structure in commit 7237f18336 ("Btrfs: fix tree logs
parallel sync").
So just stop doing it for the log root tree and add a comment over the
field declaration so inform it's used only for log trees of subvolume
roots.
This patch is part of a series that has the following patches:
1/4 btrfs: only commit the delayed inode when doing a full fsync
2/4 btrfs: only commit delayed items at fsync if we are logging a directory
3/4 btrfs: stop incremening log_batch for the log root tree when syncing log
4/4 btrfs: remove no longer needed use of log_writers for the log root tree
After the entire patchset applied I saw about 12% decrease on max latency
reported by dbench. The test was done on a qemu vm, with 8 cores, 16Gb of
ram, using kvm and using a raw NVMe device directly (no intermediary fs on
the host). The test was invoked like the following:
mkfs.btrfs -f /dev/sdk
mount -o ssd -o nospace_cache /dev/sdk /mnt/sdk
dbench -D /mnt/sdk -t 300 8
umount /mnt/dsk
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When logging an inode we are committing its delayed items if either the
inode is a directory or if it is a new inode, created in the current
transaction.
We need to do it for directories, since new directory indexes are stored
as delayed items of the inode and when logging a directory we need to be
able to access all indexes from the fs/subvolume tree in order to figure
out which index ranges need to be logged.
However for new inodes that are not directories, we do not need to do it
because the only type of delayed item they can have is the inode item, and
we are guaranteed to always log an up to date version of the inode item:
*) for a full fsync we do it by committing the delayed inode and then
copying the item from the fs/subvolume tree with
copy_inode_items_to_log();
*) for a fast fsync we always log the inode item based on the contents of
the in-memory struct btrfs_inode. We guarantee this is always done since
commit e4545de5b0 ("Btrfs: fix fsync data loss after append write").
So stop running delayed items for a new inodes that are not directories,
since that forces committing the delayed inode into the fs/subvolume tree,
wasting time and adding contention to the tree when a full fsync is not
required. We will only do it in case a fast fsync is needed.
This patch is part of a series that has the following patches:
1/4 btrfs: only commit the delayed inode when doing a full fsync
2/4 btrfs: only commit delayed items at fsync if we are logging a directory
3/4 btrfs: stop incremening log_batch for the log root tree when syncing log
4/4 btrfs: remove no longer needed use of log_writers for the log root tree
After the entire patchset applied I saw about 12% decrease on max latency
reported by dbench. The test was done on a qemu vm, with 8 cores, 16Gb of
ram, using kvm and using a raw NVMe device directly (no intermediary fs on
the host). The test was invoked like the following:
mkfs.btrfs -f /dev/sdk
mount -o ssd -o nospace_cache /dev/sdk /mnt/sdk
dbench -D /mnt/sdk -t 300 8
umount /mnt/dsk
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit 2c2c452b0c ("Btrfs: fix fsync when extend references are added
to an inode") forced a commit of the delayed inode when logging an inode
in order to ensure we would end up logging the inode item during a full
fsync. By committing the delayed inode, we updated the inode item in the
fs/subvolume tree and then later when copying items from leafs modified in
the current transaction into the log tree (with copy_inode_items_to_log())
we ended up copying the inode item from the fs/subvolume tree into the log
tree. Logging an up to date version of the inode item is required to make
sure at log replay time we get the link count fixup triggered among other
things (replay xattr deletes, etc). The test case generic/040 from fstests
exercises the bug which that commit fixed.
However for a fast fsync we don't need to commit the delayed inode because
we always log an up to date version of the inode item based on the struct
btrfs_inode we have in-memory. We started doing this for fast fsyncs since
commit e4545de5b0 ("Btrfs: fix fsync data loss after append write").
So just stop committing the delayed inode if we are doing a fast fsync,
we are only wasting time and adding contention on fs/subvolume tree.
This patch is part of a series that has the following patches:
1/4 btrfs: only commit the delayed inode when doing a full fsync
2/4 btrfs: only commit delayed items at fsync if we are logging a directory
3/4 btrfs: stop incremening log_batch for the log root tree when syncing log
4/4 btrfs: remove no longer needed use of log_writers for the log root tree
After the entire patchset applied I saw about 12% decrease on max latency
reported by dbench. The test was done on a qemu vm, with 8 cores, 16Gb of
ram, using kvm and using a raw NVMe device directly (no intermediary fs on
the host). The test was invoked like the following:
mkfs.btrfs -f /dev/sdk
mount -o ssd -o nospace_cache /dev/sdk /mnt/sdk
dbench -D /mnt/sdk -t 300 8
umount /mnt/dsk
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
When the anonymous block device pool is exhausted, subvolume/snapshot
creation fails with EMFILE (Too many files open). This has been reported
by a user. The allocation happens in the second phase during transaction
commit where it's only way out is to abort the transaction
BTRFS: Transaction aborted (error -24)
WARNING: CPU: 17 PID: 17041 at fs/btrfs/transaction.c:1576 create_pending_snapshot+0xbc4/0xd10 [btrfs]
RIP: 0010:create_pending_snapshot+0xbc4/0xd10 [btrfs]
Call Trace:
create_pending_snapshots+0x82/0xa0 [btrfs]
btrfs_commit_transaction+0x275/0x8c0 [btrfs]
btrfs_mksubvol+0x4b9/0x500 [btrfs]
btrfs_ioctl_snap_create_transid+0x174/0x180 [btrfs]
btrfs_ioctl_snap_create_v2+0x11c/0x180 [btrfs]
btrfs_ioctl+0x11a4/0x2da0 [btrfs]
do_vfs_ioctl+0xa9/0x640
ksys_ioctl+0x67/0x90
__x64_sys_ioctl+0x1a/0x20
do_syscall_64+0x5a/0x110
entry_SYSCALL_64_after_hwframe+0x44/0xa9
---[ end trace 33f2f83f3d5250e9 ]---
BTRFS: error (device sda1) in create_pending_snapshot:1576: errno=-24 unknown
BTRFS info (device sda1): forced readonly
BTRFS warning (device sda1): Skipping commit of aborted transaction.
BTRFS: error (device sda1) in cleanup_transaction:1831: errno=-24 unknown
[CAUSE]
When the global anonymous block device pool is exhausted, the following
call chain will fail, and lead to transaction abort:
btrfs_ioctl_snap_create_v2()
|- btrfs_ioctl_snap_create_transid()
|- btrfs_mksubvol()
|- btrfs_commit_transaction()
|- create_pending_snapshot()
|- btrfs_get_fs_root()
|- btrfs_init_fs_root()
|- get_anon_bdev()
[FIX]
Although we can't enlarge the anonymous block device pool, at least we
can preallocate anon_dev for subvolume/snapshot in the first phase,
outside of transaction context and exactly at the moment the user calls
the creation ioctl.
Reported-by: Greed Rong <greedrong@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CA+UqX+NTrZ6boGnWHhSeZmEY5J76CTqmYjO2S+=tHJX7nb9DPw@mail.gmail.com/
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
When a lot of subvolumes are created, there is a user report about
transaction aborted caused by slow anonymous block device reclaim:
BTRFS: Transaction aborted (error -24)
WARNING: CPU: 17 PID: 17041 at fs/btrfs/transaction.c:1576 create_pending_snapshot+0xbc4/0xd10 [btrfs]
RIP: 0010:create_pending_snapshot+0xbc4/0xd10 [btrfs]
Call Trace:
create_pending_snapshots+0x82/0xa0 [btrfs]
btrfs_commit_transaction+0x275/0x8c0 [btrfs]
btrfs_mksubvol+0x4b9/0x500 [btrfs]
btrfs_ioctl_snap_create_transid+0x174/0x180 [btrfs]
btrfs_ioctl_snap_create_v2+0x11c/0x180 [btrfs]
btrfs_ioctl+0x11a4/0x2da0 [btrfs]
do_vfs_ioctl+0xa9/0x640
ksys_ioctl+0x67/0x90
__x64_sys_ioctl+0x1a/0x20
do_syscall_64+0x5a/0x110
entry_SYSCALL_64_after_hwframe+0x44/0xa9
---[ end trace 33f2f83f3d5250e9 ]---
BTRFS: error (device sda1) in create_pending_snapshot:1576: errno=-24 unknown
BTRFS info (device sda1): forced readonly
BTRFS warning (device sda1): Skipping commit of aborted transaction.
BTRFS: error (device sda1) in cleanup_transaction:1831: errno=-24 unknown
[CAUSE]
The anonymous device pool is shared and its size is 1M. It's possible to
hit that limit if the subvolume deletion is not fast enough and the
subvolumes to be cleaned keep the ids allocated.
[WORKAROUND]
We can't avoid the anon device pool exhaustion but we can shorten the
time the id is attached to the subvolume root once the subvolume becomes
invisible to the user.
Reported-by: Greed Rong <greedrong@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CA+UqX+NTrZ6boGnWHhSeZmEY5J76CTqmYjO2S+=tHJX7nb9DPw@mail.gmail.com/
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
When a lot of subvolumes are created, there is a user report about
transaction aborted:
BTRFS: Transaction aborted (error -24)
WARNING: CPU: 17 PID: 17041 at fs/btrfs/transaction.c:1576 create_pending_snapshot+0xbc4/0xd10 [btrfs]
RIP: 0010:create_pending_snapshot+0xbc4/0xd10 [btrfs]
Call Trace:
create_pending_snapshots+0x82/0xa0 [btrfs]
btrfs_commit_transaction+0x275/0x8c0 [btrfs]
btrfs_mksubvol+0x4b9/0x500 [btrfs]
btrfs_ioctl_snap_create_transid+0x174/0x180 [btrfs]
btrfs_ioctl_snap_create_v2+0x11c/0x180 [btrfs]
btrfs_ioctl+0x11a4/0x2da0 [btrfs]
do_vfs_ioctl+0xa9/0x640
ksys_ioctl+0x67/0x90
__x64_sys_ioctl+0x1a/0x20
do_syscall_64+0x5a/0x110
entry_SYSCALL_64_after_hwframe+0x44/0xa9
---[ end trace 33f2f83f3d5250e9 ]---
BTRFS: error (device sda1) in create_pending_snapshot:1576: errno=-24 unknown
BTRFS info (device sda1): forced readonly
BTRFS warning (device sda1): Skipping commit of aborted transaction.
BTRFS: error (device sda1) in cleanup_transaction:1831: errno=-24 unknown
[CAUSE]
The error is EMFILE (Too many files open) and comes from the anonymous
block device allocation. The ids are in a shared pool of size 1<<20.
The ids are assigned to live subvolumes, ie. the root structure exists
in memory (eg. after creation or after the root appears in some path).
The pool could be exhausted if the numbers are not reclaimed fast
enough, after subvolume deletion or if other system component uses the
anon block devices.
[WORKAROUND]
Since it's not possible to completely solve the problem, we can only
minimize the time the id is allocated to a subvolume root.
Firstly, we can reduce the use of anon_dev by trees that are not
subvolume roots, like data reloc tree.
This patch will do extra check on root objectid, to skip roots that
don't need anon_dev. Currently it's only data reloc tree and orphan
roots.
Reported-by: Greed Rong <greedrong@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CA+UqX+NTrZ6boGnWHhSeZmEY5J76CTqmYjO2S+=tHJX7nb9DPw@mail.gmail.com/
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This patch will add the following sysfs interface:
/sys/fs/btrfs/<UUID>/qgroups/<qgroup_id>/referenced
/sys/fs/btrfs/<UUID>/qgroups/<qgroup_id>/exclusive
/sys/fs/btrfs/<UUID>/qgroups/<qgroup_id>/max_referenced
/sys/fs/btrfs/<UUID>/qgroups/<qgroup_id>/max_exclusive
/sys/fs/btrfs/<UUID>/qgroups/<qgroup_id>/limit_flags
Which is also available in output of "btrfs qgroup show".
/sys/fs/btrfs/<UUID>/qgroups/<qgroup_id>/rsv_data
/sys/fs/btrfs/<UUID>/qgroups/<qgroup_id>/rsv_meta_pertrans
/sys/fs/btrfs/<UUID>/qgroups/<qgroup_id>/rsv_meta_prealloc
The last 3 rsv related members are not visible to users, but can be very
useful to debug qgroup limit related bugs.
Also, to avoid '/' used in <qgroup_id>, the separator between qgroup
level and qgroup id is changed to '_'.
The interface is not hidden behind 'debug' as we want this interface to
be included into production build and to provide another way to read the
qgroup information besides the ioctls.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The qgroup level is limited to u16, so no need to use u64 for it.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
vfs_inode is used only for the inode number everything else requires
btrfs_inode.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ use btrfs_ino ]
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of making multiple calls to BTRFS_I simply take btrfs_inode as
an input paramter.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The vfs inode is only used for a pair of inode_lock/unlock calls all
other uses call for btrfs_inode.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All of its children functions use btrfs_inode.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All of its children take btrfs_inode so bubble up this requirement to
btrfs_delalloc_reserve_space's interface and stop calling BTRFS_I
internally.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of calling BTRFS_I on the passed vfs_inode take btrfs_inode
directly.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It needs btrfs_inode so take it as a parameter directly.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It only uses btrfs_inode internally so take it as a parameter.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
No point in taking an inode only to get btrfs_fs_info from it, instead
take btrfs_fs_info directly.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's only a single use of vfs_inode in a tracepoint so let's take
btrfs_inode directly.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is a single use of the generic vfs_inode so let's take btrfs_inode
as a parameter and remove couple of redundant BTRFS_I() calls.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Preparation to make btrfs_dirty_pages take btrfs_inode as parameter.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Only find_lock_delalloc_range uses vfs_inode so let's take the
btrfs_inode as a parameter.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It has only a single use for a generic vfs inode vs 3 for btrfs_inode.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function really needs a btrfs_inode and not a generic vfs one. Take
it as a parameter and get rid of superfluous BTRFS_I() calls.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Take btrfs_inode directly and stop using superfulous BTRFS_I calls.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Simply forwards its argument so let's get rid of one extra BTRFS_I by
taking btrfs_inode directly.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All children now take btrfs_inode so convert it to taking it as a
parameter as well.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Gets rid of superfulous BTRFS_I() calls and prepare for converting
btrfs_run_delalloc_range to using btrfs_inode.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Simply gets rid of superfluous BTRFS_I() calls.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Gets rid of superfluous BTRFS_I() calls.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Preparation to converting btrfs_run_delalloc_range to using btrfs_inode
without BTRFS_I() calls.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It really wants btrfs_inode.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It doesn't really need vfs_inode but btrfs_inode.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It only uses vfs inode for assigning it to the async_chunk function.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It only really uses btrfs_inode.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It really wants btrfs_inode and is prepration to converting
run_delalloc_nocow to taking btrfs_inode.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It just forwards its argument to __btrfs_qgroup_release_data.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All but 3 uses require vfs_inode so convert the logic to have
btrfs_inode be the main inode struct.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Majority of its uses are for btrfs_inode so take it as an argument
directly.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It simpy forwards its inode argument to __btrfs_add_ordered_extent which
already takes btrfs_inode.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All its children functions take btrfs_inode so convert it to taking
btrfs_inode.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Preparation to converting its callers to taking btrfs_inode.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It has only 2 uses for the vfs_inode - insert_inline_extent and
i_size_read. On the flipside it will allow converting its callers to
btrfs_inode, so convert it to taking btrfs_inode.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It passes btrfs_inode to its callee so change the interface.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It uses vfs_inode only for a tracepoint so convert its interface to take
btrfs_inode directly.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It only uses btrfs_inode so can just as easily take it as an argument.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
On a filesystem with exhausted metadata, but still enough to start
balance, it's possible to hit this error:
[324402.053842] BTRFS info (device loop0): 1 enospc errors during balance
[324402.060769] BTRFS info (device loop0): balance: ended with status: -28
[324402.172295] BTRFS: error (device loop0) in reset_balance_state:3321: errno=-28 No space left
It fails inside reset_balance_state and turns the filesystem to
read-only, which is unnecessary and should be fixed too, but the problem
is caused by lack for space when the balance item is deleted. This is a
one-time operation and from the same rank as unlink that is allowed to
use the global block reserve. So do the same for the balance item.
Status of the filesystem (100GiB) just after the balance fails:
$ btrfs fi df mnt
Data, single: total=80.01GiB, used=38.58GiB
System, single: total=4.00MiB, used=16.00KiB
Metadata, single: total=19.99GiB, used=19.48GiB
GlobalReserve, single: total=512.00MiB, used=50.11MiB
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function btrfs_check_can_nocow() now has two completely different
call patterns.
For nowait variant, callers don't need to do any cleanup. While for
wait variant, callers need to release the lock if they can do nocow
write.
This is somehow confusing, and is already a problem for the exported
btrfs_check_can_nocow().
So this patch will separate the different patterns into different
functions.
For nowait variant, the function will be called check_nocow_nolock().
For wait variant, the function pair will be btrfs_check_nocow_lock()
btrfs_check_nocow_unlock().
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
These two functions have extra conditions that their callers need to
meet, and some not-that-common parameters used for return value.
So adding some comments may save reviewers some time.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
When the data space is exhausted, even if the inode has NOCOW attribute,
we will still refuse to truncate unaligned range due to ENOSPC.
The following script can reproduce it pretty easily:
#!/bin/bash
dev=/dev/test/test
mnt=/mnt/btrfs
umount $dev &> /dev/null
umount $mnt &> /dev/null
mkfs.btrfs -f $dev -b 1G
mount -o nospace_cache $dev $mnt
touch $mnt/foobar
chattr +C $mnt/foobar
xfs_io -f -c "pwrite -b 4k 0 4k" $mnt/foobar > /dev/null
xfs_io -f -c "pwrite -b 4k 0 1G" $mnt/padding &> /dev/null
sync
xfs_io -c "fpunch 0 2k" $mnt/foobar
umount $mnt
Currently this will fail at the fpunch part.
[CAUSE]
Because btrfs_truncate_block() always reserves space without checking
the NOCOW attribute.
Since the writeback path follows NOCOW bit, we only need to bother the
space reservation code in btrfs_truncate_block().
[FIX]
Make btrfs_truncate_block() follow btrfs_buffered_write() to try to
reserve data space first, and fall back to NOCOW check only when we
don't have enough space.
Such always-try-reserve is an optimization introduced in
btrfs_buffered_write(), to avoid expensive btrfs_check_can_nocow() call.
This patch will export check_can_nocow() as btrfs_check_can_nocow(), and
use it in btrfs_truncate_block() to fix the problem.
Reported-by: Martin Doucha <martin.doucha@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Estimated time of removal of the functionality is 5.11, the option will
be still parsed but will have no effect.
Reasons for deprecation and removal:
- very poor naming choice of the mount option, it's supposed to cache
and reuse the inode _numbers_, but it sounds a some generic cache for
inodes
- the only known usecase where this option would make sense is on a
32bit architecture where inode numbers in one subvolume would be
exhausted due to 32bit inode::i_ino
- the cache is stored on disk, consumes space, needs to be loaded and
written back
- new inode number allocation is slower due to lookups into the cache
(compared to a simple increment which is the default)
- uses the free-space-cache code that is going to be deprecated as well
in the future
Known problems:
- since 2011, returning EEXIST when there's not enough space in a page
to store all checksums, see commit 4b9465cb9e ("Btrfs: add mount -o
inode_cache")
Remaining issues:
- if the option was enabled, new inodes created, the option disabled
again, the cache is still stored on the devices and there's currently
no way to remove it
Signed-off-by: David Sterba <dsterba@suse.com>
Last touched in 2013 by commit de78b51a28 ("btrfs: remove cache only
arguments from defrag path") that was the only code that used the value.
Now it's only set but never used for anything, so we can remove it.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The fiemap callback is not part of UAPI interface and the prototypes
don't have the __u64 types either.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
num_extents is already checked in the next if condition and can
be safely removed.
Signed-off-by: Denis Efremov <efremov@linux.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In btrfs_put_root() we're freeing a btrfs_root's 'node' and 'commit_root'
extent buffers manually via kfree(), while we're using
free_root_extent_buffers() in the free_root_pointers() function above.
free_root_extent_buffers() also NULLs the pointers after freeing, which
mitigates potential double frees.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function iterates all extents in the extent cluster, make this
intention obvious by using a for loop. No functional chanes.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_alloc_data_chunk_ondemand and btrfs_free_reserved_data_space_noquota
don't really use the guts of the inodes being passed to them. This
implies it's not required to call them under extent lock. Move code
around in prealloc_file_extent_cluster to do the heavy, data alloc/free
operations outside of the lock. This also makes the 'out' label
unnecessary, so remove it.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Extents in the extent cluster are guaranteed to be contiguous as such
the hole check inside the loop can never trigger. In fact this check was
never functional since it was added in 18513091af ("btrfs: update
btrfs_space_info's bytes_may_use timely") which came after the commit
introducing clustered/contiguous extents 0257bb82d2 ("Btrfs: relocate
file extents in clusters").
Let's just remove it as it adds noise to the source.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It has only 4 uses of a vfs_inode for inode_sub_bytes but unifies the
interface with the non __ prefixed version. Will also makes converting
its callers to btrfs_inode easier.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Will enable converting btrfs_submit_compressed_write to btrfs_inode more
easily.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It has one VFS and 1 btrfs inode usages but converting it to btrfs_inode
interface will allow seamless conversion of its callers.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It really wants a btrfs_inode and will allow submit_compressed_extents
to be completely converted to btrfs_inode in follow up patches.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It really wants btrfs_inode and not a vfs inode.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It doesn't use the generic vfs inode for anything use btrfs_inode
directly.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It doesn't use the vfs inode for anything, can just as easily take
btrfs_inode. Follow up patches will convert callers as well.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is internal btrfs function what really needs the vfs_inode only for
igrab and a tracepoint.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The 'trans_list' member of an ordered extent was used to keep track of the
ordered extents for which a transaction commit had to wait. These were
ordered extents that were started and logged by an fsync. However we don't
do that anymore and before we stopped doing it we changed the approach to
wait for the ordered extents in commit 161c3549b4 ("Btrfs: change how
we wait for pending ordered extents"), which stopped using that list and
therefore the 'trans_list' member is not used anymore since that commit.
So just remove it since it's doing nothing and making each ordered extent
structure waste memory (2 pointers).
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The 'log_list' member of an ordered extent was used keep track of which
ordered extents we needed to wait after logging metadata, but is not used
anymore since commit 5636cf7d6d ("btrfs: remove the logged extents
infrastructure"), as we now always wait on ordered extent completion
before logging metadata. So just remove it since it's doing nothing and
making each ordered extent structure waste more memory (2 pointers).
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The CPU and on-disk keys are mapped to two different structures because
of the endianness. There's an intermediate buffer used to do the
conversion, but this is not necessary when CPU and on-disk endianness
match.
Add optimized versions of helpers that take disk_key and use the buffer
directly for CPU keys or drop the intermediate buffer and conversion.
This saves a lot of stack space accross many functions and removes about
6K of generated binary code:
text data bss dec hex filename
1090439 17468 14912 1122819 112203 pre/btrfs.ko
1084613 17456 14912 1116981 110b35 post/btrfs.ko
Delta: -5826
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Before this patch, qgroup completely relies on per-inode extent io tree
to detect reserved data space leak.
However previous bug has already shown how release page before
btrfs_finish_ordered_io() could lead to leak, and since it's
QGROUP_RESERVED bit cleared without triggering qgroup rsv, it can't be
detected by per-inode extent io tree.
So this patch adds another (and hopefully the final) safety net to catch
qgroup data reserved space leak. At least the new safety net catches
all the leaks during development, so it should be pretty useful in the
real world.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
The following simple workload from fsstress can lead to qgroup reserved
data space leak:
0/0: creat f0 x:0 0 0
0/0: creat add id=0,parent=-1
0/1: write f0[259 1 0 0 0 0] [600030,27288] 0
0/4: dwrite - xfsctl(XFS_IOC_DIOINFO) f0[259 1 0 0 64 627318] return 25, fallback to stat()
0/4: dwrite f0[259 1 0 0 64 627318] [610304,106496] 0
This would cause btrfs qgroup to leak 20480 bytes for data reserved
space. If btrfs qgroup limit is enabled, such leak can lead to
unexpected early EDQUOT and unusable space.
[CAUSE]
When doing direct IO, kernel will try to writeback existing buffered
page cache, then invalidate them:
generic_file_direct_write()
|- filemap_write_and_wait_range();
|- invalidate_inode_pages2_range();
However for btrfs, the bi_end_io hook doesn't finish all its heavy work
right after bio ends. In fact, it delays its work further:
submit_extent_page(end_io_func=end_bio_extent_writepage);
end_bio_extent_writepage()
|- btrfs_writepage_endio_finish_ordered()
|- btrfs_init_work(finish_ordered_fn);
<<< Work queue execution >>>
finish_ordered_fn()
|- btrfs_finish_ordered_io();
|- Clear qgroup bits
This means, when filemap_write_and_wait_range() returns,
btrfs_finish_ordered_io() is not guaranteed to be executed, thus the
qgroup bits for related range are not cleared.
Now into how the leak happens, this will only focus on the overlapping
part of buffered and direct IO part.
1. After buffered write
The inode had the following range with QGROUP_RESERVED bit:
596 616K
|///////////////|
Qgroup reserved data space: 20K
2. Writeback part for range [596K, 616K)
Write back finished, but btrfs_finish_ordered_io() not get called
yet.
So we still have:
596K 616K
|///////////////|
Qgroup reserved data space: 20K
3. Pages for range [596K, 616K) get released
This will clear all qgroup bits, but don't update the reserved data
space.
So we have:
596K 616K
| |
Qgroup reserved data space: 20K
That number doesn't match the qgroup bit range anymore.
4. Dio prepare space for range [596K, 700K)
Qgroup reserved data space for that range, we got:
596K 616K 700K
|///////////////|///////////////////////|
Qgroup reserved data space: 20K + 104K = 124K
5. btrfs_finish_ordered_range() gets executed for range [596K, 616K)
Qgroup free reserved space for that range, we got:
596K 616K 700K
| |///////////////////////|
We need to free that range of reserved space.
Qgroup reserved data space: 124K - 20K = 104K
6. btrfs_finish_ordered_range() gets executed for range [596K, 700K)
However qgroup bit for range [596K, 616K) is already cleared in
previous step, so we only free 84K for qgroup reserved space.
596K 616K 700K
| | |
We need to free that range of reserved space.
Qgroup reserved data space: 104K - 84K = 20K
Now there is no way to release that 20K unless disabling qgroup or
unmounting the fs.
[FIX]
This patch will change the timing of btrfs_qgroup_release/free_data()
call. Here it uses buffered COW write as an example.
The new timing | The old timing
----------------------------------------+---------------------------------------
btrfs_buffered_write() | btrfs_buffered_write()
|- btrfs_qgroup_reserve_data() | |- btrfs_qgroup_reserve_data()
|
btrfs_run_delalloc_range() | btrfs_run_delalloc_range()
|- btrfs_add_ordered_extent() |
|- btrfs_qgroup_release_data() |
The reserved is passed into |
btrfs_ordered_extent structure |
|
btrfs_finish_ordered_io() | btrfs_finish_ordered_io()
|- The reserved space is passed to | |- btrfs_qgroup_release_data()
btrfs_qgroup_record | The resereved space is passed
| to btrfs_qgroup_recrod
|
btrfs_qgroup_account_extents() | btrfs_qgroup_account_extents()
|- btrfs_qgroup_free_refroot() | |- btrfs_qgroup_free_refroot()
The point of such change is to ensure, when ordered extents are
submitted, the qgroup reserved space is already released, to keep the
timing aligned with file_write_and_wait_range().
So that qgroup data reserved space is all bound to btrfs_ordered_extent
and solve the timing mismatch.
Fixes: f695fdcef8 ("btrfs: qgroup: Introduce functions to release/free qgroup reserve data space")
Suggested-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The incoming qgroup reserved space timing will move the data reservation
to ordered extent completely.
However in btrfs_punch_hole_lock_range() will call
btrfs_invalidate_page(), which will clear QGROUP_RESERVED bit for the
range.
In current stage it's OK, but if we're making ordered extents handle the
reserved space, then btrfs_punch_hole_lock_range() can clear the
QGROUP_RESERVED bit before we submit ordered extent, leading to qgroup
reserved space leakage.
So here change the timing to make reserve data space after
btrfs_punch_hole_lock_range().
The new timing is fine for either current code or the new code.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is to prepare for the incoming timing change of qgroup reserved
data space and ordered extent.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Function insert_reserved_file_extent() takes a long list of parameters,
which are all for btrfs_file_extent_item, even including two reserved
members, encryption and other_encoding.
This makes the parameter list unnecessary long for a function which only
gets called twice.
This patch will refactor the parameter list, by using
btrfs_file_extent_item as parameter directly to hugely reduce the number
of parameters.
Also, since there are only two callers, one in btrfs_finish_ordered_io()
which inserts file extent for ordered extent, and one
__btrfs_prealloc_file_range().
These two call sites have completely different context, where ordered
extent can be compressed, but will always be regular extent, while the
preallocated one is never going to be compressed and always has PREALLOC
type.
So use two small wrapper for these two different call sites to improve
readability.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add proper variable for the scrub page and use it instead of repeatedly
dereferencing the other structures.
Signed-off-by: David Sterba <dsterba@suse.com>
Use a simpler iteration over tree block pages, same what csum_tree_block
does: first page always exists, loop over the rest.
Signed-off-by: David Sterba <dsterba@suse.com>
Add proper variable for the scrub page and use it instead of repeatedly
dereferencing the other structures.
Signed-off-by: David Sterba <dsterba@suse.com>
Add proper variable for the scrub page and use it instead of repeatedly
dereferencing the other structures.
Signed-off-by: David Sterba <dsterba@suse.com>
The page contents with the checksum is available during the entire
function so we don't need to make a copy.
Signed-off-by: David Sterba <dsterba@suse.com>
BTRFS_SUPER_INFO_SIZE is 4096, and fits to a page on all supported
architectures, so we can calculate the checksum in one go.
Signed-off-by: David Sterba <dsterba@suse.com>
As the page mapping has been removed, rename the variables to 'kaddr'
that we use everywhere else. The type is changed to 'char *' so pointer
arithmetic works without casts.
Signed-off-by: David Sterba <dsterba@suse.com>
All pages that scrub uses in the scrub_block::pagev array are allocated
with GFP_KERNEL and never part of any mapping, so kmap is not necessary,
we only need to know the page address.
In scrub_write_page_to_dev_replace we don't even need to call
flush_dcache_page because of the same reason as above.
Signed-off-by: David Sterba <dsterba@suse.com>
This patch introduces a new "rescue=" mount option group for all mount
options for data recovery.
Different rescue sub options are seperated by ':'. E.g
"ro,rescue=nologreplay:usebackuproot".
The original plan was to use ';', but ';' needs to be escaped/quoted,
or it will be interpreted by bash, similar to '|'.
And obviously, user can specify rescue options one by one like:
"ro,rescue=nologreplay,rescue=usebackuproot".
The following mount options are converted to "rescue=", old mount
options are deprecated but still available for compatibility purpose:
- usebackuproot
Now it's "rescue=usebackuproot"
- nologreplay
Now it's "rescue=nologreplay"
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We currently use btrfs_check_data_free_space() when allocating space for
relocating data extents, but that is not necessary because that function
combines btrfs_alloc_data_chunk_ondemand(), which does the actual space
reservation, and btrfs_qgroup_reserve_data().
We can use btrfs_alloc_data_chunk_ondemand() directly because we know we
do not need to reserve qgroup space since we are dealing with a relocation
tree, which can never have qgroups (btrfs_qgroup_reserve_data() does
nothing as is_fstree() returns false for a relocation tree).
Conversely we can use btrfs_free_reserved_data_space_noquota() directly
instead of btrfs_free_reserved_data_space(), since we had no qgroup
reservation when allocating space.
This change is preparatory work for another patch in this series that
makes relocation reserve the exact amount of space it needs to relocate
a data block group. The function btrfs_check_data_free_space() has
the incovenient of requiring a start offset argument and we will want to
be able to allocate space for multiple ranges, which are not consecutive,
at once.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The start argument for btrfs_free_reserved_data_space_noquota() is only
used to make sure the amount of bytes we decrement from the bytes_may_use
counter of the data space_info object is aligned to the filesystem's
sector size. It serves no other purpose.
All its current callers always pass a length argument that is already
aligned to the sector size, so we can make the start argument go away.
In fact its presence makes it impossible to use it in a context where we
just want to free a number of bytes for a range for which either we do
not know its start offset or for freeing multiple ranges at once (which
are not contiguous).
This change is preparatory work for a patch (third patch in this series)
that makes relocation of data block groups that are not full reserve less
data space.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
As there is a dump_stack() done on memory allocation failures, these
messages might as well be deleted instead.
Signed-off-by: Liao Pingfang <liao.pingfang@zte.com.cn>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minor tweaks ]
Signed-off-by: David Sterba <dsterba@suse.com>
Use the helper function where it is open coded to increment the
block_group reference count As btrfs_get_block_group() is a one-liner we
could have open-coded it, but its partner function
btrfs_put_block_group() isn't one-liner which does the free part in it.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
__btrfs_return_cluster_to_free_space() returns only 0. And all its
parent functions don't need the return value either so make this a void
function.
Further, as none of the callers of btrfs_return_cluster_to_free_space()
is actually using the return from this function, make this function also
return void.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Initially when the 'removed' flag was added to a block group to avoid
races between block group removal and fitrim, by commit 04216820fe
("Btrfs: fix race between fs trimming and block group remove/allocation"),
we had to lock the chunks mutex because we could be moving the block
group from its current list, the pending chunks list, into the pinned
chunks list, or we could just be adding it to the pinned chunks if it was
not in the pending chunks list. Both lists were protected by the chunk
mutex.
However we no longer have those lists since commit 1c11b63eff
("btrfs: replace pending/pinned chunks lists with io tree"), and locking
the chunk mutex is no longer necessary because of that. The same happens
at btrfs_unfreeze_block_group(), we lock the chunk mutex because the block
group's extent map could be part of the pinned chunks list and the call
to remove_extent_mapping() could be deleting it from that list, which
used to be protected by that mutex.
So just remove those lock and unlock calls as they are not needed anymore.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When find_first_block_group() finds a block group item in the extent-tree,
it does a lookup of the object in the extent mapping tree and does further
checks on the item.
Factor out this step from find_first_block_group() so we can further
simplify the code.
While we're at it, we can also just return early in
find_first_block_group(), if the tree slot isn't found.
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We already have an fs_info in our function parameters, there's no need
to do the maths again and get fs_info from the extent_root just to get
the mapping_tree.
Instead directly grab the mapping_tree from fs_info.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Adresses held in 'logical' array are always guaranteed to fall within
the boundaries of the block group. That is, 'start' can never be
smaller than cache->start. This invariant follows from the way the
address are calculated in btrfs_rmap_block:
stripe_nr = physical - map->stripes[i].physical;
stripe_nr = div64_u64(stripe_nr, map->stripe_len);
bytenr = chunk_start + stripe_nr * io_stripe_size;
I.e it's always some IO stripe within the given chunk.
Exploit this invariant to simplify the body of the loop by removing the
unnecessary 'if' since its 'else' part is the one always executed.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
extent_map::orig_block_len contains the size of a physical stripe when
it's used to describe block groups (calculated in read_one_chunk via
calc_stripe_length or calculated in decide_stripe_size and then assigned
to extent_map::orig_block_len in create_chunk). Exploit this fact to get
the size directly rather than opencoding the calculations. No functional
changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The call to btrfs_btree_balance_dirty has been there since the early
days of BTRFS, when the btree was directly modified from the write path,
hence dirtied btree inode pages. With the implementation of b888db2bd7
("Btrfs: Add delayed allocation to the extent based page tree code")
13 years ago the btree is no longer modified from the write path, hence
there is no point in calling this function. Just remove it.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently, we allocate temp pages which is used to pad hole in
cluster during read IO submission, it may take long time before
releasing them in f2fs_decompress_pages(), since they are only
used as temp output buffer in decompression context, so let's
just do the allocation in that context to reduce time of memory
pool resource occupation.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
We missed to update isize of compressed file in write_end() with
below case:
cluster size is 16KB
- write 14KB data from offset 0
- overwrite 16KB data from offset 0
Fixes: 4c8ff7095b ("f2fs: support data compression")
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Just for code style, no logic change
1. delete useless space
2. change spaces into tab
Signed-off-by: Jack Qiu <jack.qiu@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch refactored target bpf_iter_init_seq_priv_t callback
function to accept additional information. This will be needed
in later patches for map element targets since a particular
map should be passed to traverse elements for that particular
map. In the future, other information may be passed to target
as well, e.g., pid, cgroup id, etc. to customize the iterator.
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200723184110.590156-1-yhs@fb.com
The UDP reuseport conflict was a little bit tricky.
The net-next code, via bpf-next, extracted the reuseport handling
into a helper so that the BPF sk lookup code could invoke it.
At the same time, the logic for reuseport handling of unconnected
sockets changed via commit efc6b6f6c3
which changed the logic to carry on the reuseport result into the
rest of the lookup loop if we do not return immediately.
This requires moving the reuseport_has_conns() logic into the callers.
While we are here, get rid of inline directives as they do not belong
in foo.c files.
The other changes were cases of more straightforward overlapping
modifications.
Signed-off-by: David S. Miller <davem@davemloft.net>
- Fix the layering violation in the use of the EFI runtime services
availability mask in users of the 'efivars' abstraction
- Revert build fix for GCC v4.8 which is no longer supported
- Clean up some x86 EFI stub details, some of which are borderline bugs
that copy around garbage into padding fields - let's fix these
out of caution.
- Fix build issues while working on RISC-V support
- Avoid --whole-archive when linking the stub on arm64
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAl8cCIkRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1hGsxAAsPwL5ghYykViUQId0OjNmI7eRgISKRso
3n3BaFmgYMp9eq1gndzPp7A5Ty0NPsr8FiwuZ7setk9YoLuTUT1MHVtAzd6xkxlR
838CwTDvW5HvB69uxPQDHA/1mcH1smH4Iew/J7QXP+o6Zrg+BWOjKNtTiKFwawNC
m4Tkdvvq52wzykqbeuRhXxLetKFOH//R1V0s4M6nNySuY6gQGJQ2LPyaiMN6eq1V
LE8+wGcNlIRUOeC8RkEA7CE9g92jkGZ07uJDA09OP0J5WBNWLcxMJM2mBZ1Ho6uc
sNNOeTy76sjuXQvUBWCnjBr3/qqXnVGSuIq8NS1hS4L7HagdQ/peqFJRzvWBHT90
b9wUZv09ioFm9lb6/P6NL16sn/WPknCD7umxpfp5HKrlL2p7puvsvDXtuWgyhhPG
M+X9ZX1iaA54cA9bU6cXFzrNMw/DnYjHFsECF915EXjItNZToGs0rd1lf7ArgH2P
+3HgvLods73ufObKH2pUfY7EU2Ly1oJsNpK3RmpoOUzehW+++S80KdytkaAB10kT
dKp9LlTj6gC+lnC45J9NtpHlCodz/Rc0lBHQpxIqlk/p9grUuH4zn714Ii/FfNQg
i29GX9cCdgv+4KzmJCNTqyZvGFZ5m3K8f41x7iD+/ygqPLsP9PfV5RGRGau+FfL8
ezInhKdITqM=
=n3nF
-----END PGP SIGNATURE-----
Merge tag 'efi-urgent-2020-07-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into master
Pull EFI fixes from Ingo Molnar:
"Various EFI fixes:
- Fix the layering violation in the use of the EFI runtime services
availability mask in users of the 'efivars' abstraction
- Revert build fix for GCC v4.8 which is no longer supported
- Clean up some x86 EFI stub details, some of which are borderline
bugs that copy around garbage into padding fields - let's fix these
out of caution.
- Fix build issues while working on RISC-V support
- Avoid --whole-archive when linking the stub on arm64"
* tag 'efi-urgent-2020-07-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
efi: Revert "efi/x86: Fix build with gcc 4"
efi/efivars: Expose RT service availability via efivars abstraction
efi/libstub: Move the function prototypes to header file
efi/libstub: Fix gcc error around __umoddi3 for 32 bit builds
efi/libstub/arm64: link stub lib.a conditionally
efi/x86: Only copy upto the end of setup_header
efi/x86: Remove unused variables
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAl8bmiQACgkQiiy9cAdy
T1EfFwv+IeD4kNshoBF5NT//+7j+9n+R8z74v5bJUEmoabZMuCaWuYl/uUcBqIBH
l29Xd60aF8V2jd3xIt/SEPNFOAvvBSQIjn8X52Lkquo+QzgiOElSUL8aNZe17T9L
N1S+nhwk2udlveBtWZRxHT50WNdKQCGKEwUDn9ouUbuKxLrcwWc1gRCxkegmJXey
NRaynzdCRYFx7uLdctvesTDB4P9MJn+5vX7L1BdI7MxXzsSEI1yHpM5OBL9e13ki
swZ7P16qyCA6TK8EYuYLCeXKEK0X+wGeu/y8JZ7RjO9ozqrvECvguumNbcUCs+zf
vLWQkaeC2M1L5L4WV71sY5Pi7CUGw9WGaU/FBJUK65+hYqUWETt351BwbIRwVlrx
mX7C7Dh50AM84/54u169Sk1p/S8vjdvkmx8YyRWK+t7kmZx6TP7XKXvYN1vEHS12
Y5ceRKmNHLdRqS3nY6FK7TKdgVWK2hgNODDdVbUZ+iPoJztSmH7j2RkTzxJz6qBi
vzBAPJiG
=Cvv4
-----END PGP SIGNATURE-----
Merge tag '5.8-rc6-cifs-fix' of git://git.samba.org/sfrench/cifs-2.6 into master
Pull cifs fix from Steve French:
"A fix for a recently discovered regression in rename to older servers
caused by a recent patch"
* tag '5.8-rc6-cifs-fix' of git://git.samba.org/sfrench/cifs-2.6:
Revert "cifs: Fix the target file was deleted when rename failed."
Linked requests are hashed, remove a comment stating otherwise. Also
move hash bits to emphasise that we don't carry it through loop
iteration and set it every time.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Whoever called io_prep_linked_timeout() should also do
io_queue_linked_timeout(). __io_queue_sqe() doesn't follow that for the
punting path leaving linked timeouts prepared but never queued.
Fixes: 6df1db6b54 ("io_uring: fix mis-refcounting linked timeouts")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Remove REQ_F_WORK_INITIALIZED after io_req_clean_work(). That's a cold
path but is safer for those using io_req_clean_work() out of
*dismantle_req()/*io_free(). And for the same reason zero work.fs
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
/proc/fs/nfsd/client/../state at the wrong moment.
-----BEGIN PGP SIGNATURE-----
iQJJBAABCAAzFiEEYtFWavXG9hZotryuJ5vNeUKO4b4FAl8bWF4VHGJmaWVsZHNA
ZmllbGRzZXMub3JnAAoJECebzXlCjuG+CL4P/1z2k9OPSTxNpcTv/+5+DCg5gEVp
raF3zumeMc8b0gvSRzrMLpYWRZBO/ZcDrH10WFPvySQxQGhMzPNDYsOvRcGabUOs
SDMV7RCQe7fPNlRSZejdTLnWD1ftHgNLXpBKSXPR0GbmML4BRoKwHjTM4rrr3nox
bluQHFyAHMwrNSjjRufi5qi14ThyW5qIanSEpV99qS+aeFwx0U8FL4f4eyFMA1Pq
sV/H2k9N1v/xoPOqQWF20EOZqI+AMxexMY2EQBm7LS3kv5IReSV6WTM4bcTNDl/7
jtFA3A9pIngwfh2d8BOEC12BzPmw1F1vo2kbBjvPzeo1EgtD+EN8RjJ9MP9Lwz3v
kOL09hkJrvtM24+g+VOzdwGEknxm09vmps3vyhth5EvLSLNzJwGfY7HisRxIOrmT
HZBzFNM/wX6RkzsIyrx+AjoqwmOYiz/diElE1oLGLVFrp3IVhk9Grv0IdNlWl8U5
ImtlewJY9vF6+HAsOToXZaPJGS9PBuLWwGj0zrXn5Gr+gNqZfFGLUBfz3sTVxku5
5xa34kECCV7ECGwtHGQ+X0Lp/1jhbAUxdTW0ZI1rJiWtdmx9WJttlmdxmX3dASso
PmBGYCVynMDIaGrnnQ5kig7X9aUy1GF+P9OnZ+tE5FkFdHgG3i4kjOwyOGjtTTen
YmsqNI4KBwV1wBZf
=lzOL
-----END PGP SIGNATURE-----
Merge tag 'nfsd-5.8-2' of git://linux-nfs.org/~bfields/linux into master
Pull nfsd fix from Bruce Fields:
"Just one fix for a NULL dereference if someone happens to read
/proc/fs/nfsd/client/../state at the wrong moment"
* tag 'nfsd-5.8-2' of git://linux-nfs.org/~bfields/linux:
nfsd4: fix NULL dereference in nfsd/clients display code
Drop the repeated word "the" in a comment.
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: linux-nfs@vger.kernel.org
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Merge misc fixes from Andrew Morton:
"Subsystems affected by this patch series: mm/pagemap, mm/shmem,
mm/hotfixes, mm/memcg, mm/hugetlb, mailmap, squashfs, scripts,
io-mapping, MAINTAINERS, and gdb"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
scripts/gdb: fix lx-symbols 'gdb.error' while loading modules
MAINTAINERS: add KCOV section
io-mapping: indicate mapping failure
scripts/decode_stacktrace: strip basepath from all paths
squashfs: fix length field overlap check in metadata reading
mailmap: add entry for Mike Rapoport
khugepaged: fix null-pointer dereference due to race
mm/hugetlb: avoid hardcoding while checking if cma is enabled
mm: memcg/slab: fix memory leak at non-root kmem_cache destroy
mm/memcg: fix refcount error while moving and swapping
mm/memcontrol: fix OOPS inside mem_cgroup_get_nr_swap_pages()
mm: initialize return of vm_insert_pages
vfs/xattr: mm/shmem: kernfs: release simple xattr entry in a right way
mm/mmap.c: close race between munmap() and expand_upwards()/downwards()
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl8auzgACgkQxWXV+ddt
WDv0CRAAooFO+hloV+br40eEfJwZJJk+iIvc3tyq3TRUrmt1D0G4F7nUtiHjb8JU
ch2HK+GNZkIK4747OCgcFREpYZV2m0hrKybzf/j4mYb7OXzHmeHTMfGVut1g80e7
dlpvP7q4VZbBP8BTo/8wqdSAdCUiNhLFy5oYzyUwyflJ5S8FpjY+3dXIRHUnhxPU
lxMANWhX9y/qQEceGvxqwqJBiYT6WI7dwONiULc1klWDIug/2BGZQR0WuC5PVr0G
YNuxcEU6rluWzKWJ5k3104t+N1Nc5+xglIgBLeLKAyTVYq8zAMf+P8bBPnQ3QDkV
zniNIH9ND8tYSjmGkmO0ltExFrE2o9NRnjapOFXfB0WGXee5LfzFfzd5Hk9YV+Ua
bs98VNGR4B12Iw++DvrbhbFAMxBHiBfAX/O44xJ81uAYVUs21OfefxHWrLzTJK+1
xYfiyfCDxZDGpC/weg9GOPcIZAzzoSAvqDqWHyWY5cCZdB60RaelGJprdG5fP/gA
Y+hDIdutVXMHfhaX0ktWsDvhPRXcC7MT0bjasljkN5WUJ/xZZQr6QmgngY+FA8G/
0n/dv0pYdOTK/8YVZAMO+VklzrDhziqzc2sBrH1k3MA9asa/Ls5v+r2PU+qBKZJm
cBJGtxxsx72CHbkIhtd5oGj5LNTXFdXeHph37ErzW3ajeamO4X0=
=51h/
-----END PGP SIGNATURE-----
Merge tag 'for-5.8-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into master
Pull btrfs fixes from David Sterba:
"A few resouce leak fixes from recent patches, all are stable material.
The problems have been observed during testing or have a reproducer"
* tag 'for-5.8-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix mount failure caused by race with umount
btrfs: fix page leaks after failure to lock page for delalloc
btrfs: qgroup: fix data leak caused by race between writeback and truncate
btrfs: fix double free on ulist after backref resolution failure
Two fixes, the first one to remove compilation warnings and the second
to avoid potentially inefficient allocation of BIOs for direct writes
into sequential zones.
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQSRPv8tYSvhwAzJdzjdoc3SxdoYdgUCXxqPOwAKCRDdoc3SxdoY
dkUzAQCp8p1ijemK+t2pN35dz+J9TG2idX0iUkZA5dAUkwsZmQD+NT/U52WLTqCH
eKc72BZTfdMhTK/Sk6fnUtFzVXvGhgM=
=RERh
-----END PGP SIGNATURE-----
Merge tag 'zonefs-5.8-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs into master
Pull zonefs fixes from Damien Le Moal:
"Two fixes, the first one to remove compilation warnings and the second
to avoid potentially inefficient allocation of BIOs for direct writes
into sequential zones"
* tag 'zonefs-5.8-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs:
zonefs: count pages after truncating the iterator
zonefs: Fix compilation warning
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl8bEJwQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpgCOEADPQhFAkQvKfaAKBXriQTFluDbET1PjAUNX
oa1ZRjP67O/Oxb4bEW6V6NRlteno1vqx0PD+DQfvnzlSly0WM0LUzQXLz4KDcAGW
55EdoZq3ev50bQOPkDzeqDmER0NEt5S+haVv3dvdEfqQaN9GBVQvEzk3elI5/+Sn
PO9hFhH8u3I16HgCFiQu4E4q+zK2D5j+rr82/wSsQ8wtdjzVHfaoBBTwCFaNq3r1
nGYyRY9rM/iq61l0bNKF0fqlx5sNGqHzqrxEaVsc7xbgcbn5ivDXReQ4fm3BsuUF
uabItKEyZjyr6dB0N5Eq24+S9S9i5V+9nhuZtRgjivANll/goVAVVZxqbvwrdB3q
w0SBDzOMu8LAlEYaNjnheQ+xbzedrI564MDBpyLdf4yuJmnIEmk6TR1fVtgmB+aa
AW1vplw+uBM25rkjltzdVFGpzBVVb478GgJkVbtzNaat7jUamg46s3qq02ReZXVi
W/ga9JZ87zIVE5/ClynUcGQoLSmeJRmQnnvVZjMAgw2lzE2i9xG5RwBU5cOpDKal
RwbnxoRG6FkLMXs0mAIXBsP5EvNuOSyItyCyk/LYqbjYjDTHvnYA/UFvE9MiJX+3
2S4Mt3aHzf+rBX5NDOrRBq0Ri4KF0AgXv/xNSYfrlDV5iB5NQMTv5ZQLaPoBBu/I
mph+EGVssQ==
=TBlP
-----END PGP SIGNATURE-----
Merge tag 'io_uring-5.8-2020-07-24' of git://git.kernel.dk/linux-block into master
Pull io_uring fixes from Jens Axboe:
- Fix discrepancy in how sqe->flags are treated for a few requests,
this makes it consistent (Daniele)
- Ensure that poll driven retry works with double waitqueue poll users
- Fix a missing io_req_init_async() (Pavel)
* tag 'io_uring-5.8-2020-07-24' of git://git.kernel.dk/linux-block:
io_uring: missed req_init_async() for IOSQE_ASYNC
io_uring: always allow drain/link/hardlink/async sqe flags
io_uring: ensure double poll additions work with both request types
This is a regression introduced by the "migrate from ll_rw_block usage
to BIO" patch.
Squashfs packs structures on byte boundaries, and due to that the length
field (of the metadata block) may not be fully in the current block.
The new code rewrote and introduced a faulty check for that edge case.
Fixes: 93e72b3c61 ("squashfs: migrate from ll_rw_block usage to BIO")
Reported-by: Bernd Amend <bernd.amend@gmail.com>
Signed-off-by: Phillip Lougher <phillip@squashfs.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Adrien Schildknecht <adrien+dev@schischi.me>
Cc: Guenter Roeck <groeck@chromium.org>
Cc: Daniel Rosenberg <drosen@google.com>
Link: http://lkml.kernel.org/r/20200717195536.16069-1-phillip@squashfs.org.uk
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Move io_req_init_async() into io_grab_files(), it's safer this way. Note
that io_queue_async_work() does *init_async(), so it's valid to move out
of __io_queue_sqe() punt path. Also, add a helper around io_grab_files().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Calling into opcode prep handlers may be dangerous, as they re-read
SQE but might not re-initialise requests completely. If io_req_defer()
passed fast checks and is done with preparations, punt it async.
As all other cases are covered with nulling @sqe, this guarantees that
io_[opcode]_prep() are visited only once per request.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In io_sq_thread(), if there are task works to handle, current codes
will skip schedule() and go on polling sq again, but forget to clear
IORING_SQ_NEED_WAKEUP flag, fix this issue. Also add two helpers to
set and clear IORING_SQ_NEED_WAKEUP flag,
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
As every iopoll request have a task ref, it becomes expensive to put
them one by one, instead we can put several at once integrating that
into io_req_free_batch().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Locked and pinned memory accounting in io_{,un}account_mem() depends on
having ->sqo_mm, which is NULL after a recent change for non SQPOLL'ed
io_ring. That disables the accounting.
Return ->sqo_mm initialisation back, and do __io_sq_thread_acquire_mm()
based on IORING_SETUP_SQPOLL flag.
Fixes: 8eb06d7e8d ("io_uring: fix missing ->mm on exit")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_sqe_buffer_unregister() uses cxt->sqo_mm for memory accounting, but
io_ring_ctx_free() drops ->sqo_mm before leaving pinned_vm
over-accounted. Postpone mm cleanup for when it's not needed anymore.
Fixes: 309758254e ("io_uring: report pinned memory usage")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Don't implement fast path of kbuf freeing and management inlined into
io_recv{,msg}(), that's error prone and duplicates handling. Replace it
with a helper io_put_recv_kbuf(), which mimics io_put_rw_kbuf() in the
io_read/write().
This also keeps cflags calculation in one place, removing duplication
between rw and recv/send.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Extract a common helper for cleaning up a selected buffer, this will be
used shortly. By the way, correct cflags types to unsigned and, as kbufs
are anyway tracked by a flag, remove useless zeroing req->rw.addr.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Move REQ_F_BUFFER_SELECT flag check out of io_recv_buffer_select(), and
do that in its call sites That saves us from double error checking and
possibly an extra function call.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_clean_op() may be skipped even if there is a selected io_buffer,
that's because *select_buffer() funcions never set REQ_F_NEED_CLEANUP.
Trigger io_clean_op() when REQ_F_BUFFER_SELECTED is set as well, and
and clear the flag if was freed out of it.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Instead of returning error from io_recv(), go through generic cleanup
path, because it'll retain cflags for userspace. Do the same for
io_send() for consistency.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
With the return on a bad socket, kmsg is always non-null by the end
of the function, prune left extra checks and initialisations.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Flip over "if (sock)" condition with return on error, the upper layer
will take care. That change will be handy later, but already removes
an extra jump from hot path.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently, file refs in struct io_submit_state are tracked with 2 vars:
@has_refs -- how many refs were initially taken
@used_refs -- number of refs used
Replace it with a single variable counting how many refs left at the
current moment.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
RLIMIT_SIZE in needed only for execution from an io-wq context, hence
move all preparations from hot path to io-wq work setup.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Every call to io_req_defer_prep() is prepended with allocating ->io,
just do that in the function. And while we're at it, mark error paths
with unlikey and replace "if (ret < 0)" with "if (ret)".
There is only one change in the observable behaviour, that's instead of
killing the head request right away on error, it postpones it until the
link is assembled, that looks more preferable.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
A switch in __io_clean_op() doesn't have default, it's pointless to list
opcodes that doesn't do any cleanup. Remove IORING_OP_OPEN* from there.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The only caller of io_req_work_grab_env() is io_prep_async_work(), and
they are both initialising req->work. Inline grab_env(), it's easier
to keep this way, moreover there already were bugs with misplacing
io_req_init_async().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
req->cflags is used only for defer-completion path, just use completion
data to store it. With the 4 bytes from the ->sequence patch and
compacting io_kiocb, this frees 8 bytes.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
req->sequence is used only for deferred (i.e. DRAIN) requests, but
initialised for every request. Remove req->sequence from io_kiocb
together with its initialisation in io_init_req().
Replace it with a new field in struct io_defer_entry, that will be
calculated only when needed in io_req_defer(), which is a slow path.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The only left user of req->list is DRAIN, hence instead of keeping a
separate per request list for it, do that with old fashion non-intrusive
lists allocated on demand. That's a really slow path, so that's OK.
This removes req->list and so sheds 16 bytes from io_kiocb.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Instead of using shared req->list, hang timeouts up on their own list
entry. struct io_timeout have enough extra space for it, but if that
will be a problem ->inflight_entry can reused for that.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
As with the completion path, also use compl.list for overflowed
requests. If cleaned up properly, nobody needs per-op data there
anymore.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
req->inflight_entry is used to track requests that grabbed files_struct.
Let's share it with iopoll list, because the only iopoll'ed ops are
reads and writes, which don't need a file table.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
It supports both polling and I/O polling. Rename ctx->poll to clearly
show that it's only in I/O poll case.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Calling io_req_complete(req) means that the request is done, and there
is nothing left but to clean it up. That also means that per-op data
after that should not be used, so we're free to reuse it in completion
path, e.g. to store overflow_list as done in this patch.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
As for import_iovec(), return !=NULL iovec from io_import_iovec() only
when it should be freed. That includes returning NULL when iovec is
already in req->io, because it should be deallocated by other means,
e.g. inside op handler. After io_setup_async_rw() local iovec to ->io,
just mark it NULL, to follow the idea in io_{read,write} as well.
That's easier to follow, and especially useful if we want to reuse
per-op space for completion data.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
[axboe: only call kfree() on non-NULL pointer]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Preparing reads/writes for async is a bit tricky. Extract a helper to
not repeat it twice.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Don't deref req->io->rw every time, but put it in a local variable. This
looks prettier, generates less instructions, and doesn't break alias
analysis.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_kiocb::task_work was de-unionised, and is not planned to be shared
back, because it's too useful and commonly used. Hence, instead of
keeping a separate task_work in struct io_async_rw just reuse
req->task_work.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
send/recv msghdr initialisation works with struct io_async_msghdr, but
pulls the whole struct io_async_ctx for no reason. That complicates it
with composite accessing, e.g. io->msg.
Use and pass the most specific type, which is struct io_async_msghdr.
It is the larget field in union io_async_ctx and doesn't save stack
space, but looks clearer.
The most of the changes are replacing "io->msg." with "iomsg->"
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Every second field in send/recv is called msg, make it a bit more
understandable by renaming ->msg, which is a user provided ptr,
to ->umsg.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
rings_size() sets sq_offset to the total size of the rings (the returned
value which is used for memory allocation). This is wrong: sq array should
be located within the rings, not after them. Set sq_offset to where it
should be.
Fixes: 75b28affdd ("io_uring: allocate the two rings together")
Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Hristo Venev <hristo@venev.name>
Cc: io-uring@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Merge in io_uring-5.8 fixes, as changes/cleanups to how we do locked
mem accounting require a fixup, and only one of the spots are noticed
by git as the other merges cleanly. The flags fix from io_uring-5.8
also causes a merge conflict, the leak fix for recvmsg, the double poll
fix, and the link failure locking fix.
* io_uring-5.8:
io_uring: fix lockup in io_fail_links()
io_uring: fix ->work corruption with poll_add
io_uring: missed req_init_async() for IOSQE_ASYNC
io_uring: always allow drain/link/hardlink/async sqe flags
io_uring: ensure double poll additions work with both request types
io_uring: fix recvmsg memory leak with buffer selection
io_uring: fix not initialised work->flags
io_uring: fix missing msg_name assignment
io_uring: account user memory freed when exit has been queued
io_uring: fix memleak in io_sqe_files_register()
io_uring: fix memleak in __io_sqe_files_update()
io_uring: export cq overflow status to userspace
Signed-off-by: Jens Axboe <axboe@kernel.dk>
req->work might be already initialised by the time it gets into
__io_arm_poll_handler(), which will corrupt it by using fields that are
in an union with req->work. Luckily, the only side effect is missing
put_creds(). Clean req->work before going there.
Suggested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
During umount, f2fs_put_super() unregisters procfs entries after
f2fs_destroy_segment_manager(), it may cause use-after-free
issue when umount races with procfs accessing, fix it by relocating
f2fs_unregister_sysfs().
[Chao Yu: change commit title/message a bit]
Signed-off-by: Li Guifu <bluce.liguifu@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The return value of f2fs_flush_inline_data() is not used,
so delete it.
Signed-off-by: Jia Yang <jiayang5@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This reverts commit 9ffad9263b.
Upon additional testing with older servers, it was found that
the original commit introduced a regression when using the old SMB1
dialect and rsyncing over an existing file.
The patch will need to be respun to address this, likely including
a larger refactoring of the SMB1 and SMB3 rename code paths to make
it less confusing and also to address some additional rename error
cases that SMB3 may be able to workaround.
Signed-off-by: Steve French <stfrench@microsoft.com>
Reported-by: Patrick Fernie <patrick.fernie@gmail.com>
CC: Stable <stable@vger.kernel.org>
Acked-by: Ronnie Sahlberg <lsahlber@redhat.com>
Acked-by: Pavel Shilovsky <pshilov@microsoft.com>
Acked-by: Zhang Xiaoxu <zhangxiaoxu5@huawei.com>
IOSQE_ASYNC branch of io_queue_sqe() is another place where an
unitialised req->work can be accessed (i.e. prior io_req_init_async()).
Nothing really bad though, it just looses IO_WQ_WORK_CONCURRENT flag.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Since debugfs include sensitive information it need to be treated
carefully. But it also has many very useful debug functions for userspace.
With this option we can have same configuration for system with
need of debugfs and a way to turn it off. This gives a extra protection
for exposure on systems where user-space services with system
access are attacked.
It is controlled by a configurable default value that can be override
with a kernel command line parameter. (debugfs=)
It can be on or off, but also internally on but not seen from user-space.
This no-mount mode do not register a debugfs as filesystem, but client can
register their parts in the internal structures. This data can be readed
with a debugger or saved with a crashkernel. When it is off clients
get EPERM error when accessing the functions for registering their
components.
Signed-off-by: Peter Enderborg <peter.enderborg@sony.com>
Link: https://lore.kernel.org/r/20200716071511.26864-3-peter.enderborg@sony.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
We hold the cl_lock here, and that's enough to keep stateid's from going
away, but it's not enough to prevent the files they point to from going
away. Take fi_lock and a reference and check for NULL, as we do in
other code.
Reported-by: NeilBrown <neilb@suse.de>
Fixes: 78599c42ae ("nfsd4: add file to display list of client's opens")
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Normally smp_store_release() or cmpxchg_release() is paired with
smp_load_acquire(). Sometimes smp_load_acquire() can be replaced with
the more lightweight READ_ONCE(). However, for this to be safe, all the
published memory must only be accessed in a way that involves the
pointer itself. This may not be the case if allocating the object also
involves initializing a static or global variable, for example.
fsverity_info::tree_params.hash_alg->tfm is a crypto_ahash object that's
internal to and is allocated by the crypto subsystem. So by using
READ_ONCE() for ->i_verity_info, we're relying on internal
implementation details of the crypto subsystem.
Remove this fragile assumption by using smp_load_acquire() instead.
Also fix the cmpxchg logic to correctly execute an ACQUIRE barrier when
losing the cmpxchg race, since cmpxchg doesn't guarantee a memory
barrier on failure.
(Note: I haven't seen any real-world problems here. This change is just
fixing the code to be guaranteed correct and less fragile.)
Fixes: fd2d1acfca ("fs-verity: add the hook for file ->open()")
Link: https://lore.kernel.org/r/20200721225920.114347-6-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Normally smp_store_release() or cmpxchg_release() is paired with
smp_load_acquire(). Sometimes smp_load_acquire() can be replaced with
the more lightweight READ_ONCE(). However, for this to be safe, all the
published memory must only be accessed in a way that involves the
pointer itself. This may not be the case if allocating the object also
involves initializing a static or global variable, for example.
fscrypt_info includes various sub-objects which are internal to and are
allocated by other kernel subsystems such as keyrings and crypto. So by
using READ_ONCE() for ->i_crypt_info, we're relying on internal
implementation details of these other kernel subsystems.
Remove this fragile assumption by using smp_load_acquire() instead.
(Note: I haven't seen any real-world problems here. This change is just
fixing the code to be guaranteed correct and less fragile.)
Fixes: e37a784d8b ("fscrypt: use READ_ONCE() to access ->i_crypt_info")
Link: https://lore.kernel.org/r/20200721225920.114347-5-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Normally smp_store_release() or cmpxchg_release() is paired with
smp_load_acquire(). Sometimes smp_load_acquire() can be replaced with
the more lightweight READ_ONCE(). However, for this to be safe, all the
published memory must only be accessed in a way that involves the
pointer itself. This may not be the case if allocating the object also
involves initializing a static or global variable, for example.
super_block::s_master_keys is a keyring, which is internal to and is
allocated by the keyrings subsystem. By using READ_ONCE() for it, we're
relying on internal implementation details of the keyrings subsystem.
Remove this fragile assumption by using smp_load_acquire() instead.
(Note: I haven't seen any real-world problems here. This change is just
fixing the code to be guaranteed correct and less fragile.)
Fixes: 22d94f493b ("fscrypt: add FS_IOC_ADD_ENCRYPTION_KEY ioctl")
Link: https://lore.kernel.org/r/20200721225920.114347-4-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Normally smp_store_release() or cmpxchg_release() is paired with
smp_load_acquire(). Sometimes smp_load_acquire() can be replaced with
the more lightweight READ_ONCE(). However, for this to be safe, all the
published memory must only be accessed in a way that involves the
pointer itself. This may not be the case if allocating the object also
involves initializing a static or global variable, for example.
fscrypt_prepared_key includes a pointer to a crypto_skcipher object,
which is internal to and is allocated by the crypto subsystem. By using
READ_ONCE() for it, we're relying on internal implementation details of
the crypto subsystem.
Remove this fragile assumption by using smp_load_acquire() instead.
(Note: I haven't seen any real-world problems here. This change is just
fixing the code to be guaranteed correct and less fragile.)
Fixes: 5fee36095c ("fscrypt: add inline encryption support")
Cc: Satya Tangirala <satyat@google.com>
Link: https://lore.kernel.org/r/20200721225920.114347-3-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
fscrypt_do_sha256() is only used for hashing encrypted filenames to
create no-key tokens, which isn't performance-critical. Therefore a C
implementation of SHA-256 is sufficient.
Also, the logic to create no-key tokens is always potentially needed.
This differs from fscrypt's other dependencies on crypto API algorithms,
which are conditionally needed depending on what encryption policies
userspace is using. Therefore, for fscrypt there isn't much benefit to
allowing SHA-256 to be a loadable module.
So, make fscrypt_do_sha256() use the SHA-256 library instead of the
crypto_shash API. This is much simpler, since it avoids having to
implement one-time-init (which is hard to do correctly, and in fact was
implemented incorrectly) and handle failures to allocate the
crypto_shash object.
Fixes: edc440e3d2 ("fscrypt: improve format of no-key names")
Cc: Daniel Rosenberg <drosen@google.com>
Link: https://lore.kernel.org/r/20200721225920.114347-2-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
It is possible to cause a btrfs mount to fail by racing it with a slow
umount. The crux of the sequence is generic_shutdown_super not yet
calling sop->put_super before btrfs_mount_root calls btrfs_open_devices.
If that occurs, btrfs_open_devices will decide the opened counter is
non-zero, increment it, and skip resetting fs_devices->total_rw_bytes to
0. From here, mount will call sget which will result in grab_super
trying to take the super block umount semaphore. That semaphore will be
held by the slow umount, so mount will block. Before up-ing the
semaphore, umount will delete the super block, resulting in mount's sget
reliably allocating a new one, which causes the mount path to dutifully
fill it out, and increment total_rw_bytes a second time, which causes
the mount to fail, as we see double the expected bytes.
Here is the sequence laid out in greater detail:
CPU0 CPU1
down_write sb->s_umount
btrfs_kill_super
kill_anon_super(sb)
generic_shutdown_super(sb);
shrink_dcache_for_umount(sb);
sync_filesystem(sb);
evict_inodes(sb); // SLOW
btrfs_mount_root
btrfs_scan_one_device
fs_devices = device->fs_devices
fs_info->fs_devices = fs_devices
// fs_devices-opened makes this a no-op
btrfs_open_devices(fs_devices, mode, fs_type)
s = sget(fs_type, test, set, flags, fs_info);
find sb in s_instances
grab_super(sb);
down_write(&s->s_umount); // blocks
sop->put_super(sb)
// sb->fs_devices->opened == 2; no-op
spin_lock(&sb_lock);
hlist_del_init(&sb->s_instances);
spin_unlock(&sb_lock);
up_write(&sb->s_umount);
return 0;
retry lookup
don't find sb in s_instances (deleted by CPU0)
s = alloc_super
return s;
btrfs_fill_super(s, fs_devices, data)
open_ctree // fs_devices total_rw_bytes improperly set!
btrfs_read_chunk_tree
read_one_dev // increment total_rw_bytes again!!
super_total_bytes < fs_devices->total_rw_bytes // ERROR!!!
To fix this, we clear total_rw_bytes from within btrfs_read_chunk_tree
before the calls to read_one_dev, while holding the sb umount semaphore
and the uuid mutex.
To reproduce, it is sufficient to dirty a decent number of inodes, then
quickly umount and mount.
for i in $(seq 0 500)
do
dd if=/dev/zero of="/mnt/foo/$i" bs=1M count=1
done
umount /mnt/foo&
mount /mnt/foo
does the trick for me.
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When locking pages for delalloc, we check if it's dirty and mapping still
matches. If it does not match, we need to return -EAGAIN and release all
pages. Only the current page was put though, iterate over all the
remaining pages too.
CC: stable@vger.kernel.org # 4.14+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Robbie Ko <robbieko@synology.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
When running tests like generic/013 on test device with btrfs quota
enabled, it can normally lead to data leak, detected at unmount time:
BTRFS warning (device dm-3): qgroup 0/5 has unreleased space, type 0 rsv 4096
------------[ cut here ]------------
WARNING: CPU: 11 PID: 16386 at fs/btrfs/disk-io.c:4142 close_ctree+0x1dc/0x323 [btrfs]
RIP: 0010:close_ctree+0x1dc/0x323 [btrfs]
Call Trace:
btrfs_put_super+0x15/0x17 [btrfs]
generic_shutdown_super+0x72/0x110
kill_anon_super+0x18/0x30
btrfs_kill_super+0x17/0x30 [btrfs]
deactivate_locked_super+0x3b/0xa0
deactivate_super+0x40/0x50
cleanup_mnt+0x135/0x190
__cleanup_mnt+0x12/0x20
task_work_run+0x64/0xb0
__prepare_exit_to_usermode+0x1bc/0x1c0
__syscall_return_slowpath+0x47/0x230
do_syscall_64+0x64/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
---[ end trace caf08beafeca2392 ]---
BTRFS error (device dm-3): qgroup reserved space leaked
[CAUSE]
In the offending case, the offending operations are:
2/6: writev f2X[269 1 0 0 0 0] [1006997,67,288] 0
2/7: truncate f2X[269 1 0 0 48 1026293] 18388 0
The following sequence of events could happen after the writev():
CPU1 (writeback) | CPU2 (truncate)
-----------------------------------------------------------------
btrfs_writepages() |
|- extent_write_cache_pages() |
|- Got page for 1003520 |
| 1003520 is Dirty, no writeback |
| So (!clear_page_dirty_for_io()) |
| gets called for it |
|- Now page 1003520 is Clean. |
| | btrfs_setattr()
| | |- btrfs_setsize()
| | |- truncate_setsize()
| | New i_size is 18388
|- __extent_writepage() |
| |- page_offset() > i_size |
|- btrfs_invalidatepage() |
|- Page is clean, so no qgroup |
callback executed
This means, the qgroup reserved data space is not properly released in
btrfs_invalidatepage() as the page is Clean.
[FIX]
Instead of checking the dirty bit of a page, call
btrfs_qgroup_free_data() unconditionally in btrfs_invalidatepage().
As qgroup rsv are completely bound to the QGROUP_RESERVED bit of
io_tree, not bound to page status, thus we won't cause double freeing
anyway.
Fixes: 0b34c261e2 ("btrfs: qgroup: Prevent qgroup->reserved from going subzero")
CC: stable@vger.kernel.org # 4.14+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Added a new ioctl to send discard commands or/and zero out
to selected data area of a regular file for security reason.
The way of handling range.len of F2FS_IOC_SEC_TRIM_FILE:
1. Added -1 value support for range.len to secure trim the whole blocks
starting from range.start regardless of i_size.
2. If the end of the range passes over the end of file, it means until
the end of file (i_size).
3. ignored the case of that range.len is zero to prevent the function
from making end_addr zero and triggering different behaviour of
the function.
Signed-off-by: Daeho Jeong <daehojeong@google.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
https://bugzilla.kernel.org/show_bug.cgi?id=208565
PID: 257 TASK: ecdd0000 CPU: 0 COMMAND: "init"
#0 [<c0b420ec>] (__schedule) from [<c0b423c8>]
#1 [<c0b423c8>] (schedule) from [<c0b459d4>]
#2 [<c0b459d4>] (rwsem_down_read_failed) from [<c0b44fa0>]
#3 [<c0b44fa0>] (down_read) from [<c044233c>]
#4 [<c044233c>] (f2fs_truncate_blocks) from [<c0442890>]
#5 [<c0442890>] (f2fs_truncate) from [<c044d408>]
#6 [<c044d408>] (f2fs_evict_inode) from [<c030be18>]
#7 [<c030be18>] (evict) from [<c030a558>]
#8 [<c030a558>] (iput) from [<c047c600>]
#9 [<c047c600>] (f2fs_sync_node_pages) from [<c0465414>]
#10 [<c0465414>] (f2fs_write_checkpoint) from [<c04575f4>]
#11 [<c04575f4>] (f2fs_sync_fs) from [<c0441918>]
#12 [<c0441918>] (f2fs_do_sync_file) from [<c0441098>]
#13 [<c0441098>] (f2fs_sync_file) from [<c0323fa0>]
#14 [<c0323fa0>] (vfs_fsync_range) from [<c0324294>]
#15 [<c0324294>] (do_fsync) from [<c0324014>]
#16 [<c0324014>] (sys_fsync) from [<c0108bc0>]
This can be caused by flush_dirty_inode() in f2fs_sync_node_pages() where
iput() requires f2fs_lock_op() again resulting in livelock.
Reported-by: Zhiguo Niu <Zhiguo.Niu@unisoc.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
IV_INO_LBLK_* exist only because of hardware limitations, and currently
the only known use case for them involves AES-256-XTS. Therefore, for
now only allow them in combination with AES-256-XTS. This way we don't
have to worry about them being combined with other encryption modes.
(To be clear, combining IV_INO_LBLK_* with other encryption modes
*should* work just fine. It's just not being tested, so we can't be
100% sure it works. So with no known use case, it's best to disallow it
for now, just like we don't allow other weird combinations like
AES-256-XTS contents encryption with Adiantum filenames encryption.)
This can be relaxed later if a use case for other combinations arises.
Fixes: b103fb7653 ("fscrypt: add support for IV_INO_LBLK_64 policies")
Fixes: e3b1078bed ("fscrypt: add support for IV_INO_LBLK_32 policies")
Link: https://lore.kernel.org/r/20200721181012.39308-1-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
To allow the kernel not to play games with set_fs to call exec
implement kernel_execve. The function kernel_execve takes pointers
into kernel memory and copies the values pointed to onto the new
userspace stack.
The calls with arguments from kernel space of do_execve are replaced
with calls to kernel_execve.
The calls do_execve and do_execveat are made static as there are now
no callers outside of exec.
The comments that mention do_execve are updated to refer to
kernel_execve or execve depending on the circumstances. In addition
to correcting the comments, this makes it easy to grep for do_execve
and verify it is not used.
Inspired-by: https://lkml.kernel.org/r/20200627072704.2447163-1-hch@lst.de
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lkml.kernel.org/r/87wo365ikj.fsf@x220.int.ebiederm.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
In preparation for implementiong kernel_execve (which will take kernel
pointers not userspace pointers) factor out bprm_stack_limits out of
prepare_arg_pages. This separates the counting which depends upon the
getting data from userspace from the calculations of the stack limits
which is usable in kernel_execve.
The remove prepare_args_pages and compute bprm->argc and bprm->envc
directly in do_execveat_common, before bprm_stack_limits is called.
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lkml.kernel.org/r/87365u6x60.fsf@x220.int.ebiederm.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Currently it is necessary for the usermode helper code and the code
that launches init to use set_fs so that pages coming from the kernel
look like they are coming from userspace.
To allow that usage of set_fs to be removed cleanly the argument
copying from userspace needs to happen earlier. Factor bprm_execve
out of do_execve_common to separate out the copying of arguments
to the newe stack, and the rest of exec.
In separating bprm_execve from do_execve_common the copying
of the arguments onto the new stack happens earlier.
As the copying of the arguments does not depend any security hooks,
files, the file table, current->in_execve, current->fs->in_exec,
bprm->unsafe, or creds this is safe.
Likewise the security hook security_creds_for_exec does not depend upon
preventing the argument copying from happening.
In addition to making it possible to implement kernel_execve that
performs the copying differently, this separation of bprm_execve from
do_execve_common makes for a nice separation of responsibilities making
the exec code easier to navigate.
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lkml.kernel.org/r/878sfm6x6x.fsf@x220.int.ebiederm.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Currently it is necessary for the usermode helper code and the code that
launches init to use set_fs so that pages coming from the kernel look like
they are coming from userspace.
To allow that usage of set_fs to be removed cleanly the argument copying
from userspace needs to happen earlier. Move the allocation and
initialization of bprm->mm into alloc_bprm so that the bprm->mm is
available early to store the new user stack into. This is a prerequisite
for copying argv and envp into the new user stack early before ther rest of
exec.
To keep the things consistent the cleanup of bprm->mm is moved into
free_bprm. So that bprm->mm will be cleaned up whenever bprm->mm is
allocated and free_bprm are called.
Moving bprm_mm_init earlier is safe as it does not depend on any files,
current->in_execve, current->fs->in_exec, bprm->unsafe, or the if the file
table is shared. (AKA bprm_mm_init does not depend on any of the code that
happens between alloc_bprm and where it was previously called.)
This moves bprm->mm cleanup after current->fs->in_exec is set to 0. This
is safe because current->fs->in_exec is only used to preventy taking an
additional reference on the fs_struct.
This moves bprm->mm cleanup after current->in_execve is set to 0. This is
safe because current->in_execve is only used by the lsms (apparmor and
tomoyou) and always for LSM specific functions, never for anything to do
with the mm.
This adds bprm->mm cleanup into the successful return path. This is safe
because being on the successful return path implies that begin_new_exec
succeeded and set brpm->mm to NULL. As bprm->mm is NULL bprm cleanup I am
moving into free_bprm will do nothing.
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lkml.kernel.org/r/87eepe6x7p.fsf@x220.int.ebiederm.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Currently it is necessary for the usermode helper code and the code
that launches init to use set_fs so that pages coming from the kernel
look like they are coming from userspace.
To allow that usage of set_fs to be removed cleanly the argument
copying from userspace needs to happen earlier. Move the computation
of bprm->filename and possible allocation of a name in the case
of execveat into alloc_bprm to make that possible.
The exectuable name, the arguments, and the environment are
copied into the new usermode stack which is stored in bprm
until exec passes the point of no return.
As the executable name is copied first onto the usermode stack
it needs to be known. As there are no dependencies to computing
the executable name, compute it early in alloc_bprm.
As an implementation detail if the filename needs to be generated
because it embeds a file descriptor store that filename in a new field
bprm->fdpath, and free it in free_bprm. Previously this was done in
an independent variable pathbuf. I have renamed pathbuf fdpath
because fdpath is more suggestive of what kind of path is in the
variable. I moved fdpath into struct linux_binprm because it is
tightly tied to the other variables in struct linux_binprm, and as
such is needed to allow the call alloc_binprm to move.
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lkml.kernel.org/r/87k0z66x8f.fsf@x220.int.ebiederm.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Currently it is necessary for the usermode helper code and the code
that launches init to use set_fs so that pages coming from the kernel
look like they are coming from userspace.
To allow that usage of set_fs to be removed cleanly the argument
copying from userspace needs to happen earlier. Move the allocation
of the bprm into it's own function (alloc_bprm) and move the call of
alloc_bprm before unshare_files so that bprm can ultimately be
allocated, the arguments can be placed on the new stack, and then the
bprm can be passed into the core of exec.
Neither the allocation of struct binprm nor the unsharing depend upon each
other so swapping the order in which they are called is trivially safe.
To keep things consistent the order of cleanup at the end of
do_execve_common swapped to match the order of initialization.
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lkml.kernel.org/r/87pn8y6x9a.fsf@x220.int.ebiederm.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
On-disk format for name_hash field is LE, so it must be explicitly
transformed on BE system for proper result.
Fixes: 370e812b3e ("exfat: add nls operations")
Cc: stable@vger.kernel.org # v5.7
Signed-off-by: Chen Minqiang <ptpt52@gmail.com>
Signed-off-by: Ilya Ponetayev <i.ponetaev@ndmsystems.com>
Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
The stream.size field is updated to the value of create timestamp
of the file entry. Fix this to use correct stream entry pointer.
Fixes: 29bbb14bfc ("exfat: fix incorrect update of stream entry in __exfat_truncate()")
Signed-off-by: Hyeongseok Kim <hyeongseok@gmail.com>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
We found the wrong hint_stat initialization in exfat_find_dir_entry().
It should be initialized when cluster is EXFAT_EOF_CLUSTER.
Fixes: ca06197382 ("exfat: add directory operations")
Cc: stable@vger.kernel.org # v5.7
Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
An overflow issue can occur while calculating sector in
exfat_cluster_to_sector(). It needs to cast clus's type to sector_t
before left shifting.
Fixes: 1acf1a564b ("exfat: add in-memory and on-disk structures and headers")
Cc: stable@vger.kernel.org # v5.7
Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
The name "FS_KEY_DERIVATION_NONCE_SIZE" is a bit outdated since due to
the addition of FSCRYPT_POLICY_FLAG_DIRECT_KEY, the file nonce may now
be used as a tweak instead of for key derivation. Also, we're now
prefixing the fscrypt constants with "FSCRYPT_" instead of "FS_".
Therefore, rename this constant to FSCRYPT_FILE_NONCE_SIZE.
Link: https://lore.kernel.org/r/20200708215722.147154-1-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Each HKDF context byte is associated with a specific format of the
remaining part of the application-specific info string. Add comments so
that it's easier to keep track of what these all are.
Link: https://lore.kernel.org/r/20200708215529.146890-1-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Drop the repeated word "the" in a comment.
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Chao Yu <chao@kernel.org>
Cc: linux-f2fs-devel@lists.sourceforge.net
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Memory allocated for storing compressed pages' poitner should be
released after f2fs_write_compressed_pages(), otherwise it will
cause memory leak issue.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Fixes: 4c8ff7095b ("f2fs: support data compression")
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Don't define F2FS_IOC_* aliases to ioctls that already have a generic
FS_IOC_* name. These aliases are unnecessary, and they make it unclear
which ioctls are f2fs-specific and which are generic.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Count pages after possibly truncating the iterator to the maximum zone
append size, not before.
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Avoid the compilation warning "Variable 'ret' is reassigned a value
before the old one has been used." in zonefs_create_zgroup() by setting
ret for the error path only if an error happens.
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Opening files in /proc/pid/map_files when the current user is
CAP_CHECKPOINT_RESTORE capable in the root namespace is useful for
checkpointing and restoring to recover files that are unreachable via
the file system such as deleted files, or memfd files.
Signed-off-by: Adrian Reber <areber@redhat.com>
Signed-off-by: Nicolas Viennot <Nicolas.Viennot@twosigma.com>
Reviewed-by: Cyrill Gorcunov <gorcunov@gmail.com>
Reviewed-by: Serge Hallyn <serge@hallyn.com>
Link: https://lore.kernel.org/r/20200719100418.2112740-5-areber@redhat.com
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
These codes have been commented out since 2007 and lay in kernel
since then. So, it's better to remove them.
Link: http://lkml.kernel.org/r/20200628074337.45895-1-jianyong.wu@arm.com
Signed-off-by: Jianyong Wu <jianyong.wu@arm.com>
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
In the current setattr implementation in 9p, fid is always retrieved
from dentry no matter file instance exists or not. If so, there may be
some info related to opened file instance dropped. So it's better
to retrieve fid from file instance when it is passed to setattr.
for example:
fd=open("tmp", O_RDWR);
ftruncate(fd, 10);
The file context related with the fd will be lost as fid is always
retrieved from dentry, then the backend can't get the info of
file context. It is against the original intention of user and
may lead to bug.
Link: http://lkml.kernel.org/r/20200710101548.10108-1-jianyong.wu@arm.com
Signed-off-by: Jianyong Wu <jianyong.wu@arm.com>
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
We currently filter these for timeout_remove/async_cancel/files_update,
but we only should be filtering for fixed file and buffer select. This
also causes a second read of sqe->flags, which isn't needed.
Just check req->flags for the relevant bits. This then allows these
commands to be used in links, for example, like everything else.
Signed-off-by: Daniele Albano <d.albano@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The double poll additions were centered around doing POLL_ADD on file
descriptors that use more than one waitqueue (typically one for read,
one for write) when being polled. However, it can also end up being
triggered for when we use poll triggered retry. For that case, we cannot
safely use req->io, as that could be used by the request type itself.
Add a second io_poll_iocb pointer in the structure we allocate for poll
based retry, and ensure we use the right one from the two paths.
Fixes: 18bceab101 ("io_uring: allow POLL_ADD with double poll_wait() users")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The MS_I_VERSION mount flag is exposed via the VFS, as documented
in the mount manpages etc; see the iversion and noiversion mount
options in mount(8).
As a result, mount -o remount looks for this option in /proc/mounts
and will only send the I_VERSION flag back in during remount it it
is present. Since it's not there, a remount will /remove/ the
I_VERSION flag at the vfs level, and iversion functionality is lost.
xfs v5 superblocks intend to always have i_version enabled; it is
set as a default at mount time, but is lost during remount for the
reasons above.
The generic fix would be to expose this documented option in
/proc/mounts, but since that was rejected, fix it up again in the
xfs remount path instead, so that at least xfs won't suffer from
this misbehavior.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reverting commit d03727b248 "NFSv4 fix CLOSE not waiting for
direct IO compeletion". This patch made it so that fput() by calling
inode_dio_done() in nfs_file_release() would wait uninterruptably
for any outstanding directIO to the file (but that wait on IO should
be killable).
The problem the patch was also trying to address was REMOVE returning
ERR_ACCESS because the file is still opened, is supposed to be resolved
by server returning ERR_FILE_OPEN and not ERR_ACCESS.
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl8RyOwQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpuNmD/sFxMpo0Q4szKSdFY16RxmLbeeCG8eQC+6P
Zqqd4t4tpr1tamSf4pya8zh7ivkfPlm+IFQopEEXbDAZ5P8TwF59KvABRbUYbCFM
ldQzJgvRwoTIhs0ojIY6CPMAxbpDLx8mpwgbzcjuKxbGDHEnndXPDbNO/8olxAaa
Ace5zk7TpY9YDtEXr1qe3y0riw11o/E9S/iX+M/z1KGKQcx01jU4hwesuzssde4J
rEG3TYFiHCkhfB0AtGj3zYInCYIXqqJRqEv9NP0npWB1IWbyLy9XatEDCx8aIblA
HICy09+4v5HR5h4vByRGOvT28rl//7ZB4tdzkunLWYrxYkYOqypsRI8NeDelxtWa
Iv+1Og94lQnjwOF9Iqz/q2z/OfpxlJpOvy8d5xWjhiNr9oc5ugAqVUiFjuQ6XnVG
mNJA21pJwzpesggOErIYjI13JvwW3aFylAB3fBPitHcmCusElnLunSs3/zhr9NY7
BJomUwC/KCmcp/X/WX2W5LoKEnG9WnVrJJmDWjz1wQLziKa7dvHAGUGFArpGJmJ3
TUGefdBi6q2nC7o+K26pwFfjQpA1Myf8Vp6qS957YQ7kZoI1a0bCuxp/rrq1gNFt
8HeKf4jmfqcBZeTPlZDyMWwC5F1MpK9V+KComBqSA8x5/Q0mV6BsOvY5mMHfCgky
HuD7ERgs3g==
=Aucc
-----END PGP SIGNATURE-----
Merge tag 'io_uring-5.8-2020-07-17' of git://git.kernel.dk/linux-block into master
Pull io_uring fix from Jens Axboe:
"Fix for a case where, with automatic buffer selection, we can leak the
buffer descriptor for recvmsg"
* tag 'io_uring-5.8-2020-07-17' of git://git.kernel.dk/linux-block:
io_uring: fix recvmsg memory leak with buffer selection
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCXxGFQwAKCRDh3BK/laaZ
PDYzAP9bXxHQaRdetnj6lOGNWjmVmiHfntxHqkl6QjZf6e1WlwD+NRXayVTc+Lzw
M1pBK6kqovMQVWkyFfA3dTq/BZMzfAc=
=9GPn
-----END PGP SIGNATURE-----
Merge tag 'fuse-fixes-5.8-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse into master
Pull fuse fixes from Miklos Szeredi:
- two regressions in this cycle caused by the conversion of writepage
list to an rb_tree
- two regressions in v5.4 cause by the conversion to the new mount API
- saner behavior of fsconfig(2) for the reconfigure case
- an ancient issue with FS_IOC_{GET,SET}FLAGS ioctls
* tag 'fuse-fixes-5.8-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
fuse: Fix parameter for FS_IOC_{GET,SET}FLAGS
fuse: don't ignore errors from fuse_writepages_fill()
fuse: clean up condition for writepage sending
fuse: reject options on reconfigure via fsconfig(2)
fuse: ignore 'data' argument of mount(..., MS_REMOUNT)
fuse: use ->reconfigure() instead of ->remount_fs()
fuse: fix warning in tree_insert() and clean up writepage insertion
fuse: move rb_erase() before tree_insert()
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCXxGF+QAKCRDh3BK/laaZ
PCHnAQCqNxcxncKMebpJ2hNIEPuSvUPRA4+iOOnc+9HTZ4A09wD/d/8ryybORTZN
IHq2PpQUtuGgASv6GrptJSmpDvG6RA0=
=lOD9
-----END PGP SIGNATURE-----
Merge tag 'ovl-fixes-5.8-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs into master
Pull overlayfs fixes from Miklos Szeredi:
- fix a regression introduced in v4.20 in handling a regenerated
squashfs lower layer
- two regression fixes for this cycle, one of which is Oops inducing
- miscellaneous issues
* tag 'ovl-fixes-5.8-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
ovl: fix lookup of indexed hardlinks with metacopy
ovl: fix unneeded call to ovl_change_flags()
ovl: fix mount option checks for nfs_export with no upperdir
ovl: force read-only sb on failure to create index dir
ovl: fix regression with re-formatted lower squashfs
ovl: fix oops in ovl_indexdir_cleanup() with nfs_export=on
ovl: relax WARN_ON() when decoding lower directory file handle
ovl: remove not used argument in ovl_check_origin
ovl: change ovl_copy_up_flags static
ovl: inode reference leak in ovl_is_inuse true case.
It looks like this "else" is just a typo. It turns off nconnect for
NFSv4.0 even though it works for every other version.
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
commit 0688e64bc6 ("NFS: Allow signal interruption of NFS4ERR_DELAYed operations")
introduces nfs4_delay_interruptible which also needs an _unsafe version to
avoid the following call trace for the same reason explained in
commit 416ad3c9c0 ("freezer: add unsafe versions of freezable helpers for NFS")
CPU: 4 PID: 3968 Comm: rm Tainted: G W 5.8.0-rc4 #1
Hardware name: Marvell OcteonTX CN96XX board (DT)
Call trace:
dump_backtrace+0x0/0x1dc
show_stack+0x20/0x30
dump_stack+0xdc/0x150
debug_check_no_locks_held+0x98/0xa0
nfs4_delay_interruptible+0xd8/0x120
nfs4_handle_exception+0x130/0x170
nfs4_proc_rmdir+0x8c/0x220
nfs_rmdir+0xa4/0x360
vfs_rmdir.part.0+0x6c/0x1b0
do_rmdir+0x18c/0x210
__arm64_sys_unlinkat+0x64/0x7c
el0_svc_common.constprop.0+0x7c/0x110
do_el0_svc+0x24/0xa0
el0_sync_handler+0x13c/0x1b8
el0_sync+0x158/0x180
Signed-off-by: He Zhe <zhe.he@windriver.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Using uninitialized_var() is dangerous as it papers over real bugs[1]
(or can in the future), and suppresses unrelated compiler warnings
(e.g. "unused variable"). If the compiler thinks it is uninitialized,
either simply initialize the variable or make compiler changes.
In preparation for removing[2] the[3] macro[4], remove all remaining
needless uses with the following script:
git grep '\buninitialized_var\b' | cut -d: -f1 | sort -u | \
xargs perl -pi -e \
's/\buninitialized_var\(([^\)]+)\)/\1/g;
s:\s*/\* (GCC be quiet|to make compiler happy) \*/$::g;'
drivers/video/fbdev/riva/riva_hw.c was manually tweaked to avoid
pathological white-space.
No outstanding warnings were found building allmodconfig with GCC 9.3.0
for x86_64, i386, arm64, arm, powerpc, powerpc64le, s390x, mips, sparc64,
alpha, and m68k.
[1] https://lore.kernel.org/lkml/20200603174714.192027-1-glider@google.com/
[2] https://lore.kernel.org/lkml/CA+55aFw+Vbj0i=1TGqCR5vQkCzWJ0QxK6CernOU6eedsudAixw@mail.gmail.com/
[3] https://lore.kernel.org/lkml/CA+55aFwgbgqhbp1fkxvRKEpzyR5J8n1vKT1VZdz9knmPuXhOeg@mail.gmail.com/
[4] https://lore.kernel.org/lkml/CA+55aFz2500WfbKXAx8s67wrm9=yVJu65TpLgN_ybYNv0VEOKA@mail.gmail.com/
Reviewed-by: Leon Romanovsky <leonro@mellanox.com> # drivers/infiniband and mlx4/mlx5
Acked-by: Jason Gunthorpe <jgg@mellanox.com> # IB
Acked-by: Kalle Valo <kvalo@codeaurora.org> # wireless drivers
Reviewed-by: Chao Yu <yuchao0@huawei.com> # erofs
Signed-off-by: Kees Cook <keescook@chromium.org>
This is an effort to eliminate the uninitialized_var() macro[1].
The use of this macro is the wrong solution because it forces off ANY
analysis by the compiler for a given variable. It even masks "unused
variable" warnings.
Quoted from Linus[2]:
"It's a horrible thing to use, in that it adds extra cruft to the
source code, and then shuts up a compiler warning (even the _reliable_
warnings from gcc)."
Fix it by remove this variable since it is not needed at all.
[1] https://github.com/KSPP/linux/issues/81
[2] https://lore.kernel.org/lkml/CA+55aFz2500WfbKXAx8s67wrm9=yVJu65TpLgN_ybYNv0VEOKA@mail.gmail.com/
Suggested-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jason Yan <yanaijie@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Link: https://lore.kernel.org/r/20200615085132.166470-1-yanaijie@huawei.com
Signed-off-by: Kees Cook <keescook@chromium.org>
bd_start_claiming duplicates a lot of the work done in __blkdev_get.
Integrate the two functions to avoid the duplicate work, and to do the
right thing for the md -ERESTARTSYS corner case.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The arcane magic in bd_start_claiming is only needed to be able to claim
a block_device that hasn't been fully set up. Switch the loop driver
that claims from the ioctl path with a fully set up struct block_device
to just use the much simpler bd_prepare_to_claim directly.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Move the locking and assignment of bd_claiming from bd_start_claiming to
bd_prepare_to_claim.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Insted of duplicating all the cleanup logic jump to the code that cleans
up anyway, and restart after that.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add a helper for struct file based chmode operations. To be used by
the initramfs code soon.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Add a helper for struct file based chown operations. To be used by
the initramfs code soon.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
We recently moved setting inode flag OVL_UPPERDATA to ovl_lookup().
When looking up an overlay dentry, upperdentry may be found by index
and not by name. In that case, we fail to read the metacopy xattr
and falsly set the OVL_UPPERDATA on the overlay inode.
This caused a regression in xfstest overlay/033 when run with
OVERLAY_MOUNT_OPTIONS="-o metacopy=on".
Fixes: 28166ab3c8 ("ovl: initialize OVL_UPPERDATA in ovl_lookup()")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
The check if user has changed the overlay file was wrong, causing unneeded
call to ovl_change_flags() including taking f_lock on every file access.
Fixes: d989903058 ("ovl: do not generate duplicate fsnotify events for "fake" path")
Cc: <stable@vger.kernel.org> # v4.19+
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
The afs filesystem driver allows unstarted operations to be cancelled by
signal, but most of these can easily be restarted (mkdir for example). The
primary culprits for reproducing this are those applications that use
SIGALRM to display a progress counter.
File lock-extension operation is marked uninterruptible as we have a
limited time in which to do it, and the release op is marked
uninterruptible also as if we fail to unlock a file, we'll have to wait 20
mins before anyone can lock it again.
The store operation logs a warning if it gets interruption, e.g.:
kAFS: Unexpected error from FS.StoreData -4
because it's run from the background - but it can also be run from
fdatasync()-type things. However, store options aren't marked
interruptible at the moment.
Fix this in the following ways:
(1) Mark store operations as uninterruptible. It might make sense to
relax this for certain situations, but I'm not sure how to make sure
that background store ops aren't affected by signals to foreground
processes that happen to trigger them.
(2) In afs_get_io_locks(), where we're getting the serialisation lock for
talking to the fileserver, return ERESTARTSYS rather than EINTR
because a lot of the operations (e.g. mkdir) are restartable if we
haven't yet started sending the op to the server.
Fixes: e49c7b2f6d ("afs: Build an abstraction around an "operation" concept")
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Without upperdir mount option, there is no index dir and the dependency
checks nfs_export => index for mount options parsing are incorrect.
Allow the combination nfs_export=on,index=off with no upperdir and move
the check for dependency redirect_dir=nofollow for non-upper mount case
to mount options parsing.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
With index feature enabled, on failure to create index dir, overlay is
being mounted read-only. However, we do not forbid user to remount overlay
read-write. Fix that by setting ofs->workdir to NULL, which prevents
remount read-write.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Commit 9df085f3c9 ("ovl: relax requirement for non null uuid of lower
fs") relaxed the requirement for non null uuid with single lower layer to
allow enabling index and nfs_export features with single lower squashfs.
Fabian reported a regression in a setup when overlay re-uses an existing
upper layer and re-formats the lower squashfs image. Because squashfs
has no uuid, the origin xattr in upper layer are decoded from the new
lower layer where they may resolve to a wrong origin file and user may
get an ESTALE or EIO error on lookup.
To avoid the reported regression while still allowing the new features
with single lower squashfs, do not allow decoding origin with lower null
uuid unless user opted-in to one of the new features that require
following the lower inode of non-dir upper (index, xino, metacopy).
Reported-by: Fabian <godi.beat@gmx.net>
Link: https://lore.kernel.org/linux-unionfs/32532923.JtPX5UtSzP@fgdesktop/
Fixes: 9df085f3c9 ("ovl: relax requirement for non null uuid of lower fs")
Cc: stable@vger.kernel.org # v4.20+
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Mounting with nfs_export=on, xfstests overlay/031 triggers a kernel panic
since v5.8-rc1 overlayfs updates.
overlayfs: orphan index entry (index/00fb1..., ftype=4000, nlink=2)
BUG: kernel NULL pointer dereference, address: 0000000000000030
RIP: 0010:ovl_cleanup_and_whiteout+0x28/0x220 [overlay]
Bisect point at commit c21c839b84 ("ovl: whiteout inode sharing")
Minimal reproducer:
--------------------------------------------------
rm -rf l u w m
mkdir -p l u w m
mkdir -p l/testdir
touch l/testdir/testfile
mount -t overlay -o lowerdir=l,upperdir=u,workdir=w,nfs_export=on overlay m
echo 1 > m/testdir/testfile
umount m
rm -rf u/testdir
mount -t overlay -o lowerdir=l,upperdir=u,workdir=w,nfs_export=on overlay m
umount m
--------------------------------------------------
When mount with nfs_export=on, and fail to verify an orphan index, we're
cleaning this index from indexdir by calling ovl_cleanup_and_whiteout().
This dereferences ofs->workdir, that was earlier set to NULL.
The design was that ovl->workdir will point at ovl->indexdir, but we are
assigning ofs->indexdir to ofs->workdir only after ovl_indexdir_cleanup().
There is no reason not to do it sooner, because once we get success from
ofs->indexdir = ovl_workdir_create(... there is no turning back.
Reported-and-tested-by: Murphy Zhou <jencce.kernel@gmail.com>
Fixes: c21c839b84 ("ovl: whiteout inode sharing")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Decoding a lower directory file handle to overlay path with cold
inode/dentry cache may go as follows:
1. Decode real lower file handle to lower dir path
2. Check if lower dir is indexed (was copied up)
3. If indexed, get the upper dir path from index
4. Lookup upper dir path in overlay
5. If overlay path found, verify that overlay lower is the lower dir
from step 1
On failure to verify step 5 above, user will get an ESTALE error and a
WARN_ON will be printed.
A mismatch in step 5 could be a result of lower directory that was renamed
while overlay was offline, after that lower directory has been copied up
and indexed.
This is a scripted reproducer based on xfstest overlay/052:
# Create lower subdir
create_dirs
create_test_files $lower/lowertestdir/subdir
mount_dirs
# Copy up lower dir and encode lower subdir file handle
touch $SCRATCH_MNT/lowertestdir
test_file_handles $SCRATCH_MNT/lowertestdir/subdir -p -o $tmp.fhandle
# Rename lower dir offline
unmount_dirs
mv $lower/lowertestdir $lower/lowertestdir.new/
mount_dirs
# Attempt to decode lower subdir file handle
test_file_handles $SCRATCH_MNT -p -i $tmp.fhandle
Since this WARN_ON() can be triggered by user we need to relax it.
Fixes: 4b91c30a5a ("ovl: lookup connected ancestor of dir in inode cache")
Cc: <stable@vger.kernel.org> # v4.16+
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
ovl_check_origin outparam 'ctrp' argument not used by caller. So remove
this argument.
Signed-off-by: youngjun <her0gyugyu@gmail.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
"ovl_copy_up_flags" is used in copy_up.c.
so, change it static.
Signed-off-by: youngjun <her0gyugyu@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
io_recvmsg() doesn't free memory allocated for struct io_buffer. This can
causes a leak when used with automatic buffer selection.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Break up fanotify_alloc_event() into helpers by event struct type.
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
The special overflow event is allocated as struct fanotify_path_event,
but with a null path.
Use a special event type to identify the overflow event, so the helper
fanotify_has_event_path() will always indicate a non null path.
Allocating the overflow event doesn't need any of the fancy stuff in
fanotify_alloc_event(), so create a simplified helper for allocating the
overflow event.
There is also no need to store and report the pid with an overflow event.
Link: https://lore.kernel.org/r/20200708111156.24659-7-amir73il@gmail.com
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
inotify's event->wd is the object identifier.
Compare that instead of the common fsnotidy event objectid, so
we can get rid of the objectid field later.
Link: https://lore.kernel.org/r/20200708111156.24659-6-amir73il@gmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
When creating an FS_MODIFY event on inode itself (not on parent)
the file_name argument should be NULL.
The change to send a non NULL name to inode itself was done on purpuse
as part of another commit, as Tejun writes: "...While at it, supply the
target file name to fsnotify() from kernfs_node->name.".
But this is wrong practice and inconsistent with inotify behavior when
watching a single file. When a child is being watched (as opposed to the
parent directory) the inotify event should contain the watch descriptor,
but not the file name.
Fixes: df6a58c5c5 ("kernfs: don't depend on d_find_any_alias()...")
Link: https://lore.kernel.org/r/20200708111156.24659-5-amir73il@gmail.com
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Return non const inode pointer from fsnotify_data_inode().
None of the fsnotify hooks pass const inode pointer as data and
callers often need to cast to a non const pointer.
Link: https://lore.kernel.org/r/20200708111156.24659-3-amir73il@gmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
All (two) callers of fsnotify_parent() also call fsnotify() to notify
the child inode. Move the second fsnotify() call into fsnotify_parent().
This will allow more flexibility in making decisions about which of the
two event falvors should be sent.
Using 'goto notify_child' in the inline helper seems a bit strange, but
it mimics the code in __fsnotify_parent() for clarity and the goto
pattern will become less strage after following patches are applied.
Link: https://lore.kernel.org/r/20200708111156.24659-2-amir73il@gmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
The fsnotify paths are trivial to hit even when there are no watchers and
they are surprisingly expensive. For example, every successful vfs_write()
hits fsnotify_modify which calls both fsnotify_parent and fsnotify unless
FMODE_NONOTIFY is set which is an internal flag invisible to userspace.
As it stands, fsnotify_parent is a guaranteed functional call even if there
are no watchers and fsnotify() does a substantial amount of unnecessary
work before it checks if there are any watchers. A perf profile showed
that applying mnt->mnt_fsnotify_mask in fnotify() was almost half of the
total samples taken in that function during a test. This patch rearranges
the fast paths to reduce the amount of work done when there are no
watchers.
The test motivating this was "perf bench sched messaging --pipe". Despite
the fact the pipes are anonymous, fsnotify is still called a lot and
the overhead is noticeable even though it's completely pointless. It's
likely the overhead is negligible for real IO so this is an extreme
example. This is a comparison of hackbench using processes and pipes on
a 1-socket machine with 8 CPU threads without fanotify watchers.
5.7.0 5.7.0
vanilla fastfsnotify-v1r1
Amean 1 0.4837 ( 0.00%) 0.4630 * 4.27%*
Amean 3 1.5447 ( 0.00%) 1.4557 ( 5.76%)
Amean 5 2.6037 ( 0.00%) 2.4363 ( 6.43%)
Amean 7 3.5987 ( 0.00%) 3.4757 ( 3.42%)
Amean 12 5.8267 ( 0.00%) 5.6983 ( 2.20%)
Amean 18 8.4400 ( 0.00%) 8.1327 ( 3.64%)
Amean 24 11.0187 ( 0.00%) 10.0290 * 8.98%*
Amean 30 13.1013 ( 0.00%) 12.8510 ( 1.91%)
Amean 32 13.9190 ( 0.00%) 13.2410 ( 4.87%)
5.7.0 5.7.0
vanilla fastfsnotify-v1r1
Duration User 157.05 152.79
Duration System 1279.98 1219.32
Duration Elapsed 182.81 174.52
This is showing that the latencies are improved by roughly 2-9%. The
variability is not shown but some of these results are within the noise
as this workload heavily overloads the machine. That said, the system CPU
usage is reduced by quite a bit so it makes sense to avoid the overhead
even if it is a bit tricky to detect at times. A perf profile of just 1
group of tasks showed that 5.14% of samples taken were in either fsnotify()
or fsnotify_parent(). With the patch, 2.8% of samples were in fsnotify,
mostly function entry and the initial check for watchers. The check for
watchers is complicated enough that inlining it may be controversial.
[Amir] Slightly simplify with mnt_or_sb_mask => marks_mask
Link: https://lore.kernel.org/r/20200708111156.24659-1-amir73il@gmail.com
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
When user provides large buffer for events and there are lots of events
available, we can try to copy them all to userspace without scheduling
which can softlockup the kernel (furthermore exacerbated by the
contention on notification_lock). Add a scheduling point after copying
each event.
Note that usually the real underlying problem is the cost of fanotify
event merging and the resulting contention on notification_lock but this
is a cheap way to somewhat reduce the problem until we can properly
address that.
Reported-by: Francesco Ruggeri <fruggeri@arista.com>
Link: https://lore.kernel.org/lkml/20200714025417.A25EB95C0339@us180.sjc.aristanetworks.com
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
The ioctl encoding for this parameter is a long but the documentation says
it should be an int and the kernel drivers expect it to be an int. If the
fuse driver treats this as a long it might end up scribbling over the stack
of a userspace process that only allocated enough space for an int.
This was previously discussed in [1] and a patch for fuse was proposed in
[2]. From what I can tell the patch in [2] was nacked in favor of adding
new, "fixed" ioctls and using those from userspace. However there is still
no "fixed" version of these ioctls and the fact is that it's sometimes
infeasible to change all userspace to use the new one.
Handling the ioctls specially in the fuse driver seems like the most
pragmatic way for fuse servers to support them without causing crashes in
userspace applications that call them.
[1]: https://lore.kernel.org/linux-fsdevel/20131126200559.GH20559@hall.aurel32.net/T/
[2]: https://sourceforge.net/p/fuse/mailman/message/31771759/
Signed-off-by: Chirantan Ekbote <chirantan@chromium.org>
Fixes: 59efec7b90 ("fuse: implement ioctl support")
Cc: <stable@vger.kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
commit 0688e64bc6 ("NFS: Allow signal interruption of
NFS4ERR_DELAYed operations") introduces nfs4_delay_interruptible
which also needs an _unsafe version to avoid the following call
trace for the same reason explained in commit 416ad3c9c0 ("freezer:
add unsafe versions of freezable helpers for NFS")
CPU: 4 PID: 3968 Comm: rm Tainted: G W 5.8.0-rc4 #1
Hardware name: Marvell OcteonTX CN96XX board (DT)
Call trace:
dump_backtrace+0x0/0x1dc
show_stack+0x20/0x30
dump_stack+0xdc/0x150
debug_check_no_locks_held+0x98/0xa0
nfs4_delay_interruptible+0xd8/0x120
nfs4_handle_exception+0x130/0x170
nfs4_proc_rmdir+0x8c/0x220
nfs_rmdir+0xa4/0x360
vfs_rmdir.part.0+0x6c/0x1b0
do_rmdir+0x18c/0x210
__arm64_sys_unlinkat+0x64/0x7c
el0_svc_common.constprop.0+0x7c/0x110
do_el0_svc+0x24/0xa0
el0_sync_handler+0x13c/0x1b8
el0_sync+0x158/0x180
Signed-off-by: He Zhe <zhe.he@windriver.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Remove duplicated include.
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
These two definitions are unused now.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
In the course of some operations, we look up the perag from
the mount multiple times to get or change perag information.
These are often very short pieces of code, so while the
lookup cost is generally low, the cost of the lookup is far
higher than the cost of the operation we are doing on the
perag.
Since we changed buffers to hold references to the perag
they are cached in, many modification contexts already hold
active references to the perag that are held across these
operations. This is especially true for any operation that
is serialised by an allocation group header buffer.
In these cases, we can just use the buffer's reference to
the perag to avoid needing to do lookups to access the
perag. This means that many operations don't need to do
perag lookups at all to access the perag because they've
already looked up objects that own persistent references
and hence can use that reference instead.
Cc: Dave Chinner <dchinner@redhat.com>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
fuse_writepages() ignores some errors taken from fuse_writepages_fill() I
believe it is a bug: if .writepages is called with WB_SYNC_ALL it should
either guarantee that all data was successfully saved or return error.
Fixes: 26d614df1d ("fuse: Implement writepages callback")
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
fuse_writepages_fill uses following construction:
if (wpa && ap->num_pages &&
(A || B || C)) {
action;
} else if (wpa && D) {
if (E) {
the same action;
}
}
- ap->num_pages check is always true and can be removed
- "if" and "else if" calls the same action and can be merged.
Move checking A, B, C, D, E conditions to a helper, add comments.
Original-patch-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Previous patch changed handling of remount/reconfigure to ignore all
options, including those that are unknown to the fuse kernel fs. This was
done for backward compatibility, but this likely only affects the old
mount(2) API.
The new fsconfig(2) based reconfiguration could possibly be improved. This
would make the new API less of a drop in replacement for the old, OTOH this
is a good chance to get rid of some weirdnesses in the old API.
Several other behaviors might make sense:
1) unknown options are rejected, known options are ignored
2) unknown options are rejected, known options are rejected if the value
is changed, allowed otherwise
3) all options are rejected
Prior to the backward compatibility fix to ignore all options all known
options were accepted (1), even if they change the value of a mount
parameter; fuse_reconfigure() does not look at the config values set by
fuse_parse_param().
To fix that we'd need to verify that the value provided is the same as set
in the initial configuration (2). The major drawback is that this is much
more complex than just rejecting all attempts at changing options (3);
i.e. all options signify initial configuration values and don't make sense
on reconfigure.
This patch opts for (3) with the rationale that no mount options are
reconfigurable in fuse.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
The command
mount -o remount -o unknownoption /mnt/fuse
succeeds on kernel versions prior to v5.4 and fails on kernel version at or
after. This is because fuse_parse_param() rejects any unrecognised options
in case of FS_CONTEXT_FOR_RECONFIGURE, just as for FS_CONTEXT_FOR_MOUNT.
This causes a regression in case the fuse filesystem is in fstab, since
remount sends all options found there to the kernel; even ones that are
meant for the initial mount and are consumed by the userspace fuse server.
Fix this by ignoring mount options, just as fuse_remount_fs() did prior to
the conversion to the new API.
Reported-by: Stefan Priebe <s.priebe@profihost.ag>
Fixes: c30da2e981 ("fuse: convert to use the new mount API")
Cc: <stable@vger.kernel.org> # v5.4
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
s_op->remount_fs() is only called from legacy_reconfigure(), which is not
used after being converted to the new API.
Convert to using ->reconfigure(). This restores the previous behavior of
syncing the filesystem and rejecting MS_MANDLOCK on remount.
Fixes: c30da2e981 ("fuse: convert to use the new mount API")
Cc: <stable@vger.kernel.org> # v5.4
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
fuse_writepages_fill() calls tree_insert() with ap->num_pages = 0 which
triggers the following warning:
WARNING: CPU: 1 PID: 17211 at fs/fuse/file.c:1728 tree_insert+0xab/0xc0 [fuse]
RIP: 0010:tree_insert+0xab/0xc0 [fuse]
Call Trace:
fuse_writepages_fill+0x5da/0x6a0 [fuse]
write_cache_pages+0x171/0x470
fuse_writepages+0x8a/0x100 [fuse]
do_writepages+0x43/0xe0
Fix up the warning and clean up the code around rb-tree insertion:
- Rename tree_insert() to fuse_insert_writeback() and make it return the
conflicting entry in case of failure
- Re-add tree_insert() as a wrapper around fuse_insert_writeback()
- Rename fuse_writepage_in_flight() to fuse_writepage_add() and reverse
the meaning of the return value to mean
+ "true" in case the writepage entry was successfully added
+ "false" in case it was in-fligt queued on an existing writepage
entry's auxiliary list or the existing writepage entry's temporary
page updated
Switch from fuse_find_writeback() + tree_insert() to
fuse_insert_writeback()
- Move setting orig_pages to before inserting/updating the entry; this may
result in the orig_pages value being discarded later in case of an
in-flight request
- In case of a new writepage entry use fuse_writepage_add()
unconditionally, only set data->wpa if the entry was added.
Fixes: 6b2fb79963 ("fuse: optimize writepages search")
Reported-by: kernel test robot <rong.a.chen@intel.com>
Original-path-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
In fuse_writepage_end() the old writepages entry needs to be removed from
the rbtree before inserting the new one, otherwise tree_insert() would
fail. This is a very rare codepath and no reproducer exists.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Rationale:
Reduces attack surface on kernel devs opening the links for MITM
as HTTPS traffic is much harder to manipulate.
Deterministic algorithm:
For each file:
If not .svg:
For each line:
If doesn't contain `\bxmlns\b`:
For each link, `\bhttp://[^# \t\r\n]*(?:\w|/)`:
If neither `\bgnu\.org/license`, nor `\bmozilla\.org/MPL\b`:
If both the HTTP and HTTPS versions
return 200 OK and serve the same content:
Replace HTTP with HTTPS.
Link: https://lore.kernel.org/r/20200713200738.37800-1-grandmaster@al2klimov.de
Signed-off-by: Alexander A. Klimov <grandmaster@al2klimov.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Implement client side caching for NFSv4.2 extended attributes. The cache
is a per-inode hashtable, with name/value entries. There is one special
entry for the listxattr cache.
NFS inodes have a pointer to a cache structure. The cache structure is
allocated on demand, freed when the cache is invalidated.
Memory shrinkers keep the size in check. Large entries (> PAGE_SIZE)
are collected by a separate shrinker, and freed more aggressively
than others.
Signed-off-by: Frank van der Linden <fllinden@amazon.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Now that all the lower level code is there to make the RPC calls, hook
it in to the xattr handlers and the listxattr entry point, to make them
available.
Signed-off-by: Frank van der Linden <fllinden@amazon.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Implement the extended attribute procedures for NFSv4.2 extended
attribute support (RFC 8276).
Signed-off-by: Frank van der Linden <fllinden@amazon.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Make the buf_to_pages_noslab function available to the rest of the NFS
code. Rename it to nfs4_buf_to_pages_noslab to be consistent.
This will be used later in the NFSv4.2 xattr code.
Signed-off-by: Frank van der Linden <fllinden@amazon.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Define the NFS_INO_INVALID_XATTR flag, to be used for the NFSv4.2 xattr
cache, and use it where appropriate.
No functional change as yet.
Signed-off-by: Frank van der Linden <fllinden@amazon.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Until now, change attributes in change_info form were only returned by
directory operations. However, they are also used for the RFC 8276
extended attribute operations, which work on both directories
and regular files. Modify update_changeattr to deal:
* Rename it to nfs4_update_changeattr and make it non-static.
* Don't always use INO_INVALID_DATA, this isn't needed for a
directory that only had its extended attributes changed by us.
* Existing callers now always pass in INO_INVALID_DATA.
For the current callers of this function, behavior is unchanged.
Signed-off-by: Frank van der Linden <fllinden@amazon.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
RFC 8276 defines separate ACCESS bits for extended attribute checking.
Query them in nfs_do_access and opendata.
Signed-off-by: Frank van der Linden <fllinden@amazon.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
The only consumer of nfs_access_get_cached_rcu and nfs_access_cached
calls these static functions in order to first try RCU access, and
then locked access.
Combine them in to a single function, and call that. Make this function
available to the rest of the NFS code.
Signed-off-by: Frank van der Linden <fllinden@amazon.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Define the argument and response structures that will be used for
RFC 8276 extended attribute RPC calls, and implement the necessary
functions to encode/decode the extended attribute operations.
Signed-off-by: Frank van der Linden <fllinden@amazon.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Query the server for extended attribute support, and record it
as the NFS_CAP_XATTR flag in the server capabilities.
Signed-off-by: Frank van der Linden <fllinden@amazon.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Set limits for extended attributes (attribute value size and listxattr
buffer size), based on the fs-independent limits (XATTR_*_MAX).
Define the maximum XDR sizes for the RFC 8276 XATTR operations.
In the case of operations that carry a larger payload (SETXATTR,
GETXATTR, LISTXATTR), these exclude that payload, which is added
as separate pages, like other operations do.
Define, much like for read and write operations, the maximum overhead
sizes for get/set/listxattr, and use them to limit the maximum payload
size for those operations, in combination with the channel attributes.
Signed-off-by: Frank van der Linden <fllinden@amazon.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
If the rpc_pipefs is unmounted, then the rpc_pipe->dentry becomes NULL
and dereferencing the dentry->d_sb will trigger an oops. The only
reason we're doing that is to determine the nfsd_net, which could
instead be passed in by the caller. So do that instead.
Fixes: 11a60d1592 ("nfsd: add a "GetVersion" upcall for nfsdcld")
Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
We recently fixed lease breaking so that a client's actions won't break
its own delegations.
But we still have an unnecessary self-conflict when granting
delegations: a client's own write opens will prevent us from handing out
a read delegation even when no other client has the file open for write.
Fix that by turning off the checks for conflicting opens under
vfs_setlease, and instead performing those checks in the nfsd code.
We don't depend much on locks here: instead we acquire the delegation,
then check for conflicts, and drop the delegation again if we find any.
The check beforehand is an optimization of sorts, just to avoid
acquiring the delegation unnecessarily. There's a race where the first
check could cause us to deny the delegation when we could have granted
it. But, that's OK, delegation grants are optional (and probably not
even a good idea in that case).
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
A single character (line break) should be put into a sequence.
Thus use the corresponding function "seq_putc()".
Signed-off-by: Xu Wang <vulab@iscas.ac.cn>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Check if user extended attributes are supported for an inode,
and return the answer when being queried for file attributes.
An exported filesystem can now signal its RFC8276 user extended
attributes capability.
Signed-off-by: Frank van der Linden <fllinden@amazon.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Implement the main entry points for the *XATTR operations.
Add functions to calculate the reply size for the user extended attribute
operations, and implement the XDR encode / decode logic for these
operations.
Add the user extended attributes operations to nfsd4_ops.
Signed-off-by: Frank van der Linden <fllinden@amazon.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Add the structures used in extended attribute request / response handling.
Signed-off-by: Frank van der Linden <fllinden@amazon.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Since the NFSv4.2 extended attributes extension defines 3 new access
bits for xattr operations, take them in to account when validating
what the client is asking for, and when checking permissions.
Signed-off-by: Frank van der Linden <fllinden@amazon.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
This adds the filehandle based functions for the xattr operations
that call in to the vfs layer to do the actual work.
Signed-off-by: Frank van der Linden <fllinden@amazon.com>
[ cel: address checkpatch.pl complaint ]
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Add defines for server-side extended attribute support. Most have
already been added as part of client support, but these are
the network order error codes for the noxattr and xattr2big errors,
and the addition of the xattr support to the supported file
attributes (if configured).
Signed-off-by: Frank van der Linden <fllinden@amazon.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
nfs4_decode_write has code to parse incoming XDR write data in to
a kvec head, and a list of pages.
Put this code in to a separate function, so that it can be used
later by the xattr code, for setxattr. No functional change.
Signed-off-by: Frank van der Linden <fllinden@amazon.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Add a function that checks is an extended attribute namespace is
supported for an inode, meaning that a handler must be present
for either the whole namespace, or at least one synthetic
xattr in the namespace.
To be used by the nfs server code when being queried for extended
attributes support.
Cc: linux-fsdevel@vger.kernel.org
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Frank van der Linden <fllinden@amazon.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
set/removexattr on an exported filesystem should break NFS delegations.
This is true in general, but also for the upcoming support for
RFC 8726 (NFSv4 extended attribute support). Make sure that they do.
Additionally, they need to grow a _locked variant, since callers might
call this with i_rwsem held (like the NFS server code).
Cc: stable@vger.kernel.org # v4.9+
Cc: linux-fsdevel@vger.kernel.org
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Frank van der Linden <fllinden@amazon.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Expand __receive_fd() with support for replace_fd() for the coming seccomp
"addfd" ioctl(). Add new wrapper receive_fd_replace() for the new behavior
and update existing wrappers to retain old behavior.
Thanks to Colin Ian King <colin.king@canonical.com> for pointing out an
uninitialized variable exposure in an earlier version of this patch.
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Dmitry Kadashev <dkadashev@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Sargun Dhillon <sargun@sargun.me>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
For both pidfd and seccomp, the __user pointer is not used. Update
__receive_fd() to make writing to ufd optional via a NULL check. However,
for the receive_fd_user() wrapper, ufd is NULL checked so an -EFAULT
can be returned to avoid changing the SCM_RIGHTS interface behavior. Add
new wrapper receive_fd() for pidfd and seccomp that does not use the ufd
argument. For the new helper, the allocated fd needs to be returned on
success. Update the existing callers to handle it.
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Sargun Dhillon <sargun@sargun.me>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
In preparation for users of the "install a received file" logic outside
of net/ (pidfd and seccomp), relocate and rename __scm_install_fd() from
net/core/scm.c to __receive_fd() in fs/file.c, and provide a wrapper
named receive_fd_user(), as future patches will change the interface
to __receive_fd().
Additionally add a comment to fd_install() as a counterpoint to how
__receive_fd() interacts with fput().
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Dmitry Kadashev <dkadashev@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Sargun Dhillon <sargun@sargun.me>
Cc: Ido Schimmel <idosch@idosch.org>
Cc: Ioana Ciornei <ioana.ciornei@nxp.com>
Cc: linux-fsdevel@vger.kernel.org
Cc: netdev@vger.kernel.org
Reviewed-by: Sargun Dhillon <sargun@sargun.me>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
We used to do this before 3453d5708b, but this was changed to better
handle the NFS4ERR_SEQ_MISORDERED error code. This commit fixed the slot
re-use case when the server doesn't receive the interrupted operation,
but if the server does receive the operation then it could still end up
replying to the client with mis-matched operations from the reply cache.
We can fix this by sending a SEQUENCE to the server while recovering from
a SEQ_MISORDERED error when we detect that we are in an interrupted slot
situation.
Fixes: 3453d5708b (NFSv4.1: Avoid false retries when RPC calls are interrupted)
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Make sure we specify the layout segment range when calculating the
mirror count. In theory, that number could depend on the range to
which we're writing.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Both nfs_pageio_reset_read_mds() and nfs_pageio_reset_write_mds()
do call pnfs_generic_pg_cleanup() for us.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
If the application uses the AT_STATX_DONT_SYNC flag after doing readdir(),
then we should still mark the parent inode as seeing a readdirplus hit.
That ensures that we continue to use readdirplus in the 'ls -l' type
of workflow to do fast lookups of the dentries.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl8K3ugACgkQxWXV+ddt
WDsDNBAAn5iaMNwlCBYpwAaWlltMog3SKg+vgpEcFD9qLlmimW/1TlrjjGRzp6Mn
nnNp+YjYDotqU9pP1OwESpY1LTzuVQlQL1yaiPLrehw/WsZgjdDWBk/EyU0n1vz1
Sr5wcyCVyVZZyO2/BEVTDhkvu+sj9Rcwo2QCsC2aIOTVSfQGFSklMp2VNdu2YQBy
zyTOhbwpn3OPPZsvScEujvSY9oUAN3J8WYA9jmgtwjZD7sr6UNyNI9vy8woi0VAQ
Uo7nXc43ZcS1xTwziGOpC6fZi90zrF7ZvfFT0qY92EEDcAQcCzPDl6f4OnAjr6/b
rnZcLvusEcENjFQn3pD7fCuXiIRrN8eHspj5+K/oRBTXWC5AykBwsLWt7M+tTMYa
ljEBRZlQlHMlC3xSEZNDccEvScXrEIu3Q2WrTOTXSgXi4e3q89VUTEIjAhfnTTzJ
VwHhGZIB6o+V7wZ0EhWdt9b1/Ro/AcADddV+AxTsfC1YCHVZOsSSa3DxV243ORsA
/U3t2a4SMp/iSHTtoLIwbr/O1Uj9UaOk2n1DcNbGIgdn14yYt6YWOhvrOPBampEa
zfBzmAOx9r5Mf2wWD0iTm4gJEZsrB+IpboYZ6cuBcOI29+A4k0POBfRLXgf8/jMo
5kBWm+C3KKkZO8u/Z4gtVG1ZFdxsnYAc+q+UXS5ZSJMH+++UoZQ=
=hTok
-----END PGP SIGNATURE-----
Merge tag 'for-5.8-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"Two refcounting fixes and one prepartory patch for upcoming splice
cleanup:
- fix double put of block group with nodatacow
- fix missing block group put when remounting with discard=async
- explicitly set splice callback (no functional change), to ease
integrating splice cleanup patches"
* tag 'for-5.8-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: wire up iter_file_splice_write
btrfs: fix double put of block group with nocow
btrfs: discard: add missing put when grabbing block group from unused list
59960b9deb ("io_uring: fix lazy work init") tried to fix missing
io_req_init_async(), but left out work.flags and hash. Do it earlier.
Fixes: 7cdaf587de ("io_uring: avoid whole io_wq_work copy for requests completed inline")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ensure to set msg.msg_name for the async portion of send/recvmsg,
as the header copy will copy to/from it.
Cc: stable@vger.kernel.org # v5.5+
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAl8I/pEACgkQiiy9cAdy
T1HEzAv/frpjB9Ae4EH4oI9bqAPpIvmz0OG8qjiVhA7EMuKCPzj4DdP6HwgJ41Rr
GS6IaltHHYRFmv2rKPFm9BJs0l03XJmkOlx/D+mvh0YJiNAYGan99wCVuhjotRkk
1LLpG8ZKRSGtxc+gkyaAUiKMZPHtDxvfy/VUsoCzOJBUDVByeH2LlwanXGibWEmw
DHap06dClRZ0nLf7BvfL5vtFdX5If9CUOiwgj6PY/Oy+/hhq/XjQJClhVSH1tGhG
wXMOpq5gANG0MOCkJh6GVykawjSE0gz/jc6d+zlUW4stqMLQMz7kfIqLy/OZ2z4/
zggpKmxN969ZyhLxldEJguet5+gHu7rX7dQMJb83UBtV9uC8mtgVR/5vTMCtrGBX
c9YNdzOnk7Djhb9ka1sx9KAORwagphYS4BIAhk2bQbOcW8OfkiRHVLaDou39Vbn3
3rBVTFDg7Kdg1yFhPXZM1r4HBcWupVHq/jh+kCoHcPVrwYBwNDTflTWogDjWkzph
Bc+39FSC
=c1VA
-----END PGP SIGNATURE-----
Merge tag '5.8-rc4-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fixes from Steve French:
"Four cifs/smb3 fixes: the three for stable fix problems found recently
with change notification including a reference count leak"
* tag '5.8-rc4-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
cifs: update internal module version number
cifs: fix reference leak for tlink
smb3: fix unneeded error message on change notify
cifs: remove the retry in cifs_poxis_lock_set
smb3: fix access denied on change notify request to some servers
A common question asked when debugging seccomp filters is "how many
filters are attached to your process?" Provide a way to easily answer
this question through /proc/$pid/status with a "Seccomp_filters" line.
Signed-off-by: Kees Cook <keescook@chromium.org>
debugfs_create_u32_array() allocates a small structure to wrap
the data and size information about the array. If users ever
try to remove the file this leads to a leak since nothing ever
frees this wrapper.
That said there are no upstream users of debugfs_create_u32_array()
that'd remove a u32 array file (we only have one u32 array user in
CMA), so there is no real bug here.
Make callers pass a wrapper they allocated. This way the lifetime
management of the wrapper is on the caller, and we can avoid the
potential leak in debugfs.
CC: Chucheng Luo <luochucheng@vivo.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl8IkFIQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgph7tEACdkJ9CupSjVQBJ35/2wBiUCU6lh0iUbru3
oURuw9AdjzT75en68LOyrhfdGCkNX8xrIoHnh9pGukoPid//R3rAsk+5j4Doc7SY
FBZuH7HbOC/gdTAzSd0tLyaddmf06eq0PPmlWQGUYu29Vy49nvLDM2V+455OgkT9
csCNk/Fq/np7IKVZW56KZ1Iz90Dtp8BVwFJeFdzFK5cpCjKXTyvzS/5GEFqz0rc3
AwvxkBCpDjXQqh5OOs2qM18QUeJAQV0PX1x+XjtASMB3dr4W8D5KDEhaQwiXiGOT
j7yHX/JThtOYI41IiKpF/LP9EQAkavIT1n7ZjXwablq6ecQjj24DxCe0h1uwnKyW
G5Hztjcb2YMhhDVj4q6m/uzD13+ccIeEmVoKjRuGj+0YDFETIBYK8U4GElu7VUIr
yUKqy7jFEHKFZj5eQz5JDl08+zq/ocZHJctqZhCB3DWMbE5f+VCzNGEqGGs2IiRS
J7/S8jB1j7Te8jFXBN3uaNpuj+U/JlYCPqO83fxF14ar3wZ1mblML4cw0sXUJazO
1YI3LU91qBeCNK/9xUc43MzzG/SgdIKJjfoaYYPBPMG5SzVYAtRoUUiNf942UE1k
7u8/leql+RqwSLedjU/AQqzs29p7wcK8nzFBApmTfiiBHYcRbJ6Fxp3/UwTOR3tQ
kwmkUtivtQ==
=31iz
-----END PGP SIGNATURE-----
Merge tag 'io_uring-5.8-2020-07-10' of git://git.kernel.dk/linux-block
Pull io_uring fixes from Jens Axboe:
- Fix memleak for error path in registered files (Yang)
- Export CQ overflow state in flags, necessary to fix a case where
liburing doesn't know if it needs to enter the kernel (Xiaoguang)
- Fix for a regression in when user memory is accounted freed, causing
issues with back-to-back ring exit + init if the ulimit -l setting is
very tight.
* tag 'io_uring-5.8-2020-07-10' of git://git.kernel.dk/linux-block:
io_uring: account user memory freed when exit has been queued
io_uring: fix memleak in io_sqe_files_register()
io_uring: fix memleak in __io_sqe_files_update()
io_uring: export cq overflow status to userspace
Reshuffle the (__)kernel_read and (__)kernel_write helpers, and ensure
all users of in-kernel file I/O use them if they don't use iov_iter
based methods already.
-----BEGIN PGP SIGNATURE-----
iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAl8Ij8gLHGhjaEBsc3Qu
ZGUACgkQD55TZVIEUYOcpBAAn157ooLqRrqQisEA6j59rTgkHUuqZMUx+8XjiivX
baHQPmgctza1Xzjc4PjJ1owtLpt4ywcTpY8IDj3vZF1PpffeeuWVzxMTk/aIvhNN
zPK2SJpRlDQHErKEhkTTOfOYoFTgc7vPa5Hvm6AEMaJs8oPtGZ2rnQHzPXENl/TY
TgcLd1ou3iuw19UIAfB+EfuC9uhq7pCPu9+tryNyT2IfM7fqdsIhRESpcodg1ve+
1k6leFIBrXa3MWiBGVUGCrSmlpP9xd22Zl8D/w60WeYWeg7szZoUK2bjhbdIEDZI
tTwkdZ73IKpcxOyzUVbfr2hqNa94zrXCKQGfEGVS/7arV7QH4yvhg9NU9lqVXZKV
ruPoyjsmJkHW52FfEEv1Gfrd6v4H6qZ6iyJEm3ZYNGul85O97t1xA/kKxAIwMuPa
nFhhxHIooT/We3Ao77FROhIob4D5AOfOI4gvkTE15YMzsNxT/yjilQjdDFR5An6A
ckzqb+VyDvcTx2gxR/qaol7b4lzmri4S/8Jt7WXjHOtNe9eXC4kl44leitK5j31H
fHZNyMLJ2+/JF5pGB2rNRNnTeQ7lXKob4Y+qAjRThddDxtdsf5COdZAiIiZbRurR
Ogl2k3sMDdHgNfycK2Bg5Fab9OIWePQlpcGU14afUSPviuNkIYKLGrx92ZWef53j
loI=
=eYsI
-----END PGP SIGNATURE-----
Merge tag 'cleanup-kernel_read_write' of git://git.infradead.org/users/hch/misc
Pull in-kernel read and write op cleanups from Christoph Hellwig:
"Cleanup in-kernel read and write operations
Reshuffle the (__)kernel_read and (__)kernel_write helpers, and ensure
all users of in-kernel file I/O use them if they don't use iov_iter
based methods already.
The new WARN_ONs in combination with syzcaller already found a missing
input validation in 9p. The fix should be on your way through the
maintainer ASAP".
[ This is prep-work for the real changes coming 5.9 ]
* tag 'cleanup-kernel_read_write' of git://git.infradead.org/users/hch/misc:
fs: remove __vfs_read
fs: implement kernel_read using __kernel_read
integrity/ima: switch to using __kernel_read
fs: add a __kernel_read helper
fs: remove __vfs_write
fs: implement kernel_write using __kernel_write
fs: check FMODE_WRITE in __kernel_write
fs: unexport __kernel_write
bpfilter: switch to kernel_write
autofs: switch to kernel_write
cachefiles: switch to kernel_write
-----BEGIN PGP SIGNATURE-----
iQJIBAABCAAyFiEEJZs3krPW0xkhLMTc1b+f6wMTZToFAl8IgKwUHGFncnVlbmJh
QHJlZGhhdC5jb20ACgkQ1b+f6wMTZTpm2Q//bj3ZDIjep9a4d7mRVGeX3OeslLzk
NDB2Vu03B0oZKQFYbQNdpblxy2Cfyz4m8xkNCdsD8EQ2d1zaPWhywJ6vxc1VO5Dw
wRODwRMgVe0hd9dLR8b8GzUO0+4ncpjqmyEyrCRjwPRkghcX8uuSTifXtY+yeDEv
X2BHlSGMjqCFBfq+RTa8Fi3wWFy9QhGy74QVoidMM0ulFLJbWSu0EnCXZ+hZQ4vR
sJokd2SDSP60LE964CwMxuMNUNwSMwL3VrlUm74qx1WVCK8lyYtm231E5CAHRbAw
C/f6sIKoyzyfJbv2HqgvMXvh72hO4MaJgIb8Pbht8a9GZdfk6i2JbcNmHXXk5OMN
GkYLLhkDrj4X/MChNuk20Zsylaij1+CCLb6C4UsQeXF0e/QA6iYIGRmpApGN2gNP
IA8rTz4Ibmd5ZpVMJNPOGSbq3fpPEboEoxVn+fWVvhDTopATxYS85tKqU5Bfvdr5
QcBqqeAL9yludQa520C1lIbGDBOJ57LisybMBVufklx8ZtFNNbHyB/b1YnfUBvRF
8WXVpYkh1ckB4VvVj7qnKY2/JJT0VVhQmTogqwqZy9m+Nb8I4l0pemUsJnypS0qs
KmoBvZmhWhE3tnqmCVzSvuHzO/eYGSfN91AavGBaddFzsqLLe8Hkm8kzlS5bZxGn
OVWGWVvuoSu72s8=
=dfnJ
-----END PGP SIGNATURE-----
Merge tag 'gfs2-v5.8-rc4.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2
Pull gfs2 fixes from Andreas Gruenbacher:
"Fix gfs2 readahead deadlocks by adding a IOCB_NOIO flag that allows
gfs2 to use the generic fiel read iterator functions without having to
worry about being called back while holding locks".
* tag 'gfs2-v5.8-rc4.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
gfs2: Rework read and page fault locking
fs: Add IOCB_NOIO flag for generic_file_read_iter
We currently account the memory after the exit work has been run, but
that leaves a gap where a process has closed its ring and until the
memory has been accounted as freed. If the memlocked ulimit is
borderline, then that can introduce spurious setup errors returning
-ENOMEM because the free work hasn't been run yet.
Account this as freed when we close the ring, as not to expose a tiny
gap where setting up a new ring can fail.
Fixes: 85faa7b834 ("io_uring: punt final io_ring_ctx wait-and-free to workqueue")
Cc: stable@vger.kernel.org # v5.7
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We don't use 'ctx' at all in io_sq_thread_drop_mm(), it just works
on the mm of the current task. Drop the argument.
Move io_file_put_work() to where we have the other forward declarations
of functions.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
btrfs implements the iter_write op and thus can use the more efficient
iov_iter based splice implementation. For now falling back to the less
efficient default is pretty harmless, but I have a pending series that
removes the default, and thus would cause btrfs to not support splice
at all.
Reported-by: Andy Lavr <andy.lavr@gmail.com>
Tested-by: Andy Lavr <andy.lavr@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Depending on the workloads, the following circular locking dependency
warning between sb_internal (a percpu rwsem) and fs_reclaim (a pseudo
lock) may show up:
======================================================
WARNING: possible circular locking dependency detected
5.0.0-rc1+ #60 Tainted: G W
------------------------------------------------------
fsfreeze/4346 is trying to acquire lock:
0000000026f1d784 (fs_reclaim){+.+.}, at:
fs_reclaim_acquire.part.19+0x5/0x30
but task is already holding lock:
0000000072bfc54b (sb_internal){++++}, at: percpu_down_write+0xb4/0x650
which lock already depends on the new lock.
:
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(sb_internal);
lock(fs_reclaim);
lock(sb_internal);
lock(fs_reclaim);
*** DEADLOCK ***
4 locks held by fsfreeze/4346:
#0: 00000000b478ef56 (sb_writers#8){++++}, at: percpu_down_write+0xb4/0x650
#1: 000000001ec487a9 (&type->s_umount_key#28){++++}, at: freeze_super+0xda/0x290
#2: 000000003edbd5a0 (sb_pagefaults){++++}, at: percpu_down_write+0xb4/0x650
#3: 0000000072bfc54b (sb_internal){++++}, at: percpu_down_write+0xb4/0x650
stack backtrace:
Call Trace:
dump_stack+0xe0/0x19a
print_circular_bug.isra.10.cold.34+0x2f4/0x435
check_prev_add.constprop.19+0xca1/0x15f0
validate_chain.isra.14+0x11af/0x3b50
__lock_acquire+0x728/0x1200
lock_acquire+0x269/0x5a0
fs_reclaim_acquire.part.19+0x29/0x30
fs_reclaim_acquire+0x19/0x20
kmem_cache_alloc+0x3e/0x3f0
kmem_zone_alloc+0x79/0x150
xfs_trans_alloc+0xfa/0x9d0
xfs_sync_sb+0x86/0x170
xfs_log_sbcount+0x10f/0x140
xfs_quiesce_attr+0x134/0x270
xfs_fs_freeze+0x4a/0x70
freeze_super+0x1af/0x290
do_vfs_ioctl+0xedc/0x16c0
ksys_ioctl+0x41/0x80
__x64_sys_ioctl+0x73/0xa9
do_syscall_64+0x18f/0xd23
entry_SYSCALL_64_after_hwframe+0x49/0xbe
This is a false positive as all the dirty pages are flushed out before
the filesystem can be frozen.
One way to avoid this splat is to add GFP_NOFS to the affected allocation
calls by using the memalloc_nofs_save()/memalloc_nofs_restore() pair.
This shouldn't matter unless the system is really running out of memory.
In that particular case, the filesystem freeze operation may fail while
it was succeeding previously.
Without this patch, the command sequence below will show that the lock
dependency chain sb_internal -> fs_reclaim exists.
# fsfreeze -f /home
# fsfreeze --unfreeze /home
# grep -i fs_reclaim -C 3 /proc/lockdep_chains | grep -C 5 sb_internal
After applying the patch, such sb_internal -> fs_reclaim lock dependency
chain can no longer be found. Because of that, the locking dependency
warning will not be shown.
Suggested-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Waiman Long <longman@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
While debugging a patch that I wrote I was hitting use-after-free panics
when accessing block groups on unmount. This turned out to be because
in the nocow case if we bail out of doing the nocow for whatever reason
we need to call btrfs_dec_nocow_writers() if we called the inc. This
puts our block group, but a few error cases does
if (nocow) {
btrfs_dec_nocow_writers();
goto error;
}
unfortunately, error is
error:
if (nocow)
btrfs_dec_nocow_writers();
so we get a double put on our block group. Fix this by dropping the
error cases calling of btrfs_dec_nocow_writers(), as it's handled at the
error label now.
Fixes: 762bf09893 ("btrfs: improve error handling in run_delalloc_nocow")
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Don't leak a reference to tlink during the NOTIFY ioctl
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
CC: Stable <stable@vger.kernel.org> # v5.6+
Commit
bf67fad19e ("efi: Use more granular check for availability for variable services")
introduced a check into the efivarfs, efi-pstore and other drivers that
aborts loading of the module if not all three variable runtime services
(GetVariable, SetVariable and GetNextVariable) are supported. However, this
results in efivarfs being unavailable entirely if only SetVariable support
is missing, which is only needed if you want to make any modifications.
Also, efi-pstore and the sysfs EFI variable interface could be backed by
another implementation of the 'efivars' abstraction, in which case it is
completely irrelevant which services are supported by the EFI firmware.
So make the generic 'efivars' abstraction dependent on the availibility of
the GetVariable and GetNextVariable EFI runtime services, and add a helper
'efivar_supports_writes()' to find out whether the currently active efivars
abstraction supports writes (and wire it up to the availability of
SetVariable for the generic one).
Then, use the efivar_supports_writes() helper to decide whether to permit
efivarfs to be mounted read-write, and whether to enable efi-pstore or the
sysfs EFI variable interface altogether.
Fixes: bf67fad19e ("efi: Use more granular check for availability for variable services")
Reported-by: Heinrich Schuchardt <xypron.glpk@gmx.de>
Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Tested-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Rationale:
Reduces attack surface on kernel devs opening the links for MITM
as HTTPS traffic is much harder to manipulate.
Deterministic algorithm:
For each file:
If not .svg:
For each line:
If doesn't contain `\bxmlns\b`:
For each link, `\bhttp://[^# \t\r\n]*(?:\w|/)`:
If neither `\bgnu\.org/license`, nor `\bmozilla\.org/MPL\b`:
If both the HTTP and HTTPS versions
return 200 OK and serve the same content:
Replace HTTP with HTTPS.
Link: https://lore.kernel.org/r/20200708171905.15396-1-grandmaster@al2klimov.de
Signed-off-by: Alexander A. Klimov <grandmaster@al2klimov.de>
Signed-off-by: Jan Kara <jack@suse.cz>
In order to correctly account/limit space usage, should initialize
quota info before calling quota related functions.
Link: https://lore.kernel.org/r/20200626054959.114177-1-cgxu519@mykernel.net
Signed-off-by: Chengguang Xu <cgxu519@mykernel.net>
Reviewed-by: Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Signed-off-by: Jan Kara <jack@suse.cz>
sbi->s_freeinodes_counter is only decreased by the ext2 code, it is never
increased. This patch fixes it.
Note that sbi->s_freeinodes_counter is only used in the algorithm that
tries to find the group for new allocations, so this bug is not easily
visible (the only visibility is that the group finding algorithm selects
inoptinal result).
Link: https://lore.kernel.org/r/alpine.LRH.2.02.2004201538300.19436@file01.intranet.prod.int.rdu2.redhat.com
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Almost all callers of ext2_find_entry() transform NULL return value to
-ENOENT, so just let ext2_find_entry() retuen -ENOENT instead of NULL
if no valid entry found, and also switch to check the return value of
ext2_inode_by_name() in ext2_lookup() and ext2_get_parent().
Link: https://lore.kernel.org/r/20200608034043.10451-2-yi.zhang@huawei.com
Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jan Kara <jack@suse.cz>
The same to commit <36de928641ee4> (ext4: propagate errors up to
ext4_find_entry()'s callers') in ext4, also return error instead of NULL
pointer in case of some error happens in ext2_find_entry() (e.g. -ENOMEM
or -EIO). This could avoid a negative dentry cache entry installed even
it failed to read directory block due to IO error.
Link: https://lore.kernel.org/r/20200608034043.10451-1-yi.zhang@huawei.com
Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Jan Kara <jack@suse.cz>
In the process of changing value for existing EA,
there is an improper assignment of e_value_offs(setting to 0),
because it will be reset to incorrect value in the following
loop(shifting EA values before target). Delayed assignment
can avoid this issue.
Link: https://lore.kernel.org/r/20200603084429.25344-1-cgxu519@mykernel.net
Signed-off-by: Chengguang Xu <cgxu519@mykernel.net>
Signed-off-by: Jan Kara <jack@suse.cz>
meta inode's pages are used for encrypted, verity and compressed blocks,
so the meta inode's cache invalidation condition in do_checkpoint() should
consider compression as well, not just for verity and encryption, fix it.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
For those applications which are not willing to use io_uring_enter()
to reap and handle cqes, they may completely rely on liburing's
io_uring_peek_cqe(), but if cq ring has overflowed, currently because
io_uring_peek_cqe() is not aware of this overflow, it won't enter
kernel to flush cqes, below test program can reveal this bug:
static void test_cq_overflow(struct io_uring *ring)
{
struct io_uring_cqe *cqe;
struct io_uring_sqe *sqe;
int issued = 0;
int ret = 0;
do {
sqe = io_uring_get_sqe(ring);
if (!sqe) {
fprintf(stderr, "get sqe failed\n");
break;;
}
ret = io_uring_submit(ring);
if (ret <= 0) {
if (ret != -EBUSY)
fprintf(stderr, "sqe submit failed: %d\n", ret);
break;
}
issued++;
} while (ret > 0);
assert(ret == -EBUSY);
printf("issued requests: %d\n", issued);
while (issued) {
ret = io_uring_peek_cqe(ring, &cqe);
if (ret) {
if (ret != -EAGAIN) {
fprintf(stderr, "peek completion failed: %s\n",
strerror(ret));
break;
}
printf("left requets: %d\n", issued);
continue;
}
io_uring_cqe_seen(ring, cqe);
issued--;
printf("left requets: %d\n", issued);
}
}
int main(int argc, char *argv[])
{
int ret;
struct io_uring ring;
ret = io_uring_queue_init(16, &ring, 0);
if (ret) {
fprintf(stderr, "ring setup failed: %d\n", ret);
return 1;
}
test_cq_overflow(&ring);
return 0;
}
To fix this issue, export cq overflow status to userspace by adding new
IORING_SQ_CQ_OVERFLOW flag, then helper functions() in liburing, such as
io_uring_peek_cqe, can be aware of this cq overflow and do flush accordingly.
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Except for pktdvd, the only places setting congested bits are file
systems that allocate their own backing_dev_info structures. And
pktdvd is a deprecated driver that isn't useful in stack setup
either. So remove the dead congested_fn stacking infrastructure.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Song Liu <song@kernel.org>
Acked-by: David Sterba <dsterba@suse.com>
[axboe: fixup unused variables in bcache/request.c]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
check_disk_change isn't for consumers of the block layer, so remove
the comment mentioning it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
flush_disk has only two callers, so open code it there. That also helps
clarifying the error message for the particular case, and allows to remove
setting bd_invalidated in check_disk_size_change, which will be cleared
again instantly.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
It's safe to call kfree() with a NULL pointer, but it's also pointless.
Most of the time we don't have any data to free, and at millions of
requests per second, the redundant function call adds noticeable
overhead (about 1.3% of the runtime).
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The "apoll" variable is freed and then used on the next line. We need
to move the free down a few lines.
Fixes: 0be0b0e33b ("io_uring: simplify io_async_task_func()")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Wire up ext4 to support inline encryption via the helper functions which
fs/crypto/ now provides. This includes:
- Adding a mount option 'inlinecrypt' which enables inline encryption
on encrypted files where it can be used.
- Setting the bio_crypt_ctx on bios that will be submitted to an
inline-encrypted file.
Note: submit_bh_wbc() in fs/buffer.c also needed to be patched for
this part, since ext4 sometimes uses ll_rw_block() on file data.
- Not adding logically discontiguous data to bios that will be submitted
to an inline-encrypted file.
- Not doing filesystem-layer crypto on inline-encrypted files.
Co-developed-by: Satya Tangirala <satyat@google.com>
Signed-off-by: Satya Tangirala <satyat@google.com>
Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Link: https://lore.kernel.org/r/20200702015607.1215430-5-satyat@google.com
Signed-off-by: Eric Biggers <ebiggers@google.com>
Wire up f2fs to support inline encryption via the helper functions which
fs/crypto/ now provides. This includes:
- Adding a mount option 'inlinecrypt' which enables inline encryption
on encrypted files where it can be used.
- Setting the bio_crypt_ctx on bios that will be submitted to an
inline-encrypted file.
- Not adding logically discontiguous data to bios that will be submitted
to an inline-encrypted file.
- Not doing filesystem-layer crypto on inline-encrypted files.
This patch includes a fix for a race during IPU by
Sahitya Tummala <stummala@codeaurora.org>
Signed-off-by: Satya Tangirala <satyat@google.com>
Acked-by: Jaegeuk Kim <jaegeuk@kernel.org>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Link: https://lore.kernel.org/r/20200702015607.1215430-4-satyat@google.com
Co-developed-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Add support for inline encryption to fs/crypto/. With "inline
encryption", the block layer handles the decryption/encryption as part
of the bio, instead of the filesystem doing the crypto itself via
Linux's crypto API. This model is needed in order to take advantage of
the inline encryption hardware present on most modern mobile SoCs.
To use inline encryption, the filesystem needs to be mounted with
'-o inlinecrypt'. Blk-crypto will then be used instead of the traditional
filesystem-layer crypto whenever possible to encrypt the contents
of any encrypted files in that filesystem. Fscrypt still provides the key
and IV to use, and the actual ciphertext on-disk is still the same;
therefore it's testable using the existing fscrypt ciphertext verification
tests.
Note that since blk-crypto has a fallback to Linux's crypto API, and
also supports all the encryption modes currently supported by fscrypt,
this feature is usable and testable even without actual inline
encryption hardware.
Per-filesystem changes will be needed to set encryption contexts when
submitting bios and to implement the 'inlinecrypt' mount option. This
patch just adds the common code.
Signed-off-by: Satya Tangirala <satyat@google.com>
Reviewed-by: Jaegeuk Kim <jaegeuk@kernel.org>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Link: https://lore.kernel.org/r/20200702015607.1215430-3-satyat@google.com
Co-developed-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
- don't panic kernel if f2fs_get_node_page() fails in
f2fs_recover_inline_data() or f2fs_recover_inline_xattr();
- return error number of f2fs_truncate_blocks() to
f2fs_recover_inline_data()'s caller;
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
We should not be logging a warning repeatedly on change notify.
CC: Stable <stable@vger.kernel.org> # v5.6+
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Consolidate the two in-kernel read helpers to make upcoming changes
easier. The only difference are the missing call to rw_verify_area
in kernel_read, and an access_ok check that doesn't make sense for
kernel buffers to start with.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Consolidate the two in-kernel write helpers to make upcoming changes
easier. The only difference are the missing call to rw_verify_area
in kernel_write, and an access_ok check that doesn't make sense for
kernel buffers to start with.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Add a WARN_ON_ONCE if the file isn't actually open for write. This
matches the check done in vfs_write, but actually warn warns as a
kernel user calling write on a file not opened for writing is a pretty
obvious programming error.
Signed-off-by: Christoph Hellwig <hch@lst.de>
While pipes don't really need sb_writers projection, __kernel_write is an
interface better kept private, and the additional rw_verify_area does not
hurt here.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Ian Kent <raven@themaw.net>
__kernel_write doesn't take a sb_writers references, which we need here.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Howells <dhowells@redhat.com>
Added a new gc_urgent mode, GC_URGENT_LOW, in which mode
F2FS will lower the bar of checking idle in order to
process outstanding discard commands and GC a little bit
aggressively.
Signed-off-by: Daeho Jeong <daehojeong@google.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If two readahead threads having same offset enter in readpages, every read
IOs are split and issued to the disk which giving lower bandwidth.
This patch tries to avoid redundant readahead calls.
Fixes one build error reported by Randy.
Fix build error when F2FS_FS_COMPRESSION is not set/enabled.
This label is needed in either case.
../fs/f2fs/data.c: In function ‘f2fs_mpage_readpages’:
../fs/f2fs/data.c:2327:5: error: label ‘next_page’ used but not defined
goto next_page;
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If f2fs_grab_cache_page() fails, it needs to return -ENOMEM.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The parameter op_flag is not used in f2fs_get_read_data_page(),
but it is used in f2fs_grab_read_bio(). Obviously, op_flag is
not passed to f2fs_grab_read_bio() successfully. We need to add
parameter in f2fs_submit_page_read() to pass it.
The case:
- gc_data_segment
- f2fs_get_read_data_page(.., op_flag = REQ_RAHEAD,..)
- f2fs_submit_page_read
- f2fs_grab_read_bio(.., op_flag = 0, ..)
Signed-off-by: Jia Yang <jiayang5@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
to two independent functions:
- f2fs_allocate_new_segment() for specified type segment allocation
- f2fs_allocate_new_segments() for all data type segments allocation
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
if get_node_path() return -E2BIG and trace of
f2fs_truncate_inode_blocks_enter/exit enabled
then the matching-pair of trace_exit will lost
in log.
Signed-off-by: Yubo Feng <fengyubo3@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In the f2fs_unlink we do not add trace end for some
error paths, just add.
Signed-off-by: Lihong Kou <koulihong@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In f2fs_write_raw_pages(), we need to check page dirty status before
writeback, because there could be a racer (e.g. reclaimer) helps
writebacking the dirty page.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The parameter compr is unused in the f2fs_cluster_blocks function
so we no longer need to pass it as a parameter.
Signed-off-by: Wang Xiaojun <wangxiaojun11@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
to show f2fs_fiemap()'s result as below:
f2fs_fiemap: dev = (251,0), ino = 7, lblock:0, pblock:1625292800, len:2097152, flags:0, ret:0
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
to show f2fs_bmap()'s result as below:
f2fs_bmap: dev = (251,0), ino = 7, lblock:0, pblock:396800
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If compression is disable, we should return zero rather than -EOPNOTSUPP
to indicate f2fs_bmap() is not supported.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In current version, @state will only be FREE_NID. This parameter
has no real effect so remove it to keep clean.
Signed-off-by: Liu Song <liu.song11@zte.com.cn>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
stakable/stackable
Signed-off-by: Liu Song <fishland@aliyun.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Filesystem including f2fs should support stable page for special
device like software raid, however there is one missing path that
page could be updated while it is writeback state as below, fix
this.
- gc_node_segment
- f2fs_move_node_page
- __write_node_page
- set_page_writeback
- do_read_inode
- f2fs_init_extent_tree
- __f2fs_init_extent_tree
i_ext->len = 0;
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
When f2fs_ioc_gc_range performs multiple segments gc ops, the return
value of f2fs_ioc_gc_range is determined by the last segment gc ops.
If its ops failed, the f2fs_ioc_gc_range will be considered to be failed
despite some of previous segments gc ops succeeded. Therefore, so we
fix: Redefine the return value of getting victim ops and add exception
handle for f2fs_gc. In particular, 1).if target has no valid block, it
will go on. 2).if target sectoion has valid block(s), but it is current
section, we will reminder the caller.
Signed-off-by: Qilong Zhang <zhangqilong3@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Use validation of @fio to inidcate whether caller want to serialize IOs
in io.io_list or not, then @add_list will be redundant, remove it.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
- to avoid race between checkpoint and quota file writeback, it
just needs to hold read lock of node_write in writeback path.
- node_write lock has covered all LFS data write paths, it's not
necessary, we only need to hold node_write lock at write path of
quota file.
This refactors commit ca7f76e680 ("f2fs: fix wrong discard space").
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The caller of cifs_posix_lock_set will do retry(like
fcntl_setlk64->do_lock_file_wait) if we will wait for any file_lock.
So the retry in cifs_poxis_lock_set seems duplicated, remove it to
make a cleanup.
Signed-off-by: yangerkun <yangerkun@huawei.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: NeilBrown <neilb@suse.de>
read permission, not just read attributes permission, is required
on the directory.
See MS-SMB2 (protocol specification) section 3.3.5.19.
Signed-off-by: Steve French <stfrench@microsoft.com>
CC: Stable <stable@vger.kernel.org> # v5.6+
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
So far, gfs2 has taken the inode glocks inside the ->readpage and
->readahead address space operations. Since commit d4388340ae ("fs:
convert mpage_readpages to mpage_readahead"), gfs2_readahead is passed
the pages to read ahead locked. With that, the current holder of the
inode glock may be trying to lock one of those pages while
gfs2_readahead is trying to take the inode glock, resulting in a
deadlock.
Fix that by moving the lock taking to the higher-level ->read_iter file
and ->fault vm operations. This also gets rid of an ugly lock inversion
workaround in gfs2_readpage.
The cache consistency model of filesystems like gfs2 is such that if
data is found in the page cache, the data is up to date and can be used
without taking any filesystem locks. If a page is not cached,
filesystem locks must be taken before populating the page cache.
To avoid taking the inode glock when the data is already cached,
gfs2_file_read_iter first tries to read the data with the IOCB_NOIO flag
set. If that fails, the inode glock is taken and the operation is
retried with the IOCB_NOIO flag cleared.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl8EdTkACgkQxWXV+ddt
WDv6xA/9Hguo/k6oj/7Nl9n3UUZ7gp44R/jy37fhMuNcwuEDuqIEfAgGXupdJVaj
pYDorUMRUQfI2yLB1iHAnPgBMKBidSroDsdrRHKuimnhABSO2/KX/KXPianIIRGi
wPvqZR04L565LNpRlDQx7OYkJWey7b6xf47UZqDglivnKY1OwCJlXgfCj/9FApr0
Y+PVlgEU78ExTeAHs/h8ofZ/f5T2eqiluBSFVykzCg1NngaQVOKpN3gnWEatUAvM
ekm6U4E1ZR9oOprdhlf6V96ztGzVTRKB1vFIeCvJLqLNIe+0pxlRfRn2aOj8vzEO
DRjgOlhyAIgypp78SwCspjhvejvVneSFdEGSVvHOw1ombB//OJ1qBb5G/lIcwCj3
PZ3OnQJV7+/Ty7Xt/X26W841zvnu90K0di0CsOPehtbkgkR4txgHCJB9mSlsMugN
awN5Ryy1rw1cAM5GspXG9EEOvJmnSizQf4BcK649IG5eUKThYYLc5mp68jiMljs0
NHFPg5P4yTRjk7Yqgxq5VvTPLLJo5j5xxqtY/1zDWuguRa40wIoy/JUJaJoPg9Vd
221/qRG4R4xGyZXGx6XTiWK+M3qjTlS9My9tGoWygwlExRkr7Uli9Ikef3U0tBoF
bjTcfCNOuCp+JECHNcnMZ9fhhFaMwIL1V4OflB1iicBAtXxo8Lk=
=+4BZ
-----END PGP SIGNATURE-----
Merge tag 'for-5.8-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- regression fix of a leak in global block reserve accounting
- fix a (hard to hit) race of readahead vs releasepage that could lead
to crash
- convert all remaining uses of comment fall through annotations to the
pseudo keyword
- fix crash when mounting a fuzzed image with -o recovery
* tag 'for-5.8-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: reset tree root pointer after error in init_tree_roots
btrfs: fix reclaim_size counter leak after stealing from global reserve
btrfs: fix fatal extent_buffer readahead vs releasepage race
btrfs: convert comments to fallthrough annotations
First of all don't spin in io_ring_ctx_wait_and_kill() on iopoll.
Requests won't complete faster because of that, but only lengthen
io_uring_release().
The same goes for offloaded cleanup in io_ring_exit_work() -- it
already has waiting loop, don't do blocking active spinning.
For that, pass min=0 into io_iopoll_[try_]reap_events(), so it won't
actively spin. Leave the function if io_do_iopoll() there can't
complete a request to sleep in io_ring_exit_work().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Nobody checks io_iopoll_check()'s output parameter @nr_events.
Remove the parameter and declare it further down the stack.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_iopoll_reap_events() doesn't care about returned valued of
io_iopoll_getevents() and does the same checks for list emptiness
and need_resched(). Just use io_do_iopoll().
io_sq_thread() doesn't check return value as well. It also passes min=0,
so there never be the second iteration inside io_poll_getevents().
Inline it there too.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Make sure the rtbitmap is large enough to store the entire bitmap.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Allison Collins <allison.henderson@oracle.com>
Ensure that the realtime bitmap file is backed entirely by written
extents. No holes, no unwritten blocks, etc.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Allison Collins <allison.henderson@oracle.com>
This debug code is called on every xfs_iflush() call, which then
checks every inode in the buffer for non-zero unlinked list field.
Hence it checks every inode in the cluster buffer every time a
single inode on that cluster it flushed. This is resulting in:
- 38.91% 5.33% [kernel] [k] xfs_iflush
- 17.70% xfs_iflush
- 9.93% xfs_inobp_check
4.36% xfs_buf_offset
10% of the CPU time spent flushing inodes is repeatedly checking
unlinked fields in the buffer. We don't need to do this.
The other place we call xfs_inobp_check() is
xfs_iunlink_update_dinode(), and this is after we've done this
assert for the agino we are about to write into that inode:
ASSERT(xfs_verify_agino_or_null(mp, agno, next_agino));
which means we've already checked that the agino we are about to
write is not 0 on debug kernels. The inode buffer verifiers do
everything else we need, so let's just remove this debug code.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
xfs_iflush_done() does 3 distinct operations to the inodes attached
to the buffer. Separate these operations out into functions so that
it is easier to modify these operations independently in future.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Now that we have all the dirty inodes attached to the cluster
buffer, we don't actually have to do radix tree lookups to find
them. Sure, the radix tree is efficient, but walking a linked list
of just the dirty inodes attached to the buffer is much better.
We are also no longer dependent on having a locked inode passed into
the function to determine where to start the lookup. This means we
can drop it from the function call and treat all inodes the same.
We also make xfs_iflush_cluster skip inodes marked with
XFS_IRECLAIM. This we avoid races with inodes that reclaim is
actively referencing or are being re-initialised by inode lookup. If
they are actually dirty, they'll get written by a future cluster
flush....
We also add a shutdown check after obtaining the flush lock so that
we catch inodes that are dirty in memory and may have inconsistent
state due to the shutdown in progress. We abort these inodes
directly and so they remove themselves directly from the buffer list
and the AIL rather than having to wait for the buffer to be failed
and callbacks run to be processed correctly.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
with xfs_iflush() gone, we can rename xfs_iflush_int() back to
xfs_iflush(). Also move it up above xfs_iflush_cluster() so we don't
need the forward definition any more.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Now we have a cached buffer on inode log items, we don't need
to do buffer lookups when flushing inodes anymore - all we need
to do is lock the buffer and we are ready to go.
This largely gets rid of the need for xfs_iflush(), which is
essentially just a mechanism to look up the buffer and flush the
inode to it. Instead, we can just call xfs_iflush_cluster() with a
few modifications to ensure it also flushes the inode we already
hold locked.
This allows the AIL inode item pushing to be almost entirely
non-blocking in XFS - we won't block unless memory allocation
for the cluster inode lookup blocks or the block device queues are
full.
Writeback during inode reclaim becomes a little more complex because
we now have to lock the buffer ourselves, but otherwise this change
is largely a functional no-op that removes a whole lot of code.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Rather than attach inodes to the cluster buffer just when we are
doing IO, attach the inodes to the cluster buffer when they are
dirtied. The means the buffer always carries a list of dirty inodes
that reference it, and we can use that list to make more fundamental
changes to inode writeback that aren't otherwise possible.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Once we have inodes pinning the cluster buffer and attached whenever
they are dirty, we no longer have a guarantee that the items are
flush locked when we lock the cluster buffer. Hence we cannot just
walk the buffer log item list and modify the attached inodes.
If the inode is not flush locked, we have to ILOCK it first and then
flush lock it to do all the prerequisite checks needed to avoid
races with other code. This is already handled by
xfs_ifree_get_one_inode(), so rework the inode iteration loop and
function to update all inodes in cache whether they are attached to
the buffer or not.
Note: we also remove the copying of the log item lsn to the
ili_flush_lsn as xfs_iflush_done() now uses the XFS_ISTALE flag to
trigger aborts and so flush lsn matching is not needed in IO
completion for processing freed inodes.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Inode reclaim is quite different now to the way described in various
comments, so update all the comments explaining what it does and how
it works.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Clean up xfs_reclaim_inodes() callers. Most callers want blocking
behaviour, so just make the existing SYNC_WAIT behaviour the
default.
For the xfs_reclaim_worker(), just call xfs_reclaim_inodes_ag()
directly because we just want optimistic clean inode reclaim to be
done in the background.
For xfs_quiesce_attr() we can just remove the inode reclaim calls as
they are a historic relic that was required to flush dirty inodes
that contained unlogged changes. We now log all changes to the
inodes, so the sync AIL push from xfs_log_quiesce() called by
xfs_quiesce_attr() will do all the required inode writeback for
freeze.
Seeing as we now want to loop until all reclaimable inodes have been
reclaimed, make xfs_reclaim_inodes() loop on the XFS_ICI_RECLAIM_TAG
tag rather than having xfs_reclaim_inodes_ag() tell it that inodes
were skipped. This is much more reliable and will always loop until
all reclaimable inodes are reclaimed.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
All background reclaim is SYNC_TRYLOCK already, and even blocking
reclaim (SYNC_WAIT) can use trylock mechanisms as
xfs_reclaim_inodes_ag() will keep cycling until there are no more
reclaimable inodes. Hence we can kill SYNC_TRYLOCK from inode
reclaim and make everything unconditionally non-blocking.
We remove all the optimistic "avoid blocking on locks" checks done
in xfs_reclaim_inode_grab() as nothing blocks on locks anymore.
Further, checking XFS_IFLOCK optimistically can result in detecting
inodes in the process of being cleaned (i.e. between being removed
from the AIL and having the flush lock dropped), so for
xfs_reclaim_inodes() to reliably reclaim all inodes we need to drop
these checks anyway.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
When we attempt to reclaim an inode, the first thing we do is take
the inode lock. This is blocking right now, so if the inode being
accessed by something else (e.g. being flushed to the cluster
buffer) we will block here.
Change this to a trylock so that we do not block inode reclaim
unnecessarily here.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Inode reclaim will still throttle direct reclaim on the per-ag
reclaim locks. This is no longer necessary as reclaim can run
non-blocking now. Hence we can remove these locks so that we don't
arbitrarily block reclaimers just because there are more direct
reclaimers than there are AGs.
This can result in multiple reclaimers working on the same range of
an AG, but this doesn't cause any apparent issues. Optimising the
spread of concurrent reclaimers for best efficiency can be done in a
future patchset.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
We no longer need to issue IO from shrinker based inode reclaim to
prevent spurious OOM killer invocation. This leaves only the global
filesystem management operations such as unmount needing to
writeback dirty inodes and reclaim them.
Instead of using the reclaim pass to write dirty inodes before
reclaiming them, use the AIL to push all the dirty inodes before we
try to reclaim them. This allows us to remove all the conditional
SYNC_WAIT locking and the writeback code from xfs_reclaim_inode()
and greatly simplify the checks we need to do to reclaim an inode.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Now that dirty inode writeback doesn't cause read-modify-write
cycles on the inode cluster buffer under memory pressure, the need
to throttle memory reclaim to the rate at which we can clean dirty
inodes goes away. That is due to the fact that we no longer thrash
inode cluster buffers under memory pressure to clean dirty inodes.
This means inode writeback no longer stalls on memory allocation
or read IO, and hence can be done asynchronously without generating
memory pressure. As a result, blocking inode writeback in reclaim is
no longer necessary to prevent reclaim priority windup as cleaning
dirty inodes is no longer dependent on having memory reserves
available for the filesystem to make progress reclaiming inodes.
Hence we can convert inode reclaim to be non-blocking for shrinker
callouts, both for direct reclaim and kswapd.
On a vanilla kernel, running a 16-way fsmark create workload on a
4 node/16p/16GB RAM machine, I can reliably pin 14.75GB of RAM via
userspace mlock(). The OOM killer gets invoked at 15GB of
pinned RAM.
Without the inode cluster pinning, this non-blocking reclaim patch
triggers premature OOM killer invocation with the same memory
pinning, sometimes with as much as 45% of RAM being free. It's
trivially easy to trigger the OOM killer when reclaim does not
block.
With pinning inode clusters in RAM and then adding this patch, I can
reliably pin 14.5GB of RAM and still have the fsmark workload run to
completion. The OOM killer gets invoked 14.75GB of pinned RAM, which
is only a small amount of memory less than the vanilla kernel. It is
much more reliable than just with async reclaim alone.
simoops shows that allocation stalls go away when async reclaim is
used. Vanilla kernel:
Run time: 1924 seconds
Read latency (p50: 3,305,472) (p95: 3,723,264) (p99: 4,001,792)
Write latency (p50: 184,064) (p95: 553,984) (p99: 807,936)
Allocation latency (p50: 2,641,920) (p95: 3,911,680) (p99: 4,464,640)
work rate = 13.45/sec (avg 13.44/sec) (p50: 13.46) (p95: 13.58) (p99: 13.70)
alloc stall rate = 3.80/sec (avg: 2.59) (p50: 2.54) (p95: 2.96) (p99: 3.02)
With inode cluster pinning and async reclaim:
Run time: 1924 seconds
Read latency (p50: 3,305,472) (p95: 3,715,072) (p99: 3,977,216)
Write latency (p50: 187,648) (p95: 553,984) (p99: 789,504)
Allocation latency (p50: 2,748,416) (p95: 3,919,872) (p99: 4,448,256)
work rate = 13.28/sec (avg 13.32/sec) (p50: 13.26) (p95: 13.34) (p99: 13.34)
alloc stall rate = 0.02/sec (avg: 0.02) (p50: 0.01) (p95: 0.03) (p99: 0.03)
Latencies don't really change much, nor does the work rate. However,
allocation almost never stalls with these changes, whilst the
vanilla kernel is sometimes reporting 20 stalls/s over a 60s sample
period. This difference is due to inode reclaim being largely
non-blocking now.
IOWs, once we have pinned inode cluster buffers, we can make inode
reclaim non-blocking without a major risk of premature and/or
spurious OOM killer invocation, and without any changes to memory
reclaim infrastructure.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
When we dirty an inode, we are going to have to write it disk at
some point in the near future. This requires the inode cluster
backing buffer to be present in memory. Unfortunately, under severe
memory pressure we can reclaim the inode backing buffer while the
inode is dirty in memory, resulting in stalling the AIL pushing
because it has to do a read-modify-write cycle on the cluster
buffer.
When we have no memory available, the read of the cluster buffer
blocks the AIL pushing process, and this causes all sorts of issues
for memory reclaim as it requires inode writeback to make forwards
progress. Allocating a cluster buffer causes more memory pressure,
and results in more cluster buffers to be reclaimed, resulting in
more RMW cycles to be done in the AIL context and everything then
backs up on AIL progress. Only the synchronous inode cluster
writeback in the the inode reclaim code provides some level of
forwards progress guarantees that prevent OOM-killer rampages in
this situation.
Fix this by pinning the inode backing buffer to the inode log item
when the inode is first dirtied (i.e. in xfs_trans_log_inode()).
This may mean the first modification of an inode that has been held
in cache for a long time may block on a cluster buffer read, but
we can do that in transaction context and block safely until the
buffer has been allocated and read.
Once we have the cluster buffer, the inode log item takes a
reference to it, pinning it in memory, and attaches it to the log
item for future reference. This means we can always grab the cluster
buffer from the inode log item when we need it.
When the inode is finally cleaned and removed from the AIL, we can
drop the reference the inode log item holds on the cluster buffer.
Once all inodes on the cluster buffer are clean, the cluster buffer
will be unpinned and it will be available for memory reclaim to
reclaim again.
This avoids the issues with needing to do RMW cycles in the AIL
pushing context, and hence allows complete non-blocking inode
flushing to be performed by the AIL pushing context.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
xfs_ail_delete_one() is called directly from dquot and inode IO
completion, as well as from the generic xfs_trans_ail_delete()
function. Inodes are about to have their own failure handling, and
dquots will in future, too. Pull the clearing of the LI_FAILED flag
up into the callers so we can customise the code appropriately.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
When an buffer IO error occurs, we want to mark all
the log items attached to the buffer as failed. Open code
the error handling loop so that we can modify the flagging for the
different types of objects directly and independently of each other.
This also allows us to remove the ->iop_error method from the log
item operations.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Currently when a buffer with attached log items has an IO error
it called ->iop_error for each attched log item. These all call
xfs_set_li_failed() to handle the error, but we are about to change
the way log items manage buffers. hence we first need to remove the
per-item dependency on buffer handling done by xfs_set_li_failed().
We already have specific buffer type IO completion routines, so move
the log item error handling out of the generic error handling and
into the log item specific functions so we can implement per-type
error handling easily.
This requires a more complex return value from the error handling
code so that we can take the correct action the failure handling
requires. This results in some repeated boilerplate in the
functions, but that can be cleaned up later once all the changes
cascade through this code.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
They are not used anymore, so remove them from the log item and the
buffer iodone attachment interfaces.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Now that we've sorted inode and dquot buffers, we can apply the same
cleanups to dirty buffers with buffer log items. They only have one
callback, too, so we don't need the log item callback. Collapse the
iodone functions and remove all the now unnecessary infrastructure
around callback processing.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
[BUG]
The following small test script can trigger ASSERT() at unmount time:
mkfs.btrfs -f $dev
mount $dev $mnt
mount -o remount,discard=async $mnt
umount $mnt
The call trace:
assertion failed: atomic_read(&block_group->count) == 1, in fs/btrfs/block-group.c:3431
------------[ cut here ]------------
kernel BUG at fs/btrfs/ctree.h:3204!
invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
CPU: 4 PID: 10389 Comm: umount Tainted: G O 5.8.0-rc3-custom+ #68
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
Call Trace:
btrfs_free_block_groups.cold+0x22/0x55 [btrfs]
close_ctree+0x2cb/0x323 [btrfs]
btrfs_put_super+0x15/0x17 [btrfs]
generic_shutdown_super+0x72/0x110
kill_anon_super+0x18/0x30
btrfs_kill_super+0x17/0x30 [btrfs]
deactivate_locked_super+0x3b/0xa0
deactivate_super+0x40/0x50
cleanup_mnt+0x135/0x190
__cleanup_mnt+0x12/0x20
task_work_run+0x64/0xb0
__prepare_exit_to_usermode+0x1bc/0x1c0
__syscall_return_slowpath+0x47/0x230
do_syscall_64+0x64/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
The code:
ASSERT(atomic_read(&block_group->count) == 1);
btrfs_put_block_group(block_group);
[CAUSE]
Obviously it's some btrfs_get_block_group() call doesn't get its put
call.
The offending btrfs_get_block_group() happens here:
void btrfs_mark_bg_unused(struct btrfs_block_group *bg)
{
if (list_empty(&bg->bg_list)) {
btrfs_get_block_group(bg);
list_add_tail(&bg->bg_list, &fs_info->unused_bgs);
}
}
So every call sites removing the block group from unused_bgs list should
reduce the ref count of that block group.
However for async discard, it didn't follow the call convention:
void btrfs_discard_punt_unused_bgs_list(struct btrfs_fs_info *fs_info)
{
list_for_each_entry_safe(block_group, next, &fs_info->unused_bgs,
bg_list) {
list_del_init(&block_group->bg_list);
btrfs_discard_queue_work(&fs_info->discard_ctl, block_group);
}
}
And in btrfs_discard_queue_work(), it doesn't call
btrfs_put_block_group() either.
[FIX]
Fix the problem by reducing the reference count when we grab the block
group from unused_bgs list.
Reported-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Fixes: 6e80d4f8c4 ("btrfs: handle empty block_group removal for async discard")
CC: stable@vger.kernel.org # 5.6+
Tested-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When building a kernel with CONFIG_PSTORE=y and CONFIG_CRYPTO not set,
a build error happens:
ld: fs/pstore/platform.o: in function `pstore_dump':
platform.c:(.text+0x3f9): undefined reference to `crypto_comp_compress'
ld: fs/pstore/platform.o: in function `pstore_get_backend_records':
platform.c:(.text+0x784): undefined reference to `crypto_comp_decompress'
This because some pstore code uses crypto_comp_(de)compress regardless
of the CONFIG_CRYPTO status. Fix it by wrapping the (de)compress usage
by IS_ENABLED(CONFIG_PSTORE_COMPRESS)
Signed-off-by: Matteo Croce <mcroce@linux.microsoft.com>
Link: https://lore.kernel.org/lkml/20200706234045.9516-1-mcroce@linux.microsoft.com
Fixes: cb3bee0369 ("pstore: Use crypto compress API")
Cc: stable@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Make sure iomap_end is always called when iomap_begin succeeds.
Without this fix, iomap_end won't be called when a filesystem's
iomap_begin operation returns an invalid mapping, bypassing any
unlocking done in iomap_end. With this fix, the unlocking will still
happen.
This bug was found by Bob Peterson during code review. It's unlikely
that such iomap_begin bugs will survive to affect users, so backporting
this fix seems unnecessary.
Fixes: ae259a9c85 ("fs: introduce iomap infrastructure")
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Similar to inodes, we can call the dquot IO completion functions
directly from the buffer completion code, removing another user of
log item callbacks for IO completion processing.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Having different io completion callbacks for different inode states
makes things complex. We can detect if the inode is stale via the
XFS_ISTALE flag in IO completion, so we don't need a special
callback just for this.
This means inodes only have a single iodone callback, and inode IO
completion is entirely buffer centric at this point. Hence we no
longer need to use a log item callback at all as we can just call
xfs_iflush_done() directly from the buffer completions and walk the
buffer log item list to complete the all inodes under IO.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
When we've emptied the buffer log item list, it does a list_del_init
on itself to reset it's pointers to itself. This is unnecessary as
the list is already empty at this point - it was a left-over
fragment from the list_head conversion of the buffer log item list.
Remove them.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
All unmarked dirty buffers should be in the AIL and have log items
attached to them. Hence when they are written, we will run a
callback to remove the item from the AIL if appropriate. Now that
we've handled inode and dquot buffers, all remaining calls are to
xfs_buf_iodone() and so we can hard code this rather than use an
indirect call.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Log recovery has it's own buffer write completion handler for
buffers that it directly recovers. Convert these to direct calls by
flagging these buffers as being log recovery buffers. The flag will
get cleared by the log recovery IO completion routine, so it will
never leak out of log recovery.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
dquot buffers always have write IO callbacks, so by marking them
directly we can avoid needing to attach ->b_iodone functions to
them. This avoids an indirect call, and makes future modifications
much simpler.
This is largely a rearrangement of the code at this point - no IO
completion functionality changes at this point, just how the
code is run is modified.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Inode buffers always have write IO callbacks, so by marking them
directly we can avoid needing to attach ->b_iodone functions to
them. This avoids an indirect call, and makes future modifications
much simpler.
While this is largely a refactor of existing functionality, we
broaden the scope of the flag to beyond where inodes are explicitly
attached because future changes need to know what type of log items
are attached to the buffer. Adding this buffer flag may invoke the
inode iodone callback in cases where it wouldn't have been
previously, but this is not a functional change because the callback
is identical to the normal buffer write iodone callback when inodes
are not attached.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The inode log item is kind of special in that it can be aggregating
new changes in memory at the same time time existing changes are
being written back to disk. This means there are fields in the log
item that are accessed concurrently from contexts that don't share
any locking at all.
e.g. updating ili_last_fields occurs at flush time under the
ILOCK_EXCL and flush lock at flush time, under the flush lock at IO
completion time, and is read under the ILOCK_EXCL when the inode is
logged. Hence there is no actual serialisation between reading the
field during logging of the inode in transactions vs clearing the
field in IO completion.
We currently get away with this by the fact that we are only
clearing fields in IO completion, and nothing bad happens if we
accidentally log more of the inode than we actually modify. Worst
case is we consume a tiny bit more memory and log bandwidth.
However, if we want to do more complex state manipulations on the
log item that requires updates at all three of these potential
locations, we need to have some mechanism of serialising those
operations. To do this, introduce a spinlock into the log item to
serialise internal state.
This could be done via the xfs_inode i_flags_lock, but this then
leads to potential lock inversion issues where inode flag updates
need to occur inside locks that best nest inside the inode log item
locks (e.g. marking inodes stale during inode cluster freeing).
Using a separate spinlock avoids these sorts of problems and
simplifies future code.
This does not touch the use of ili_fields in the item formatting
code - that is entirely protected by the ILOCK_EXCL at this point in
time, so it remains untouched.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
This was used to track if the item had logged fields being flushed
to disk. We log everything in the inode these days, so this logic is
no longer needed. Remove it.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
In tracking down a problem in this patchset, I discovered we are
reclaiming dirty stale inodes. This wasn't discovered until inodes
were always attached to the cluster buffer and then the rcu callback
that freed inodes was assert failing because the inode still had an
active pointer to the cluster buffer after it had been reclaimed.
Debugging the issue indicated that this was a pre-existing issue
resulting from the way the inodes are handled in xfs_inactive_ifree.
When we free a cluster buffer from xfs_ifree_cluster, all the inodes
in cache are marked XFS_ISTALE. Those that are clean have nothing
else done to them and so eventually get cleaned up by background
reclaim. i.e. it is assumed we'll never dirty/relog an inode marked
XFS_ISTALE.
On journal commit dirty stale inodes as are handled by both
buffer and inode log items to run though xfs_istale_done() and
removed from the AIL (buffer log item commit) or the log item will
simply unpin it because the buffer log item will clean it. What happens
to any specific inode is entirely dependent on which log item wins
the commit race, but the result is the same - stale inodes are
clean, not attached to the cluster buffer, and not in the AIL. Hence
inode reclaim can just free these inodes without further care.
However, if the stale inode is relogged, it gets dirtied again and
relogged into the CIL. Most of the time this isn't an issue, because
relogging simply changes the inode's location in the current
checkpoint. Problems arise, however, when the CIL checkpoints
between two transactions in the xfs_inactive_ifree() deferops
processing. This results in the XFS_ISTALE inode being redirtied
and inserted into the CIL without any of the other stale cluster
buffer infrastructure being in place.
Hence on journal commit, it simply gets unpinned, so it remains
dirty in memory. Everything in inode writeback avoids XFS_ISTALE
inodes so it can't be written back, and it is not tracked in the AIL
so there's not even a trigger to attempt to clean the inode. Hence
the inode just sits dirty in memory until inode reclaim comes along,
sees that it is XFS_ISTALE, and goes to reclaim it. This reclaiming
of a dirty inode caused use after free, list corruptions and other
nasty issues later in this patchset.
Hence this patch addresses a violation of the "never log XFS_ISTALE
inodes" caused by the deferops processing rolling a transaction
and relogging a stale inode in xfs_inactive_free. It also adds a
bunch of asserts to catch this problem in debug kernels so that
we don't reintroduce this problem in future.
Reproducer for this issue was generic/558 on a v4 filesystem.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Remove current_pid(), current_test_flags() and
current_clear_flags_nested(), because they are useless.
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The page faultround path ->map_pages is implemented in XFS via
filemap_map_pages(). This function checks that pages found in page
cache lookups have not raced with truncate based invalidation by
checking page->mapping is correct and page->index is within EOF.
However, we've known for a long time that this is not sufficient to
protect against races with invalidations done by operations that do
not change EOF. e.g. hole punching and other fallocate() based
direct extent manipulations. The way we protect against these
races is we wrap the page fault operations in a XFS_MMAPLOCK_SHARED
lock so they serialise against fallocate and truncate before calling
into the filemap function that processes the fault.
Do the same for XFS's ->map_pages implementation to close this
potential data corruption issue.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Move the double-inode locking helpers to xfs_inode.c since they're not
specific to reflink.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Refactor the two functions that we use to lock and unlock two inodes to
block userspace from initiating IO against a file, whether via system
calls or mmap activity.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Fix the return value of xfs_reflink_remap_prep so that its return value
conventions match the rest of xfs.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
If the source and destination map are identical, we can skip the remap
step to save some time.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
When logging quota block count updates during a reflink operation, we
only log the /delta/ of the block count changes to the dquot. Since we
now know ahead of time the extent type of both dmap and smap (and that
they have the same length), we know that we only need to reserve quota
blocks for dmap's blockcount if we're mapping it into a hole.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Now that we've reworked xfs_reflink_remap_extent to remap only one
extent per transaction, we actually know if the extent being removed is
an allocated mapping. This means that we now know ahead of time if
we're going to be touching the data fork.
Since we only need blocks for a bmbt split if we're going to update the
data fork, we only need to get quota reservation if we know we're going
to touch the data fork.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
The existing reflink remapping loop has some structural problems that
need addressing:
The biggest problem is that we create one transaction for each extent in
the source file without accounting for the number of mappings there are
for the same range in the destination file. In other words, we don't
know the number of remap operations that will be necessary and we
therefore cannot guess the block reservation required. On highly
fragmented filesystems (e.g. ones with active dedupe) we guess wrong,
run out of block reservation, and fail.
The second problem is that we don't actually use the bmap intents to
their full potential -- instead of calling bunmapi directly and having
to deal with its backwards operation, we could call the deferred ops
xfs_bmap_unmap_extent and xfs_refcount_decrease_extent instead. This
makes the frontend loop much simpler.
Solve all of these problems by refactoring the remapping loops so that
we only perform one remapping operation per transaction, and each
operation only tries to remap a single extent from source to dest.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reported-by: Edwin Török <edwin@etorok.net>
Tested-by: Edwin Török <edwin@etorok.net>
The name of this predicate is a little misleading -- it decides if the
extent mapping is allocated and written. Change the name to be more
direct, as we're going to add a new predicate in the next patch.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Quota reservations are supposed to account for the blocks that might be
allocated due to a bmap btree split. Reflink doesn't do this, so fix
this to make the quota accounting more accurate before we start
rearranging things.
Fixes: 862bb360ef ("xfs: reflink extents from one file to another")
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
The data fork scrubber calls filemap_write_and_wait to flush dirty pages
and delalloc reservations out to disk prior to checking the data fork's
extent mappings. Unfortunately, this means that scrub can consume the
EIO/ENOSPC errors that would otherwise have stayed around in the address
space until (we hope) the writer application calls fsync to persist data
and collect errors. The end result is that programs that wrote to a
file might never see the error code and proceed as if nothing were
wrong.
xfs_scrub is not in a position to notify file writers about the
writeback failure, and it's only here to check metadata, not file
contents. Therefore, if writeback fails, we should stuff the error code
back into the address space so that an fsync by the writer application
can pick that up.
Fixes: 99d9d8d05d ("xfs: scrub inode block mappings")
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
The rmapbt extent swap algorithm remaps individual extents between
the source inode and the target to trigger reverse mapping metadata
updates. If either inode straddles a format or other bmap allocation
boundary, the individual unmap and map cycles can trigger repeated
bmap block allocations and frees as the extent count bounces back
and forth across the boundary. While net block usage is bound across
the swap operation, this behavior can prematurely exhaust the
transaction block reservation because it continuously drains as the
transaction rolls. Each allocation accounts against the reservation
and each free returns to global free space on transaction roll.
The previous workaround to this problem attempted to detect this
boundary condition and provide surplus block reservation to
acommodate it. This is insufficient because more remaps can occur
than implied by the extent counts; if start offset boundaries are
not aligned between the two inodes, for example.
To address this problem more generically and dynamically, add a
transaction accounting mode that returns freed blocks to the
transaction reservation instead of the superblock counters on
transaction roll and use it when the rmapbt based algorithm is
active. This allows the chain of remap transactions to preserve the
block reservation based own its own frees and prevent premature
exhaustion regardless of the remap pattern. Note that this is only
safe for superblocks with lazy sb accounting, but the latter is
required for v5 supers and the rmap feature depends on v5.
Fixes: b3fed43482 ("xfs: account format bouncing into rmapbt swapext tx reservation")
Root-caused-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
It's not nice to hold @uring_lock for too long io_iopoll_reap_events().
For instance, the lock is needed to publish requests to @poll_list, and
that locks out tasks doing that for no good reason. Loose it
occasionally.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Nobody adjusts *nr_events (number of completed requests) before calling
io_iopoll_getevents(), so the passed @min shouldn't be adjusted as well.
Othewise it can return less than initially asked @min without hitting
need_resched().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
->iopoll() may have completed current request, but instead of reaping
it, io_do_iopoll() just continues with the next request in the list.
As a result it can leave just polled and completed request in the list
up until next syscall. Even outer loop in io_iopoll_getevents() doesn't
help the situation.
E.g. poll_list: req0 -> req1
If req0->iopoll() completed both requests, and @min<=1,
then @req0 will be left behind.
Check whether a req was completed after ->iopoll().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Don't forget to fill cqe->flags properly in io_submit_flush_completions()
Fixes: a1d7c393c4 ("io_uring: enable READ/WRITE to use deferred completions")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
A preparation path, extracts error path into a separate block. It looks
saner then calling req_set_fail_links() after io_put_req_find_next(), even
though it have been working well.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_prep_linked_timeout() sets REQ_F_LINK_TIMEOUT altering refcounting of
the following linked request. After that someone should call
io_queue_linked_timeout(), otherwise a submission reference of the linked
timeout won't be ever dropped.
That's what happens in io_steal_work() if io-wq decides to postpone linked
request with io_wqe_enqueue(). io_queue_linked_timeout() can also be
potentially called twice without synchronisation during re-submission,
e.g. io_rw_resubmit().
There are the rules, whoever did io_prep_linked_timeout() must also call
io_queue_linked_timeout(). To not do it twice, io_prep_linked_timeout()
will return non NULL only for the first call. That's controlled by
REQ_F_LINK_TIMEOUT flag.
Also kill REQ_F_QUEUE_TIMEOUT.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Since we now have that in the 5.9 branch, convert the existing users of
task_work_add() to use this new helper.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Provide a helper to run task_work instead of checking and running
manually in a bunch of different spots. While doing so, also move the
task run state setting where we run the task work. Then we can move it
out of the callback helpers. This also helps ensure we only do this once
per task_work list run, not per task_work item.
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull in task_work changes from the 5.8 series, as we'll need to apply
the same kind of changes to other parts in the 5.9 branch.
* io_uring-5.8:
io_uring: fix regression with always ignoring signals in io_cqring_wait()
io_uring: use signal based task_work running
task_work: teach task_work_add() to do signal_wake_up()
Rationale:
Reduces attack surface on kernel devs opening the links for MITM
as HTTPS traffic is much harder to manipulate.
Deterministic algorithm:
For each file:
If not .svg:
For each line:
If doesn't contain `\bxmlns\b`:
For each link, `\bhttp://[^# \t\r\n]*(?:\w|/)`:
If both the HTTP and HTTPS versions
return 200 OK and serve the same content:
Replace HTTP with HTTPS.
Signed-off-by: Alexander A. Klimov <grandmaster@al2klimov.de>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Link: https://lore.kernel.org/r/20200627103125.71828-1-grandmaster@al2klimov.de
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl8BDx4QHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpvhzD/4rxzJsn6ukrsxMXFaKIrjZ/hkcRJIMNozz
YWu4PwcDvszvZu66MeAu0tnCttzxlIgP8oCm6cx9ImMQwkYIVbV0q1XJ3wmzUQpZ
pEDW4j0j8hgcLhfZH9ojUAkTP8TnltakxkrwC6egUvnT0vuKDUy5ISbkl4uxWYpH
p4Dq7ASqy8xjtzac/VLTSzBgzhTMSic5NMJY21md9eAaFB1vYBmDyHB3O1bEk4kw
pvWGFm7a4qssnAB61SMfq3nWQ9UA0+XX4a+CWEzJIMqj4H6UpjOCQU23X1AlaLJX
ILeq26PwoZQF8cS4D83tMnmPWz1LqslBgnUuAGCVLsT7omvhDLM75iFBpMzWglLu
No8TlxLZ+Dga04vpjeEptWqSfUS6K879cNJuFGjadBogq06SImIVDHXXTrPhtCGg
B9+uFHkOUlIkjM5h2zqdkmhnbf0sWodowIrx7+aL294QVlqnY0uBR9eh6+CSKT+h
PhJ+FhN+N6B1dTyryaO5hMjyg0h4ZpvIMT3HBpNXtnRVlUT2+OYN3g5HHt6z//Rp
eeJTh7pnY7uT60c8x96kySwQIydXSKBI+7ysLlntgiyvutbzaC5Fq7/f1YTWyNVk
zqM/+FuJUsstu0y/GBEDpglpL1+S9VjNcJUDpUMUKwCAkh7TnI/ATo1rn9GiM1n1
SQZ4HcaCYw==
=Uawr
-----END PGP SIGNATURE-----
Merge tag 'io_uring-5.8-2020-07-05' of git://git.kernel.dk/linux-block
Pull io_uring fix from Jens Axboe:
"Andres reported a regression with the fix that was merged earlier this
week, where his setup of using signals to interrupt io_uring CQ waits
no longer worked correctly.
Fix this, and also limit our use of TWA_SIGNAL to the case where we
need it, and continue using TWA_RESUME for task_work as before.
Since the original is marked for 5.7 stable, let's flush this one out
early"
* tag 'io_uring-5.8-2020-07-05' of git://git.kernel.dk/linux-block:
io_uring: fix regression with always ignoring signals in io_cqring_wait()
When switching to TWA_SIGNAL for task_work notifications, we also made
any signal based condition in io_cqring_wait() return -ERESTARTSYS.
This breaks applications that rely on using signals to abort someone
waiting for events.
Check if we have a signal pending because of queued task_work, and
repeat the signal check once we've run the task_work. This provides a
reliable way of telling the two apart.
Additionally, only use TWA_SIGNAL if we are using an eventfd. If not,
we don't have the dependency situation described in the original commit,
and we can get by with just using TWA_RESUME like we previously did.
Fixes: ce593a6c48 ("io_uring: use signal based task_work running")
Cc: stable@vger.kernel.org # v5.7
Reported-by: Andres Freund <andres@anarazel.de>
Tested-by: Andres Freund <andres@anarazel.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Now that the last callser has been removed remove this code from exec.
For anyone thinking of resurrecing do_execve_file please note that
the code was buggy in several fundamental ways.
- It did not ensure the file it was passed was read-only and that
deny_write_access had been called on it. Which subtlely breaks
invaniants in exec.
- The caller of do_execve_file was expected to hold and put a
reference to the file, but an extra reference for use by exec was
not taken so that when exec put it's reference to the file an
underflow occured on the file reference count.
- The point of the interface was so that a pathname did not need to
exist. Which breaks pathname based LSMs.
Tetsuo Handa originally reported these issues[1]. While it was clear
that deny_write_access was missing the fundamental incompatibility
with the passed in O_RDWR filehandle was not immediately recognized.
All of these issues were fixed by modifying the usermode driver code
to have a path, so it did not need this hack.
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
[1] https://lore.kernel.org/linux-fsdevel/2a8775b4-1dd5-9d5c-aa42-9872445e0942@i-love.sakura.ne.jp/
v1: https://lkml.kernel.org/r/871rm2f0hi.fsf_-_@x220.int.ebiederm.org
v2: https://lkml.kernel.org/r/87lfk54p0m.fsf_-_@x220.int.ebiederm.org
Link: https://lkml.kernel.org/r/20200702164140.4468-10-ebiederm@xmission.com
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Tested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Pull sysctl fix from Al Viro:
"Another regression fix for sysctl changes this cycle..."
* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
Call sysctl_head_finish on error
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAl7/s1YACgkQiiy9cAdy
T1GhzgwAqARAg1iCUEDjyy7VEZ9HNA3X87GxM7zkid5Fz2WTDlHLBQL6LWZkLODK
PIz8IP4V3DoBddN2DGlqIiCZmCMDn2bBN+6u1O2TkR2lv2w3ASxzwYSMQWqUUw6U
a03BkDZNFE4fJq5pPDdVaVzDss4tuNKW8N5RvptRqlbLp74SRUgMjVyyWwN4UunW
AHH3VqRCWJJj6Yp6MAx3rtoEiAtjTt9Ej3Fb2MXdF5jZObzI3LOY13Z09QIWbE3P
Sh7Py66CSG7UYYkQqoe43zYwxeOgo6FAYWxIULTPJYdFIi5+RHPQ0SYc6+BHfDRo
AHMchJpwZ/j4JOeIJGDItuUQPVnwYAOZ+75s7ofhAbG95kwcfs+AkDoLqkM8IWpu
LS5rHi7sOA4GK8Hio9xp+MgttsmXRcnBQ4ShBoTaDBKa7v/NeRAPolsD5FgZWunO
CKRDsDD5hKO2bQsJk4te35/IQpxRpEiiGmMpyaNUdaCdhXxcPHCYEWYdw9EnTP6i
1xc7au/u
=laR3
-----END PGP SIGNATURE-----
Merge tag '5.8-rc3-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fixes from Steve French:
"Eight cifs/smb3 fixes, most when specifying the multiuser mount flag.
Five of the fixes are for stable"
* tag '5.8-rc3-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
cifs: prevent truncation from long to int in wait_for_free_credits
cifs: Fix the target file was deleted when rename failed.
SMB3: Honor 'posix' flag for multiuser mounts
SMB3: Honor 'handletimeout' flag for multiuser mounts
SMB3: Honor lease disabling for multiuser mounts
SMB3: Honor persistent/resilient handle flags for multiuser mounts
SMB3: Honor 'seal' flag for multiuser mounts
cifs: Display local UID details for SMB sessions in DebugData
- Fix a use-after-free bug when the fs shuts down.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAl77c9oACgkQ+H93GTRK
tOvYow/+O61y32kJzJ6xJ5QoGE5K10CXp4cpBwOgG/6POERIDBirgAMqKAx8rEu2
go36qUVzwyveaB4iyuIkw5K+odZbHpGiuQqiGu8Aw4XBEAEhDqPPsnHhllSi/VOq
4AjvfYefmj0ALQay1pzGpR2h5+03JwOw0ZFcmBl5QSTMLwZQZ1PoU8ujiYPSxsUr
m9dcGZtGU16mmDgORzYTDnSYSKruhJDSD5IxsID+QST0wK4MuPkJr0T+ZQmziWb1
xmHE1aTGDZcrYG4+x2Pzop822mrnuMBnMaX8KOOtiZAhtKb19sAf0OUWBkWOQYvb
Vk3mIDDz910vd/szw3B5KMVNkeYoRtHAQztpLyfJ3Gtxmt5g/4ZX2fEX/KWJbV5D
82GLf4gB4GhTQFUwcTmgtmCCd87aH0ABiCEnURN6R04tOCXgc7bYn0XXvZL6Axd/
25bkhlDdmOfveAZrZ1WKWSEYN/at9R5iqYsbWH1FmoE6h82OvZDxweB7P4r66KwT
pMhYRqKrRrwNtarPmn6bC/8Ci5h6vl8MOP3+mTYx2eXU0aFe8OCYpPNFsSufxx+e
hHTHUKxCSQDfYHbIqXVF5F8G7msh4ue+687UIo7seWyzTU4PaphH0kaHKV/+fiY1
zU9+GNjk0iVZy6cG+25ZcMPQmYv7qdrvm6XuxciywFim7l2wq0s=
=BnJO
-----END PGP SIGNATURE-----
Merge tag 'xfs-5.8-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs fix from Darrick Wong:
"Fix a use-after-free bug when the fs shuts down"
* tag 'xfs-5.8-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: fix use-after-free on CIL context on shutdown
-----BEGIN PGP SIGNATURE-----
iQJIBAABCAAyFiEEJZs3krPW0xkhLMTc1b+f6wMTZToFAl7/A8kUHGFncnVlbmJh
QHJlZGhhdC5jb20ACgkQ1b+f6wMTZToR8A//RI5c1W9De9ST6ITDsVLVpKX7ugRl
Zo9ORCRgnnBuJFGQ6VtPuren9Rd6nxIOwf8MeFHjkgapIcXVM/f81Hx7WTUS7KOP
dakeOOWPw1Ue3hcnpxjT4fTgZo43u1VoKbFGMpUPK0zvGrYh8fa4euAhALGIUB0X
jc12mX/FdpQY+9YurY45GJ186tC0aZp1kEK+mA5apPuIA+9pUqbPIt6tmLV1wRry
2a2fTZXI0MN0/u4ZhHSFcJVj4k6xdLLQodJHW+FAPh50vHf7W++DL6mJ5V6Svp8N
vaWUdTbLh8wp/IxD+G4tnPk2BFKvGRwXZV9ZmrivxcwRKc4bgA878yjU6Ox4B0XB
EzXcTPeO0zGHtXBv49Pig0dZDllgvlDL/us2rnt+APZ99yWsX8GzJ1BDP4gT4var
3IuNxyvpeAth9zth/aOzS2pE50yano64CQZq/NsuvAz4MTI8zms/o2ioH/aPbwfk
DFkAQHzlNPZlbt6jF6WJxF4R0Mrs8g1oTF5eH0sIpfMUxtq0Ay9dlHDEY1yjl/Jw
2KBuKLF6P6IZN3aSolawY29dcnMp5vbRDLdbqNzz2/DzAQmqjd4IcY0gK/nHTc8i
2TLyffQ6YTdEDXKIDqfPNDxjcYLKd1XCIyogXt6Rw57C3YwIbDY/drPtFUpUCUFI
Oaf4j8M9W+gamzU=
=GQwd
-----END PGP SIGNATURE-----
Merge tag 'gfs2-v5.8-rc3.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2
Pull gfs2 fixes from Andreas Gruenbacher:
"Various gfs2 fixes"
* tag 'gfs2-v5.8-rc3.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
gfs2: The freeze glock should never be frozen
gfs2: When freezing gfs2, use GL_EXACT and not GL_NOCACHE
gfs2: read-only mounts should grab the sd_freeze_gl glock
gfs2: freeze should work on read-only mounts
gfs2: eliminate GIF_ORDERED in favor of list_empty
gfs2: Don't sleep during glock hash walk
gfs2: fix trans slab error when withdraw occurs inside log_flush
gfs2: Don't return NULL from gfs2_inode_lookup
This error path returned directly instead of calling sysctl_head_finish().
Fixes: ef9d965bc8 ("sysctl: reject gigantic reads/write to sysctl files")
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Before this patch, some gfs2 code locked the freeze glock with LM_FLAG_NOEXP
(Do not freeze) flag, and some did not. We never want to freeze the freeze
glock, so this patch makes it consistently use LM_FLAG_NOEXP always.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Before this patch, the freeze code in gfs2 specified GL_NOCACHE in
several places. That's wrong because we always want to know the state
of whether the file system is frozen.
There was also a problem with freeze/thaw transitioning the glock from
frozen (EX) to thawed (SH) because gfs2 will normally grant glocks in EX
to processes that request it in SH mode, unless GL_EXACT is specified.
Therefore, the freeze/thaw code, which tried to reacquire the glock in
SH mode would get the glock in EX mode, and miss the transition from EX
to SH. That made it think the thaw had completed normally, but since the
glock was still cached in EX, other nodes could not freeze again.
This patch removes the GL_NOCACHE flag to allow the freeze glock to be
cached. It also adds the GL_EXACT flag so the glock is fully transitioned
from EX to SH, thereby allowing future freeze operations.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Before this patch, only read-write mounts would grab the freeze
glock in read-only mode, as part of gfs2_make_fs_rw. So the freeze
glock was never initialized. That meant requests to freeze, which
request the glock in EX, were granted without any state transition.
That meant you could mount a gfs2 file system, which is currently
frozen on a different cluster node, in read-only mode.
This patch makes read-only mounts lock the freeze glock in SH mode,
which will block for file systems that are frozen on another node.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Before this patch, function freeze_go_sync, called when promoting
the freeze glock, was testing for the SDF_JOURNAL_LIVE superblock flag.
That's only set for read-write mounts. Read-only mounts don't use a
journal, so the bit is never set, so the freeze never happened.
This patch removes the check for SDF_JOURNAL_LIVE for freeze requests
but still checks it when deciding whether to flush a journal.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
In several places, we used the GIF_ORDERED inode flag to determine
if an inode was on the ordered writes list. However, since we always
held the sd_ordered_lock spin_lock during the manipulation, we can
just as easily check list_empty(&ip->i_ordered) instead.
This allows us to keep more than one ordered writes list to make
journal writing improvements.
This patch eliminates GIF_ORDERED in favor of checking list_empty.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
leak and a module unloading bug in the /proc/fs/nfsd/clients/ code, and
a compile warning.
-----BEGIN PGP SIGNATURE-----
iQJJBAABCAAzFiEEYtFWavXG9hZotryuJ5vNeUKO4b4FAl79+IoVHGJmaWVsZHNA
ZmllbGRzZXMub3JnAAoJECebzXlCjuG+6I8P/24e9/W50SUBsQYseG2BpwjR/RsQ
YjMbqrf1XOxo+8axNpdbe0bhq2jWiyQz0esnF33RlztlDSJmSNJfueWDSezKzKwC
o8afQx0qJaVZUsT/XAXa2Hk2OZd2ZYF6f3DGMiz+knBGdAzSwJjpgqhzocMCQ3Hr
t/PG6DJazLB3VDIe1VziTet2uv52A0A+uBYKguK/QPlpae2uXKFJ8U7v6wCsU395
Sqd2/X2KGbeYoCrWsmpvdCDVeNmAbI0KlhY8pR6BHqGp7TYm4+AueqWzpYHlNHei
PukM8AROoTBEAc6Wiqqmp0UKRR+Qn/9NIuvQtvBnC6WGIPjEG1hTkAwlRfT6VYvn
oPg4oekKjRJLz/TSaqfJRpli5GwxfWAW14LTZZT+Xe0/7FhVe28/R8F1dP5ZJaeq
h9//4rCt/yUYAQq1odOMbNCr0rGVcKzdSN3E36OvJFVQ9bMyXHKetKHywOki13w5
M8UQK5zb21ghT7OSICmeRXHqsXRmTFO8QhUZ8L63Qb2hfiQ5fVQdSiHmM8iRcwWY
bxqrSs8YV7i+I0i1YYTYWmmFgP8Y11sL7ovAEs86cP2Rk58Bk5VA2TPT114W53AD
xaZHpjsH0AfZS87dEVdvS2/dAdtbHZsFwHxGnfvyl/CKTqoz5yY5etcwULQI3+XQ
8kG8FOFpt7T/5zmB
=wF3M
-----END PGP SIGNATURE-----
Merge tag 'nfsd-5.8-1' of git://linux-nfs.org/~bfields/linux
Pull nfsd fixes from Bruce Fields:
"Fixes for a umask bug on exported filesystems lacking ACL support, a
leak and a module unloading bug in the /proc/fs/nfsd/clients/ code,
and a compile warning"
* tag 'nfsd-5.8-1' of git://linux-nfs.org/~bfields/linux:
SUNRPC: Add missing definition of ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE
nfsd: fix nfsdfs inode reference count leak
nfsd4: fix nfsdfs reference count loop
nfsd: apply umask on fs without ACL support
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl79YU0QHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgplHKD/9rgv0c1I7dCh6MgQKxT+2z/eZcaPO3PekW
sbn8yC8RiSIL85Av1zEfC1wAp+Mp21QlFKXFiZ6BJj5bdDbbshLk0WdbnxvuM+9I
gyngTI/+em5D/WCcetAkPjnMTDq0m4l0UXd91fyNAeErmYZbvhL5dXihZsBJ3T9c
Bprn4RzWwrUsUwGn8qIEZhx2UovMrzXJHGFxWXh/81YHkh7Y4mjvATKxtECIliW/
+QQJDU7Tf3gZw+ETPIDOEB9Hl9c9W+9fcWWzmrXzViUyy54IMbF4qyJpWcGaRh6c
sO3apymwu7wwAUbQcE8IWr3ZLZDtw68AgUdZ5b/T0c2fEwqsI/UDMhBbELiuqcT0
MAoQdUSNNqZTti0PX5vg5CQlCFzjnl2uIwHF6LVSbrqgyqxiC3Qrus/FYSaf3x9h
bAmNgWC9DeKp/wtEKMuBXaOm7RjrEutD5hjJYfVK/AkvKTZyZDx3vZ9FRH8WtrII
7KhUI3DPSZCeWlcpDtK+0fEqtqTw6OtCQ8U5vKSnJjoRSXLUtuk6IYbp/tqNxwe/
0d+U6R+w513jVlXARUP48mV7tzpESp2MLP6Nd2Is/OD5tePWzQEZinpKzsFP4djH
d2PT5FFGPCw9yBk03sI1Je/CFqVYwCGqav6h8dKKVBanMjoEdL4U1PMhI48Zua+9
M8pqRHoeDA==
=4lvI
-----END PGP SIGNATURE-----
Merge tag 'io_uring-5.8-2020-07-01' of git://git.kernel.dk/linux-block
Pull io_uring fixes from Jens Axboe:
"One fix in here, for a regression in 5.7 where a task is waiting in
the kernel for a condition, but that condition won't become true until
task_work is run. And the task_work can't be run exactly because the
task is waiting in the kernel, so we'll never make any progress.
One example of that is registering an eventfd and queueing io_uring
work, and then the task goes and waits in eventfd read with the
expectation that it'll get woken (and read an event) when the io_uring
request completes. The io_uring request is finished through task_work,
which won't get run while the task is looping in eventfd read"
* tag 'io_uring-5.8-2020-07-01' of git://git.kernel.dk/linux-block:
io_uring: use signal based task_work running
task_work: teach task_work_add() to do signal_wake_up()
Eric reported an issue where mounting -o recovery with a fuzzed fs
resulted in a kernel panic. This is because we tried to free the tree
node, except it was an error from the read. Fix this by properly
resetting the tree_root->node == NULL in this case. The panic was the
following
BTRFS warning (device loop0): failed to read tree root
BUG: kernel NULL pointer dereference, address: 000000000000001f
RIP: 0010:free_extent_buffer+0xe/0x90 [btrfs]
Call Trace:
free_root_extent_buffers.part.0+0x11/0x30 [btrfs]
free_root_pointers+0x1a/0xa2 [btrfs]
open_ctree+0x1776/0x18a5 [btrfs]
btrfs_mount_root.cold+0x13/0xfa [btrfs]
? selinux_fs_context_parse_param+0x37/0x80
legacy_get_tree+0x27/0x40
vfs_get_tree+0x25/0xb0
fc_mount+0xe/0x30
vfs_kern_mount.part.0+0x71/0x90
btrfs_mount+0x147/0x3e0 [btrfs]
? cred_has_capability+0x7c/0x120
? legacy_get_tree+0x27/0x40
legacy_get_tree+0x27/0x40
vfs_get_tree+0x25/0xb0
do_mount+0x735/0xa40
__x64_sys_mount+0x8e/0xd0
do_syscall_64+0x4d/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Nik says: this is problematic only if we fail on the last iteration of
the loop as this results in init_tree_roots returning err value with
tree_root->node = -ERR. Subsequently the caller does: fail_tree_roots
which calls free_root_pointers on the bogus value.
Reported-by: Eric Sandeen <sandeen@redhat.com>
Fixes: b8522a1e5f ("btrfs: Factor out tree roots initialization during mount")
CC: stable@vger.kernel.org # 5.5+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add details how the pointer gets dereferenced ]
Signed-off-by: David Sterba <dsterba@suse.com>
Under somewhat convoluted conditions, it is possible to attempt to
release an extent_buffer that is under io, which triggers a BUG_ON in
btrfs_release_extent_buffer_pages.
This relies on a few different factors. First, extent_buffer reads done
as readahead for searching use WAIT_NONE, so they free the local extent
buffer reference while the io is outstanding. However, they should still
be protected by TREE_REF. However, if the system is doing signficant
reclaim, and simultaneously heavily accessing the extent_buffers, it is
possible for releasepage to race with two concurrent readahead attempts
in a way that leaves TREE_REF unset when the readahead extent buffer is
released.
Essentially, if two tasks race to allocate a new extent_buffer, but the
winner who attempts the first io is rebuffed by a page being locked
(likely by the reclaim itself) then the loser will still go ahead with
issuing the readahead. The loser's call to find_extent_buffer must also
race with the reclaim task reading the extent_buffer's refcount as 1 in
a way that allows the reclaim to re-clear the TREE_REF checked by
find_extent_buffer.
The following represents an example execution demonstrating the race:
CPU0 CPU1 CPU2
reada_for_search reada_for_search
readahead_tree_block readahead_tree_block
find_create_tree_block find_create_tree_block
alloc_extent_buffer alloc_extent_buffer
find_extent_buffer // not found
allocates eb
lock pages
associate pages to eb
insert eb into radix tree
set TREE_REF, refs == 2
unlock pages
read_extent_buffer_pages // WAIT_NONE
not uptodate (brand new eb)
lock_page
if !trylock_page
goto unlock_exit // not an error
free_extent_buffer
release_extent_buffer
atomic_dec_and_test refs to 1
find_extent_buffer // found
try_release_extent_buffer
take refs_lock
reads refs == 1; no io
atomic_inc_not_zero refs to 2
mark_buffer_accessed
check_buffer_tree_ref
// not STALE, won't take refs_lock
refs == 2; TREE_REF set // no action
read_extent_buffer_pages // WAIT_NONE
clear TREE_REF
release_extent_buffer
atomic_dec_and_test refs to 1
unlock_page
still not uptodate (CPU1 read failed on trylock_page)
locks pages
set io_pages > 0
submit io
return
free_extent_buffer
release_extent_buffer
dec refs to 0
delete from radix tree
btrfs_release_extent_buffer_pages
BUG_ON(io_pages > 0)!!!
We observe this at a very low rate in production and were also able to
reproduce it in a test environment by introducing some spurious delays
and by introducing probabilistic trylock_page failures.
To fix it, we apply check_tree_ref at a point where it could not
possibly be unset by a competing task: after io_pages has been
incremented. All the codepaths that clear TREE_REF check for io, so they
would not be able to clear it after this point until the io is done.
Stack trace, for reference:
[1417839.424739] ------------[ cut here ]------------
[1417839.435328] kernel BUG at fs/btrfs/extent_io.c:4841!
[1417839.447024] invalid opcode: 0000 [#1] SMP
[1417839.502972] RIP: 0010:btrfs_release_extent_buffer_pages+0x20/0x1f0
[1417839.517008] Code: ed e9 ...
[1417839.558895] RSP: 0018:ffffc90020bcf798 EFLAGS: 00010202
[1417839.570816] RAX: 0000000000000002 RBX: ffff888102d6def0 RCX: 0000000000000028
[1417839.586962] RDX: 0000000000000002 RSI: ffff8887f0296482 RDI: ffff888102d6def0
[1417839.603108] RBP: ffff88885664a000 R08: 0000000000000046 R09: 0000000000000238
[1417839.619255] R10: 0000000000000028 R11: ffff88885664af68 R12: 0000000000000000
[1417839.635402] R13: 0000000000000000 R14: ffff88875f573ad0 R15: ffff888797aafd90
[1417839.651549] FS: 00007f5a844fa700(0000) GS:ffff88885f680000(0000) knlGS:0000000000000000
[1417839.669810] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[1417839.682887] CR2: 00007f7884541fe0 CR3: 000000049f609002 CR4: 00000000003606e0
[1417839.699037] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[1417839.715187] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[1417839.731320] Call Trace:
[1417839.737103] release_extent_buffer+0x39/0x90
[1417839.746913] read_block_for_search.isra.38+0x2a3/0x370
[1417839.758645] btrfs_search_slot+0x260/0x9b0
[1417839.768054] btrfs_lookup_file_extent+0x4a/0x70
[1417839.778427] btrfs_get_extent+0x15f/0x830
[1417839.787665] ? submit_extent_page+0xc4/0x1c0
[1417839.797474] ? __do_readpage+0x299/0x7a0
[1417839.806515] __do_readpage+0x33b/0x7a0
[1417839.815171] ? btrfs_releasepage+0x70/0x70
[1417839.824597] extent_readpages+0x28f/0x400
[1417839.833836] read_pages+0x6a/0x1c0
[1417839.841729] ? startup_64+0x2/0x30
[1417839.849624] __do_page_cache_readahead+0x13c/0x1a0
[1417839.860590] filemap_fault+0x6c7/0x990
[1417839.869252] ? xas_load+0x8/0x80
[1417839.876756] ? xas_find+0x150/0x190
[1417839.884839] ? filemap_map_pages+0x295/0x3b0
[1417839.894652] __do_fault+0x32/0x110
[1417839.902540] __handle_mm_fault+0xacd/0x1000
[1417839.912156] handle_mm_fault+0xaa/0x1c0
[1417839.921004] __do_page_fault+0x242/0x4b0
[1417839.930044] ? page_fault+0x8/0x30
[1417839.937933] page_fault+0x1e/0x30
[1417839.945631] RIP: 0033:0x33c4bae
[1417839.952927] Code: Bad RIP value.
[1417839.960411] RSP: 002b:00007f5a844f7350 EFLAGS: 00010206
[1417839.972331] RAX: 000000000000006e RBX: 1614b3ff6a50398a RCX: 0000000000000000
[1417839.988477] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000002
[1417840.004626] RBP: 00007f5a844f7420 R08: 000000000000006e R09: 00007f5a94aeccb8
[1417840.020784] R10: 00007f5a844f7350 R11: 0000000000000000 R12: 00007f5a94aecc79
[1417840.036932] R13: 00007f5a94aecc78 R14: 00007f5a94aecc90 R15: 00007f5a94aecc40
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
Convert fall through comments to the pseudo-keyword which is now the
preferred way.
Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The wait_event_... defines evaluate to long so we should not assign it an int as this may truncate
the value.
Reported-by: Marshall Midden <marshallmidden@gmail.com>
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
When xfstest generic/035, we found the target file was deleted
if the rename return -EACESS.
In cifs_rename2, we unlink the positive target dentry if rename
failed with EACESS or EEXIST, even if the target dentry is positived
before rename. Then the existing file was deleted.
We should just delete the target file which created during the
rename.
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Zhang Xiaoxu <zhangxiaoxu5@huawei.com>
Cc: stable@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
The flag from the primary tcon needs to be copied into the volume info
so that cifs_get_tcon will try to enable extensions on the per-user
tcon. At that point, since posix extensions must have already been
enabled on the superblock, don't try to needlessly adjust the mount
flags.
Fixes: ce558b0e17 ("smb3: Add posix create context for smb3.11 posix mounts")
Fixes: b326614ea2 ("smb3: allow "posix" mount option to enable new SMB311 protocol extensions")
Signed-off-by: Paul Aurich <paul@darkrain42.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Fixes: ca567eb2b3 ("SMB3: Allow persistent handle timeout to be configurable on mount")
Signed-off-by: Paul Aurich <paul@darkrain42.org>
CC: Stable <stable@vger.kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Without this:
- persistent handles will only be enabled for per-user tcons if the
server advertises the 'Continuous Availabity' capability
- resilient handles would never be enabled for per-user tcons
Signed-off-by: Paul Aurich <paul@darkrain42.org>
CC: Stable <stable@vger.kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Ensure multiuser SMB3 mounts use encryption for all users' tcons if the
mount options are configured to require encryption. Without this, only
the primary tcon and IPC tcons are guaranteed to be encrypted. Per-user
tcons would only be encrypted if the server was configured to require
encryption.
Signed-off-by: Paul Aurich <paul@darkrain42.org>
CC: Stable <stable@vger.kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
This is useful for distinguishing SMB sessions on a multiuser mount.
Signed-off-by: Paul Aurich <paul@darkrain42.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Instead just iterate over the inodes for the block device superblock.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Just use bd_disk->queue instead.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We can trivially calculate the block size from the inodes i_blkbits
variable. Use that instead of keeping two redundant copies of the
information in slightly different formats.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The loop to increase the initial block size doesn't really make any
sense, as the AND operation won't match for powers of two if it didn't
for the initial block size.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
All bios can get remapped if submitted to partitions. No need to
comment on that.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
- Zero out unused characters of FileName field to avoid a complaint from some fsck tool.
- Fix memory leak on error paths.
- Fix unnecessary VOL_DIRTY set when calling rmdir on non-empty directory.
- Call sync_filesystem() for read-only remount(Fix generic/452 test in xfstests)
- Add own fsync() to flush dirty metadata.
-----BEGIN PGP SIGNATURE-----
iQJMBAABCgA2FiEE6NzKS6Uv/XAAGHgyZwv7A1FEIQgFAl76ldAYHG5hbWphZS5q
ZW9uQHNhbXN1bmcuY29tAAoJEGcL+wNRRCEIjoEQAKYh1yAd9xYqYznwWKgRa76d
DyRfuDzIcgPoM8C00sys237OGhb2iXlyLAWQ1Ag6kIZkxjCPjMZ3Ma+piqi0sEvG
YXhrDdSkAstsbRiQ/Z/pFSFPBmI8wej64uMR1eZOtY5ms0VPtau3paX6JWBhiGZU
cmS3ggUFUvOlky9vKCRX2kaPSVyN+VpUMiGe2jfa8x5y6ZRLWPgkQwfVYk38O4zS
Z4x/UZiokfUXqrh5kPVWDxk96oWq2c+KLxmRawjEA9IOvgqs2ydbcAQnGx5fkHAO
d+aqLjo3XsMlN7dfB9xKhFjRrZL6MggU2Ptu/BoEb5RsyPUGk/wCYQMjAykeBLtT
VC+3tGQob3GEgeVdhogrPOhPCNv3Pxgl8XBigE8sDMtvdoqrHeP83i8fYCcUb3jY
ENjSaIZxD/kOtjf2nbgz6FDJhJQSsoFP+oKqndPc9umD5mM0Foj+NZ9cevdNvLsd
qqanWxbdfgI6iCSg8S8dJE4PTSI2o08MY+Nh+NA6MktIEOaQDy3ncXjk/XZ7oX42
4zMrvNvTX894vcpCDNaa+ZW1NVSTWaIf+saHRqqnsU6nouQL0VmsTK4SAGqtGeFb
vZobK4z8qy3uliKiGtjbc3DYA1gB9lJCKNCXLaFuCD6amAPufXWDeeVp8AzA5AVh
AqS+oQIoO8yCf9GyBAL7
=lu+d
-----END PGP SIGNATURE-----
Merge tag 'exfat-for-5.8-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat
Pull exfat fixes from Namjae Jeon:
- Zero out unused characters of FileName field to avoid a complaint
from some fsck tool.
- Fix memory leak on error paths.
- Fix unnecessary VOL_DIRTY set when calling rmdir on non-empty
directory.
- Call sync_filesystem() for read-only remount (Fix generic/452 test in
xfstests)
- Add own fsync() to flush dirty metadata.
* tag 'exfat-for-5.8-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat:
exfat: flush dirty metadata in fsync
exfat: move setting VOL_DIRTY over exfat_remove_entries()
exfat: call sync_filesystem for read-only remount
exfat: add missing brelse() calls on error paths
exfat: Set the unused characters of FileName field to the value 0000h
Since 5.7, we've been using task_work to trigger async running of
requests in the context of the original task. This generally works
great, but there's a case where if the task is currently blocked
in the kernel waiting on a condition to become true, it won't process
task_work. Even though the task is woken, it just checks whatever
condition it's waiting on, and goes back to sleep if it's still false.
This is a problem if that very condition only becomes true when that
task_work is run. An example of that is the task registering an eventfd
with io_uring, and it's now blocked waiting on an eventfd read. That
read could depend on a completion event, and that completion event
won't get trigged until task_work has been run.
Use the TWA_SIGNAL notification for task_work, so that we ensure that
the task always runs the work when queued.
Cc: stable@vger.kernel.org # v5.7
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There is a fancy bug, where exiting user task may not have ->mm,
that makes task_works to try to do kthread_use_mm(ctx->sqo_mm).
Don't do that if sqo_mm is NULL.
[ 290.460558] WARNING: CPU: 6 PID: 150933 at kernel/kthread.c:1238
kthread_use_mm+0xf3/0x110
[ 290.460579] CPU: 6 PID: 150933 Comm: read-write2 Tainted: G
I E 5.8.0-rc2-00066-g9b21720607cf #531
[ 290.460580] RIP: 0010:kthread_use_mm+0xf3/0x110
...
[ 290.460584] Call Trace:
[ 290.460584] __io_sq_thread_acquire_mm.isra.0.part.0+0x25/0x30
[ 290.460584] __io_req_task_submit+0x64/0x80
[ 290.460584] io_req_task_submit+0x15/0x20
[ 290.460585] task_work_run+0x67/0xa0
[ 290.460585] do_exit+0x35d/0xb70
[ 290.460585] do_group_exit+0x43/0xa0
[ 290.460585] get_signal+0x140/0x900
[ 290.460586] do_signal+0x37/0x780
[ 290.460586] __prepare_exit_to_usermode+0x126/0x1c0
[ 290.460586] __syscall_return_slowpath+0x3b/0x1c0
[ 290.460587] do_syscall_64+0x5f/0xa0
[ 290.460587] entry_SYSCALL_64_after_hwframe+0x44/0xa9
following with faults.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
gcc 9.2.0 compiles io_req_find_next() as a separate function leaving
the first REQ_F_LINK_HEAD fast check not inlined. Help it by splitting
out the check from the function.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Greatly simplify io_async_task_func() removing duplicated functionality
of __io_req_task_submit(). This do one extra spin lock/unlock for
cancelled poll case, but that shouldn't happen often.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_poll_task_func() hand-coded link submission forgetting to set
TASK_RUNNING, acquire mm, etc. Call existing helper for that,
i.e. __io_req_task_submit().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Actually, io_iopoll_queue() may have NULL ->mm, that's if SQ thread
didn't grabbed mm before doing iopoll. Don't fail reqs there, as after
recent changes it won't be punted directly but rather through task_work.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Avoid jumping through hoops to silence unused variable warnings, and
also fix sparse rightfully complaining about the locking context:
fs/io_uring.c:1593:39: warning: context imbalance in 'io_kill_linked_timeout' - unexpected unlock
Provide the functional helper as __io_kill_linked_timeout(), and have
separate the locking from it.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently io_steal_work() is disabled, and every linked request should
go through task_work for initialisation. Do io_req_work_grab_env()
just before io-wq punting and for the whole link, so any request
reachable by io_steal_work() is prepared.
This is also interesting for another reason -- it localises
io_req_work_grab_env() into one place just before io-wq punting, helping
to to better manage req->work lifetime and add some neat
cleanup/optimisations later.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Remove io_req_work_grab_env() call from io_req_defer_prep(), just call
it when neccessary.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Place io_req_init_async() in io_req_work_grab_env() so it won't be
forgotten.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Remove struct io_op_def *def parameter from io_req_work_grab_env(),
it's trivially deducible from req->opcode and fast. The API is
cleaner this way, and also helps the complier to understand
that it's a real constant and could be register-cached.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
After __io_free_req() puts a ctx ref, it should be assumed that the ctx
may already be gone. However, it can be accessed when putting the
fallback req. Free the req first and then put the ctx.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There are too many useless flags, kill REQ_F_TIMEOUT_NOSEQ, which can be
easily infered from req.timeout itself.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Generally, it's better to return a value directly than having out
parameter. It's cleaner and saves from some kinds of ugly bugs.
May also be faster.
Return next request from io_req_find_next() and friends directly
instead of passing out parameter.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Linked timeout cancellation code is repeated in in io_req_link_next()
and io_fail_links(), and they differ in details even though shouldn't.
Basing on the fact that there is maximum one armed linked timeout in
a link, and it immediately follows the head, extract a function that
will check for it and defuse.
Justification:
- DRY and cleaner
- better inlining for io_req_link_next() (just 1 call site now)
- isolates linked_timeouts from common path
- reduces time under spinlock for failed links
- actually less code
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
[axboe: fold in locking fix for io_fail_links()]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The header file linux/uio.h includes crypto/hash.h which pulls in
most of the Crypto API. Since linux/uio.h is used throughout the
kernel this means that every tiny bit of change to the Crypto API
causes the entire kernel to get rebuilt.
This patch fixes this by moving it into lib/iov_iter.c instead
where it is actually used.
This patch also fixes the ifdef to use CRYPTO_HASH instead of just
CRYPTO which does not guarantee the existence of ahash.
Unfortunately a number of drivers were relying on linux/uio.h to
provide access to linux/slab.h. This patch adds inclusions of
linux/slab.h as detected by build failures.
Also skbuff.h was relying on this to provide a declaration for
ahash_request. This patch adds a forward declaration instead.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
In flush_delete_work, instead of flushing each individual pending
delayed work item, cancel and re-queue them for immediate execution.
The waiting isn't needed here because we're already waiting for all
queued work items to complete in gfs2_flush_delete_work. This makes the
code more efficient, but more importantly, it avoids sleeping during a
rhashtable walk, inside rcu_read_lock().
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Log flush operations (gfs2_log_flush()) can target a specific transaction.
But if the function encounters errors (e.g. io errors) and withdraws,
the transaction was only freed it if was queued to one of the ail lists.
If the withdraw occurred before the transaction was queued to the ail1
list, function ail_drain never freed it. The result was:
BUG gfs2_trans: Objects remaining in gfs2_trans on __kmem_cache_shutdown()
This patch makes log_flush() add the targeted transaction to the ail1
list so that function ail_drain() will find and free it properly.
Cc: stable@vger.kernel.org # v5.7+
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Callers expect gfs2_inode_lookup to return an inode pointer or ERR_PTR(error).
Commit b66648ad6d caused it to return NULL instead of ERR_PTR(-ESTALE) in
some cases. Fix that.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Fixes: b66648ad6d ("gfs2: Move inode generation number check into gfs2_inode_lookup")
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
I don't understand this code well, but I'm seeing a warning about a
still-referenced inode on unmount, and every other similar filesystem
does a dput() here.
Fixes: e8a79fb14f ("nfsd: add nfsd/clients directory")
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
We don't drop the reference on the nfsdfs filesystem with
mntput(nn->nfsd_mnt) until nfsd_exit_net(), but that won't be called
until the nfsd module's unloaded, and we can't unload the module as long
as there's a reference on nfsdfs. So this prevents module unloading.
Fixes: 2c830dd720 ("nfsd: persist nfsd filesystem across mounts")
Reported-and-Tested-by: Luo Xiaogang <lxgrxd@163.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
This reverts commit e9c15badbb ("fs: Do not check if there is a
fsnotify watcher on pseudo inodes"). The commit intended to eliminate
fsnotify-related overhead for pseudo inodes but it is broken in
concept. inotify can receive events of pipe files under /proc/X/fd and
chromium relies on close and open events for sandboxing. Maxim Levitsky
reported the following
Chromium starts as a white rectangle, shows few white rectangles that
resemble its notifications and then crashes.
The stdout output from chromium:
[mlevitsk@starship ~]$chromium-freeworld
mesa: for the --simplifycfg-sink-common option: may only occur zero or one times!
mesa: for the --global-isel-abort option: may only occur zero or one times!
[3379:3379:0628/135151.440930:ERROR:browser_switcher_service.cc(238)] XXX Init()
../../sandbox/linux/seccomp-bpf-helpers/sigsys_handlers.cc:**CRASHING**:seccomp-bpf failure in syscall 0072
Received signal 11 SEGV_MAPERR 0000004a9048
Crashes are not universal but even if chromium does not crash, it certainly
does not work properly. While filtering just modify and access might be
safe, the benefit is not worth the risk hence the revert.
Reported-by: Maxim Levitsky <mlevitsk@redhat.com>
Fixes: e9c15badbb ("fs: Do not check if there is a fsnotify watcher on pseudo inodes")
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Don't forget to wake up a process to which io_rw_reissue() added
task_work.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
generic_file_fsync() exfat used could not guarantee the consistency of
a file because it has flushed not dirty metadata but only dirty data pages
for a file.
Instead of that, use exfat_file_fsync() for files and directories so that
it guarantees to commit both the metadata and data pages for a file.
Signed-off-by: Sungjong Seo <sj1557.seo@samsung.com>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
We need to commit dirty metadata and pages to disk
before remounting exfat as read-only.
This fixes a failure in xfstests generic/452
generic/452 does the following:
cp something <exfat>/
mount -o remount,ro <exfat>
the <exfat>/something is corrupted. because while
exfat is remounted as read-only, exfat doesn't
have a chance to commit metadata and
vfs invalidates page caches in a block device.
Signed-off-by: Hyunchul Lee <hyc.lee@gmail.com>
Acked-by: Sungjong Seo <sj1557.seo@samsung.com>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
If the second exfat_get_dentry() call fails then we need to release
"old_bh" before returning. There is a similar bug in exfat_move_file().
Fixes: 5f2aa07507 ("exfat: add inode operations")
Reported-by: Markus Elfring <Markus.Elfring@web.de>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Some fsck tool complain that padding part of the FileName field
is not set to the value 0000h. So let's maintain filesystem cleaner,
as exfat's spec. recommendation.
Signed-off-by: Hyeongseok.Kim <Hyeongseok@gmail.com>
Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
- Robustness fix for TPM log parsing code
- kobject refcount fix for the ESRT parsing code
- Two efivarfs fixes to make it behave more like an ordinary file system
- Style fixup for zero length arrays
- Fix a regression in path separator handling in the initrd loader
- Fix a missing prototype warning
- Add some kerneldoc headers for newly introduced stub routines
- Allow support for SSDT overrides via EFI variables to be disabled
- Report CPU mode and MMU state upon entry for 32-bit ARM
- Use the correct stack pointer alignment when entering from mixed mode
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAl74344RHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1heMw//b9UPgWlkH2xnAjo9QeFvounyT8XrLLnW
QkhkiIGDvM2qWUmRotRrxRq39P9A+AH4x0krWTZam67W1OuWleUjwQWrnYE8vhql
xdIAJmD1oWTi07p4SFzLVA7mJvMX5xenCYvGTALoHtsGnLbOiRGSSTnuXZr1c6Kd
2XcY89kpcZGXgw9VCNV2Ez1g0OlCHS1N5LV31WGUcFl30Q3aZpdLmnFUzKLUbRgb
sTNMlu2mLGSs/ZaTAaOGNzFkxGVJI2+0C+ApKvmR9WR7+5n9Brs27RSLgPMViXun
BnsTewMdxNBXITgLxcUEtngPEWIzqrwJVbLaZVeWcWez0g11GIt0+wonpRnxWjHA
XgQm00sK4HIvs+3YWUJ1PpXyjUmiPvOKZM5um9zsCiYml+RzzIm6bznII4Lh7rQe
4kOLXkxaww+LS4r3+si6Q16og4zd/zZs4MoxaF7frTJ6oiUWOpBJqdf92Kiz0DaS
kfQ2I3d/PdZvWuNIiBCfX9bjd7q0zq0zyIghP7460lx88aaHb20samTtl+qjN4MM
Wpik/soeYi5pICDRRwiAHhpgK+li4LLjP3D81rYX8pEaAiubpjCwqLxIexQ6XJCV
UZAR4swswrYntdXfUMmRnPBsLWWLePq6sRAvlent2si2cp+65f8I1xZ0ClK7YMjr
qXUW7jOp/88=
=F0bv
-----END PGP SIGNATURE-----
Merge tag 'efi-urgent-2020-06-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull EFI fixes from Ingo Molnar:
- Fix build regression on v4.8 and older
- Robustness fix for TPM log parsing code
- kobject refcount fix for the ESRT parsing code
- Two efivarfs fixes to make it behave more like an ordinary file
system
- Style fixup for zero length arrays
- Fix a regression in path separator handling in the initrd loader
- Fix a missing prototype warning
- Add some kerneldoc headers for newly introduced stub routines
- Allow support for SSDT overrides via EFI variables to be disabled
- Report CPU mode and MMU state upon entry for 32-bit ARM
- Use the correct stack pointer alignment when entering from mixed mode
* tag 'efi-urgent-2020-06-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
efi/libstub: arm: Print CPU boot mode and MMU state at boot
efi/libstub: arm: Omit arch specific config table matching array on arm64
efi/x86: Setup stack correctly for efi_pe_entry
efi: Make it possible to disable efivar_ssdt entirely
efi/libstub: Descriptions for stub helper functions
efi/libstub: Fix path separator regression
efi/libstub: Fix missing-prototype warning for skip_spaces()
efi: Replace zero-length array and use struct_size() helper
efivarfs: Don't return -EINTR when rate-limiting reads
efivarfs: Update inode modification time for successful writes
efi/esrt: Fix reference count leak in esre_create_sysfs_entry.
efi/tpm: Verify event log header before parsing
efi/x86: Fix build with gcc 4
req->iopoll() is not necessarily called by a task that submitted a
request. Because of that, it's dangerous to grab_env() and punt async on
-EGAIN, potentially grabbing another task's mm and corrupting its
memory.
Do resubmit from the submitter task context.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There are a lot of new users of task_work, and some of task_work_add()
may happen while we do io polling, thus make iopoll from time to time
to do task_work_run(), so it doesn't poll for sitting there reqs.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Assign req->result to io_size early in io_{read,write}(), it's enough
and makes it more straightforward.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
After pulling nxt from a request, it's no more a links head, so clear
REQ_F_LINK_HEAD. Absence of this flag also indicates that there are no
linked requests, so replacing REQ_F_LINK_NEXT, which can be killed.
Linked timeouts also behave leaving the flag intact when necessary.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Move all batch free bits close to each other and rename in a consistent
way.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There is no reason to not batch deallocation of linked requests. Take
away its next req first and handle it as everything else in
io_req_multi_free().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Every request in io_req_multi_free() is has ->file set. Instead of
pointlessly defering and counting reqs with file, dismantle it on place
and save for batch dealloc.
It also saves us from potentially skipping io_cleanup_req(), put_task(),
etc. Never happens though, becacuse ->file is always there.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_free_req_many() is used only for iopoll requests, i.e. reads/writes.
Hence no need to batch inflight unhooking. For safety, it'll be done by
io_dismantle_req(), which replaces __io_req_aux_free(), and looks more
solid and cleaner.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We won't have valid ring_fd, ring_file in task work. Grab files early.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
No reason to mark a head of a link as for-async in io_req_defer_prep().
grab_env(), etc. That will be done further during submission if
neccessary.
Mark for_async=false saving extra grab_env() in many cases.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
It's not enough to check for REQ_F_WORK_INITIALIZED and punt async
assuming that io_req_work_grab_env() was called, it may not have been.
E.g. io_close_prep() and personality path set the flag without further
async init.
As a quick fix, always pass next work through io_req_task_queue().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The cell name stored in the afs_cell struct is a 64-char + NUL buffer -
when it needs to be able to handle up to AFS_MAXCELLNAME (256 chars) + NUL.
Fix this by changing the array to a pointer and allocating the string.
Found using Coverity.
Fixes: 989782dcdc ("afs: Overhaul cell database management")
Reported-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAl73SpUACgkQiiy9cAdy
T1EseAwAkwkY8r/7LbNDnil+xQivdCfxuc+FCYXpw3HBvR/Zfjb+n/01RIpJoJw7
kl6MyFUBALrNFY6DvhsNErn7cP9O5Fjg73AfDfE2ySG4N+xZt+EcbbNZ6MtWwdQQ
a+ZzelGkT1lg+x4Xzz6oy9eWjvHPu6V9e8ycWjl2uRc7I19ze9NinV0rWOp80DAN
uiVEZo/5f28qTYIVP9rFayKN4TcOQYYRYLukP9zH9s0EBvLYQHGefvE8f01iLdm4
JyDi/4hmGIS4e7IaROImX25DKxPQTVUytjhmxHdmjg1Or0O3WMSr7zLWauJNn1G8
/820ec/CgBLtqpD6Y9vUar01+U3Q7Qms/UrEwx+WVVpZPDFVNKDfd6aLlj+UCJeQ
PHERRVKdHMyz5iaqY4hZhS90uizt4mHAmoNf+YcbjdaiBvebqaAuo/foIwadYBEm
1ZGevYUIt3cpvbAIv/I3OSrTSvY1/OQZmkHj5IZ0iZXdJaMeOhgrXYyIeL95aJEU
d6x8VYpI
=5BTi
-----END PGP SIGNATURE-----
Merge tag '5.8-rc2-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fixes from Steve French:
"Six cifs/smb3 fixes, three of them for stable.
Fixes xfstests 451, 313 and 316"
* tag '5.8-rc2-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
cifs: misc: Use array_size() in if-statement controlling expression
cifs: update ctime and mtime during truncate
cifs/smb3: Fix data inconsistent when punch hole
cifs/smb3: Fix data inconsistent when zero file range
cifs: Fix double add page to memcg when cifs_readpages
cifs: Fix cached_fid refcnt leak in open_shroot
Stable Fixes:
- xprtrdma: Fix handling of RDMA_ERROR replies
- sunrpc: Fix rollback in rpc_gssd_dummy_populate()
- pNFS/flexfiles: Fix list corruption if the mirror count changes
- NFSv4: Fix CLOSE not waiting for direct IO completion
- SUNRPC: Properly set the @subbuf parameter of xdr_buf_subsegment()
Other Fixes:
- xprtrdma: Fix a use-after-free with r_xprt->rx_ep
- Fix other xprtrdma races during disconnect
- NFS: Fix memory leak of export_path
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAl72aFkACgkQ18tUv7Cl
QOsWfg//ewmCjJV1LGJJM2ntxcN9xAZJdIY3cuYfxQaDr/qwdbgh8DNbPlkImaoB
aW5DVciqKJ8HpJchko4wYvNbbnAI32Kd87RcmUYoXwwUY+H2kwuOf41Vm4jfrScF
NHiN5b5GTUz2X/s83NsbE9uGCFE1TS8pJkn6chVEWJY+QOjWpQmJrFQ0E9ULwP1O
g46Dym9RtILrsNyGcSks6Rnts4Ujm3+PDW+hLWjGzwotDgMS2LGZ7oQpfcs0NvHs
A3RjSOywltockeKvqchibTZMAXjIxqLV8cmo6AsT2H3llGbr+F61DkBMuTgqozhp
QAONwvxDv6EcnsS5NnOJJdhwG7IK1dPIA5oxmGq7XlhShZF+hrfvGYyhkmDkdf8V
9wfpV6foPC07hTcd+h0+A5DTh4Bxi71q+VIvVyQzgvX4UgRMrRptkNUzAm/Tn56C
JoFtjxswy0W476rqYaIJKjs/Mv1eozwvEifIuwpMu+VWiwiNEygNKyvmdVYxeDmv
13hjXVbQCCjyPvQSmBRKUEOR07DxHUt5Kcy9xHQ5ZXr5KdCERSt9MfXucxUxMQTA
JG143HPt3P7tkr+1wIyerN94w0kZGQqtQR/BHd5Ms0abrv+jgqjQVleFd4vX2igU
o/pCH4SLEhEndChU6lvv534ilRSH5LLQifyV2ThFFdZpOhtw7tU=
=BzX9
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-5.8-2' of git://git.linux-nfs.org/projects/anna/linux-nfs
Pull NFS client bugfixes from Anna Schumaker:
"Stable Fixes:
- xprtrdma: Fix handling of RDMA_ERROR replies
- sunrpc: Fix rollback in rpc_gssd_dummy_populate()
- pNFS/flexfiles: Fix list corruption if the mirror count changes
- NFSv4: Fix CLOSE not waiting for direct IO completion
- SUNRPC: Properly set the @subbuf parameter of xdr_buf_subsegment()
Other Fixes:
- xprtrdma: Fix a use-after-free with r_xprt->rx_ep
- Fix other xprtrdma races during disconnect
- NFS: Fix memory leak of export_path"
* tag 'nfs-for-5.8-2' of git://git.linux-nfs.org/projects/anna/linux-nfs:
SUNRPC: Properly set the @subbuf parameter of xdr_buf_subsegment()
NFSv4 fix CLOSE not waiting for direct IO compeletion
pNFS/flexfiles: Fix list corruption if the mirror count changes
nfs: Fix memory leak of export_path
sunrpc: fixed rollback in rpc_gssd_dummy_populate()
xprtrdma: Fix handling of RDMA_ERROR replies
xprtrdma: Clean up disconnect
xprtrdma: Clean up synopsis of rpcrdma_flush_disconnect()
xprtrdma: Use re_connect_status safely in rpcrdma_xprt_connect()
xprtrdma: Prevent dereferencing r_xprt->rx_ep after it is freed
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl72TkcQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpsr2D/40VWZtyhFeUlpE+Qiodz0ZSmREZxCj1QQ1
oE8vvyfGVjYbGAeWUsR7hpXBTv9bnEHpF7GRumjOAuML5+cuhh1XSWUHitnJcuHC
dX6K3ueh6x8l3mn2EKK/NOaHT6/4STwO7er3lX1wQIAAhIXp7Och2geOL+a3PoZd
NMcGaQ3aPrr0Qo7hW7ZMaAmYROewLvZ7p8aIowmBXqTT1Qxy9Ig29HtaDbEkno0X
TWy/tuU73nli4QWwWIst14Oeqfm81xDjLRSDa9tID0nvn5ZtB6wy8yAa0QRZS90w
t9dB02VVQl+Ql4ZzrXnRTJciP6B4jFvir61oq9vSnDp51LQGyQb/rATXaoiEPPc3
uQARCrB4MDAWFs70BX6MFprI0NNZIdCZK+Okaki2HsjnI5uJQvN5Hrlmo1Khyate
doO9HjQtDenFyQcha+ea0SUWzXKV/Uss4WemES5Sem6CFPVMkZ/vco2d7D6PEJc1
AX5efoiBcd/NNL5XfVQoe7HTuCHIczXXEHP2FAgJc8q1lp7ROUnWQZsm5968ERqs
pelRq5jHNd9ZF29jEfnYvxidJCc1+34YrKmQ9OPgJkqaoQ9aBGANsI9eM6cQ5CLx
X7riSQh+BTqdAtczT5HDFX15GF9VxsD3CGaOrhG1f7aZm7J19bIImP5+Uh/AHY49
iBkyVZ7fNA==
=ar3Q
-----END PGP SIGNATURE-----
Merge tag 'io_uring-5.8-2020-06-26' of git://git.kernel.dk/linux-block
Pull io_uring fixes from Jens Axboe:
"Three small fixes:
- Close a corner case for polled IO resubmission (Pavel)
- Toss commands when exiting (Pavel)
- Fix SQPOLL conditional reschedule on perpetually busy submit
(Xuan)"
* tag 'io_uring-5.8-2020-06-26' of git://git.kernel.dk/linux-block:
io_uring: fix current->mm NULL dereference on exit
io_uring: fix hanging iopoll in case of -EAGAIN
io_uring: fix io_sq_thread no schedule when busy
Fix build errors when CONFIG_NET is not set/enabled:
../fs/io_uring.c:5472:10: error: too many arguments to function ‘io_sendmsg’
../fs/io_uring.c:5474:10: error: too many arguments to function ‘io_send’
../fs/io_uring.c:5484:10: error: too many arguments to function ‘io_recvmsg’
../fs/io_uring.c:5486:10: error: too many arguments to function ‘io_recv’
../fs/io_uring.c:5510:9: error: too many arguments to function ‘io_accept’
../fs/io_uring.c:5518:9: error: too many arguments to function ‘io_connect’
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: io-uring@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Merge in changes that went into 5.8-rc3. GIT will silently do the
merge, but we still need a tweak on top of that since
io_complete_rw_common() was modified to take a io_comp_state pointer.
The auto-merge fails on that, and we end up with something that
doesn't compile.
* io_uring-5.8:
io_uring: fix current->mm NULL dereference on exit
io_uring: fix hanging iopoll in case of -EAGAIN
io_uring: fix io_sq_thread no schedule when busy
Signed-off-by: Jens Axboe <axboe@kernel.dk>
It's easier to return next work from ->do_work() than
having an in-out argument. Looks nicer and easier to compile.
Also, merge io_wq_assign_next() into its only user.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently links are always done in an async fashion, unless we catch them
inline after we successfully complete a request without having to resort
to blocking. This isn't necessarily the most efficient approach, it'd be
more ideal if we could just use the task_work handling for this.
Outside of saving an async jump, we can also do less prep work for these
kinds of requests.
Running dependent links from the task_work handler yields some nice
performance benefits. As an example, examples/link-cp from the liburing
repository uses read+write links to implement a copy operation. Without
this patch, the a cache fold 4G file read from a VM runs in about 3
seconds:
$ time examples/link-cp /data/file /dev/null
real 0m2.986s
user 0m0.051s
sys 0m2.843s
and a subsequent cache hot run looks like this:
$ time examples/link-cp /data/file /dev/null
real 0m0.898s
user 0m0.069s
sys 0m0.797s
With this patch in place, the cold case takes about 2.4 seconds:
$ time examples/link-cp /data/file /dev/null
real 0m2.400s
user 0m0.020s
sys 0m2.366s
and the cache hot case looks like this:
$ time examples/link-cp /data/file /dev/null
real 0m0.676s
user 0m0.010s
sys 0m0.665s
As expected, the (mostly) cache hot case yields the biggest improvement,
running about 25% faster with this change, while the cache cold case
yields about a 20% increase in performance. Outside of the performance
increase, we're using less CPU as well, as we're not using the async
offload threads at all for this anymore.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Figuring out the root case for the REMOVE/CLOSE race and
suggesting the solution was done by Neil Brown.
Currently what happens is that direct IO calls hold a reference
on the open context which is decremented as an asynchronous task
in the nfs_direct_complete(). Before reference is decremented,
control is returned to the application which is free to close the
file. When close is being processed, it decrements its reference
on the open_context but since directIO still holds one, it doesn't
sent a close on the wire. It returns control to the application
which is free to do other operations. For instance, it can delete a
file. Direct IO is finally releasing its reference and triggering
an asynchronous close. Which races with the REMOVE. On the server,
REMOVE can be processed before the CLOSE, failing the REMOVE with
EACCES as the file is still opened.
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Suggested-by: Neil Brown <neilb@suse.com>
CC: stable@vger.kernel.org
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
If the mirror count changes in the new layout we pick up inside
ff_layout_pg_init_write(), then we can end up adding the
request to the wrong mirror and corrupting the mirror->pg_list.
Fixes: d600ad1f2b ("NFS41: pop some layoutget errors to application")
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
The try_location function is called within a loop by nfs_follow_referral.
try_location calls nfs4_pathname_string to created the export_path.
nfs4_pathname_string allocates the memory. export_path is stored in the
nfs_fs_context/fs_context structure similarly as hostname and source.
But whereas the ctx hostname and source are freed before assignment,
export_path is not. So if there are multiple loops, the new export_path
will overwrite the old without the old being freed.
So call kfree for export_path.
Signed-off-by: Tom Rix <trix@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
In the ocfs2 disk layout, slot number is 16 bits, but in ocfs2
implementation, slot number is 32 bits. Usually this will not cause any
issue, because slot number is converted from u16 to u32, but
OCFS2_INVALID_SLOT was defined as -1, when an invalid slot number from
disk was obtained, its value was (u16)-1, and it was converted to u32.
Then the following checking in get_local_system_inode will be always
skipped:
static struct inode **get_local_system_inode(struct ocfs2_super *osb,
int type,
u32 slot)
{
BUG_ON(slot == OCFS2_INVALID_SLOT);
...
}
Link: http://lkml.kernel.org/r/20200616183829.87211-5-junxiao.bi@oracle.com
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Set global_inode_alloc as OCFS2_FIRST_ONLINE_SYSTEM_INODE, that will
make it load during mount. It can be used to test whether some
global/system inodes are valid. One use case is that nfsd will test
whether root inode is valid.
Link: http://lkml.kernel.org/r/20200616183829.87211-3-junxiao.bi@oracle.com
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Jun Piao <piaojun@huawei.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "ocfs2: fix nfsd over ocfs2 issues", v2.
This is a series of patches to fix issues on nfsd over ocfs2. patch 1
is to avoid inode removed while nfsd access it patch 2 & 3 is to fix a
panic issue.
This patch (of 4):
When nfsd is getting file dentry using handle or parent dentry of some
dentry, one cluster lock is used to avoid inode removed from other node,
but it still could be removed from local node, so use a rw lock to avoid
this.
Link: http://lkml.kernel.org/r/20200616183829.87211-1-junxiao.bi@oracle.com
Link: http://lkml.kernel.org/r/20200616183829.87211-2-junxiao.bi@oracle.com
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Jun Piao <piaojun@huawei.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAl706ikACgkQnJ2qBz9k
QNkk/Af9E2/VzEy4CNsGWTBdxRCZQ12Q3n1pe+ReqkmQDEWjN4FxTuhukw9dtsxE
a6ZIm9EXOyFmu+LnrSFoskWDBDCrgwo2zOF2kW/pjs9KRW04l0sWuGEI5btKW9/2
Q/uFUJjpgrQ3sxSbj2Df0Q6k0CVBQMTzoJvH2QobViRgzoJeSMr0nE+Sw7PRHzOB
Wh3Fis65B8ZrxBMnTPuwzo3zLrvvqtzW6MGRSK0HxOBR1R9KCWvkJgBdyMy80/tg
bX2VvpUL6FRUmc36B1VJ/d3hon13nQ0GthTvD1FuBYHmVf/z5AU1gtQOIGl5QkWi
Q6PoW+lL8m+gTcN29stz1KHHrvhPbQ==
=nQGb
-----END PGP SIGNATURE-----
Merge tag 'fsnotify_for_v5.8-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull fsnotify fixlet from Jan Kara:
"A performance improvement to reduce impact of fsnotify for inodes
where it isn't used"
* tag 'fsnotify_for_v5.8-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
fs: Do not check if there is a fsnotify watcher on pseudo inodes
A bit more surgery required here, as completions are generally done
through the kiocb->ki_complete() callback, even if they complete inline.
This enables the regular read/write path to use the io_comp_state
logic to batch inline completions.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Provide the completion state to the handlers that we know can complete
inline, so they can utilize this for batching completions.
Cap the max batch count at 32. This should be enough to provide a good
amortization of the cost of the lock+commit dance for completions, while
still being low enough not to cause any real latency issues for SQPOLL
applications.
Xuan Zhuo <xuanzhuo@linux.alibaba.com> reports that this changes his
profile from:
17.97% [kernel] [k] copy_user_generic_unrolled
13.92% [kernel] [k] io_commit_cqring
11.04% [kernel] [k] __io_cqring_fill_event
10.33% [kernel] [k] udp_recvmsg
5.94% [kernel] [k] skb_release_data
4.31% [kernel] [k] udp_rmem_release
2.68% [kernel] [k] __check_object_size
2.24% [kernel] [k] __slab_free
2.22% [kernel] [k] _raw_spin_lock_bh
2.21% [kernel] [k] kmem_cache_free
2.13% [kernel] [k] free_pcppages_bulk
1.83% [kernel] [k] io_submit_sqes
1.38% [kernel] [k] page_frag_free
1.31% [kernel] [k] inet_recvmsg
to
19.99% [kernel] [k] copy_user_generic_unrolled
11.63% [kernel] [k] skb_release_data
9.36% [kernel] [k] udp_rmem_release
8.64% [kernel] [k] udp_recvmsg
6.21% [kernel] [k] __slab_free
4.39% [kernel] [k] __check_object_size
3.64% [kernel] [k] free_pcppages_bulk
2.41% [kernel] [k] kmem_cache_free
2.00% [kernel] [k] io_submit_sqes
1.95% [kernel] [k] page_frag_free
1.54% [kernel] [k] io_put_req
[...]
0.07% [kernel] [k] io_commit_cqring
0.44% [kernel] [k] __io_cqring_fill_event
Signed-off-by: Jens Axboe <axboe@kernel.dk>
No functional changes in this patch, just in preparation for having the
completion state be available on the issue side. Later on, this will
allow requests that complete inline to be completed in batches.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
No functional changes in this patch, just in preparation for passing back
pending completions to the caller and completing them in a batched
fashion.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We have lots of callers of:
io_cqring_add_event(req, result);
io_put_req(req);
Provide a helper that does this for us. It helps clean up the code, and
also provides a more convenient location for us to change the completion
handling.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
__io_queue_sqe() tries to handle all request of a link,
so it's not enough to grab mm in io_sq_thread_acquire_mm()
based just on the head.
Don't check req->needs_mm and do it always.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
io_do_iopoll() won't do anything with a request unless
req->iopoll_completed is set. So io_complete_rw_iopoll() has to set
it, otherwise io_do_iopoll() will poll a file again and again even
though the request of interest was completed long time ago.
Also, remove -EAGAIN check from io_issue_sqe() as it races with
the changed lines. The request will take the long way and be
resubmitted from io_iopoll*().
io_kiocb's result and iopoll_completed")
Fixes: bbde017a32 ("io_uring: add memory barrier to synchronize
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Fix a regression which uses potential uninitialized
high 32-bit value unexpectedly recently observed with
specific compiler options.
-----BEGIN PGP SIGNATURE-----
iIsEABYIADMWIQThPAmQN9sSA0DVxtI5NzHcH7XmBAUCXvO6thUcaHNpYW5na2Fv
QHJlZGhhdC5jb20ACgkQOTcx3B+15gT8eQEA/W9d/II6pqD1KD7Oh7K8AIt7kU46
JTBY6bA/lmMC/GkA/1cqAOxDfEGmWzH5Y/Hz7CLgnsRQYo90i9JZ1tcFAWkK
=kUeU
-----END PGP SIGNATURE-----
Merge tag 'erofs-for-5.8-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs
Pull erofs fix from Gao Xiang:
"Fix a regression which uses potential uninitialized high 32-bit value
unexpectedly recently observed with specific compiler options"
* tag 'erofs-for-5.8-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
erofs: fix partially uninitialized misuse in z_erofs_onlinepage_fixup
Move the struct block_device definition together with most of the
block layer definitions, as it has nothing to do with the rest of fs.h.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Move most of the block related definition out of fs.h into more suitable
headers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Hongyu reported "id != index" in z_erofs_onlinepage_fixup() with
specific aarch64 environment easily, which wasn't shown before.
After digging into that, I found that high 32 bits of page->private
was set to 0xaaaaaaaa rather than 0 (due to z_erofs_onlinepage_init
behavior with specific compiler options). Actually we only use low
32 bits to keep the page information since page->private is only 4
bytes on most 32-bit platforms. However z_erofs_onlinepage_fixup()
uses the upper 32 bits by mistake.
Let's fix it now.
Reported-and-tested-by: Hongyu Jin <hongyu.jin@unisoc.com>
Fixes: 3883a79abd ("staging: erofs: introduce VLE decompression support")
Cc: <stable@vger.kernel.org> # 4.19+
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Link: https://lore.kernel.org/r/20200618234349.22553-1-hsiangkao@aol.com
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
Use array_size() instead of the open-coded version in the controlling
expression of the if statement.
Also, while there, use the preferred form for passing a size of a struct.
The alternative form where struct name is spelled out hurts readability
and introduces an opportunity for a bug when the pointer variable type is
changed but the corresponding sizeof that is passed as argument is not.
This issue was found with the help of Coccinelle and, audited and fixed
manually.
Addresses-KSPP-ID: https://github.com/KSPP/linux/issues/83
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
As the man description of the truncate, if the size changed,
then the st_ctime and st_mtime fields should be updated. But
in cifs, we doesn't do it.
It lead the xfstests generic/313 failed.
So, add the ATTR_MTIME|ATTR_CTIME flags on attrs when change
the file size
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Zhang Xiaoxu <zhangxiaoxu5@huawei.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
When punch hole success, we also can read old data from file:
# strace -e trace=pread64,fallocate xfs_io -f -c "pread 20 40" \
-c "fpunch 20 40" -c"pread 20 40" file
pread64(3, " version 5.8.0-rc1+"..., 40, 20) = 40
fallocate(3, FALLOC_FL_KEEP_SIZE|FALLOC_FL_PUNCH_HOLE, 20, 40) = 0
pread64(3, " version 5.8.0-rc1+"..., 40, 20) = 40
CIFS implements the fallocate(FALLOCATE_FL_PUNCH_HOLE) with send SMB
ioctl(FSCTL_SET_ZERO_DATA) to server. It just set the range of the
remote file to zero, but local page caches not updated, then the
local page caches inconsistent with server.
Also can be found by xfstests generic/316.
So, we need to remove the page caches before send the SMB
ioctl(FSCTL_SET_ZERO_DATA) to server.
Fixes: 31742c5a33 ("enable fallocate punch hole ("fallocate -p") for SMB3")
Suggested-by: Pavel Shilovsky <pshilov@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Zhang Xiaoxu <zhangxiaoxu5@huawei.com>
Cc: stable@vger.kernel.org # v3.17
Signed-off-by: Steve French <stfrench@microsoft.com>
CIFS implements the fallocate(FALLOC_FL_ZERO_RANGE) with send SMB
ioctl(FSCTL_SET_ZERO_DATA) to server. It just set the range of the
remote file to zero, but local page cache not update, then the data
inconsistent with server, which leads the xfstest generic/008 failed.
So we need to remove the local page caches before send SMB
ioctl(FSCTL_SET_ZERO_DATA) to server. After next read, it will
re-cache it.
Fixes: 30175628bf ("[SMB3] Enable fallocate -z support for SMB3 mounts")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Zhang Xiaoxu <zhangxiaoxu5@huawei.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Cc: stable@vger.kernel.org # v3.17
Signed-off-by: Steve French <stfrench@microsoft.com>
When the user consumes and generates sqe at a fast rate,
io_sqring_entries can always get sqe, and ret will not be equal to -EBUSY,
so that io_sq_thread will never call cond_resched or schedule, and then
we will get the following system error prompt:
rcu: INFO: rcu_sched self-detected stall on CPU
or
watchdog: BUG: soft lockup-CPU#23 stuck for 112s! [io_uring-sq:1863]
This patch checks whether need to call cond_resched() by checking
the need_resched() function every cycle.
Suggested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl7yABEACgkQxWXV+ddt
WDtGoQ//cBWRRWLlLTRgpaKnY6t8JgVUqNvPJISHHf45cNbOJh0yo8hUuKMW+440
8ovYqtFoZD+JHcHDE2sMueHBFe38rG5eT/zh8j/ruhBzeJcTb3lSYz53d7sfl5kD
cIVngPEVlGziDqW2PsWLlyh8ulBGzY3YmS6kAEkyP/6/uhE/B1dq6qn3GUibkbKI
dfNjHTLwZVmwnqoxLu8ZE2/hHFbzhl0sm09snsXYSVu13g36+edp0Z+pF0MlKGVk
G6YrnZcts8TWwneZ4nogD9f2CMvzMhYDDLyEjsX0Ouhb+Cu2WNxdfrJ2ZbPNU82w
EGbo451mIt6Ht8wicEjh27LWLI7YMraF/Ig/ODMdvFBYDbhl4voX2t+4n+p5Czbg
AW6Wtg/q5EaaNFqrTsqAAiUn0+R3sMiDWrE0AewcE7syPGqQ2XMwP4la5pZ36rz8
8Vo5KIGo44PIJ1dMwcX+bg3HTtUnBJSxE5fUi0rJ3ZfHKGjLS79VonEeQjh3QD6W
0UlK+jCjo6KZoe33XdVV2hVkHd63ZIlliXWv0LOR+gpmqqgW2b3wf181zTvo/5sI
v0fDjstA9caqf68ChPE9jJi7rZPp/AL1yAQGEiNzjKm4U431TeZJl2cpREicMJDg
FCDU51t9425h8BFkM4scErX2/53F1SNNNSlAsFBGvgJkx6rTENs=
=/eCR
-----END PGP SIGNATURE-----
Merge tag 'for-5.8-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"A number of fixes, located in two areas, one performance fix and one
fixup for better integration with another patchset.
- bug fixes in nowait aio:
- fix snapshot creation hang after nowait-aio was used
- fix failure to write to prealloc extent past EOF
- don't block when extent range is locked
- block group fixes:
- relocation failure when scrub runs in parallel
- refcount fix when removing fails
- fix race between removal and creation
- space accounting fixes
- reinstante fast path check for log tree at unlink time, fixes
performance drop up to 30% in REAIM
- kzfree/kfree fixup to ease treewide patchset renaming kzfree"
* tag 'for-5.8-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: use kfree() in btrfs_ioctl_get_subvol_info()
btrfs: fix RWF_NOWAIT writes blocking on extent locks and waiting for IO
btrfs: fix RWF_NOWAIT write not failling when we need to cow
btrfs: fix failure of RWF_NOWAIT write into prealloc extent beyond eof
btrfs: fix hang on snapshot creation after RWF_NOWAIT write
btrfs: check if a log root exists before locking the log_mutex on unlink
btrfs: fix bytes_may_use underflow when running balance and scrub in parallel
btrfs: fix data block group relocation failure due to concurrent scrub
btrfs: fix race between block group removal and block group creation
btrfs: fix a block group ref counter leak after failure to remove block group
xlog_wait() on the CIL context can reference a freed context if the
waiter doesn't get scheduled before the CIL context is freed. This
can happen when a task is on the hard throttle and the CIL push
aborts due to a shutdown. This was detected by generic/019:
thread 1 thread 2
__xfs_trans_commit
xfs_log_commit_cil
<CIL size over hard throttle limit>
xlog_wait
schedule
xlog_cil_push_work
wake_up_all
<shutdown aborts commit>
xlog_cil_committed
kmem_free
remove_wait_queue
spin_lock_irqsave --> UAF
Fix it by moving the wait queue to the CIL rather than keeping it in
in the CIL context that gets freed on push completion. Because the
wait queue is now independent of the CIL context and we might have
multiple contexts in flight at once, only wake the waiters on the
push throttle when the context we are pushing is over the hard
throttle size threshold.
Fixes: 0e7ab7efe7 ("xfs: Throttle commits on delayed background CIL push")
Reported-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
open_shroot() invokes kref_get(), which increases the refcount of the
"tcon->crfid" object. When open_shroot() returns not zero, it means the
open operation failed and close_shroot() will not be called to decrement
the refcount of the "tcon->crfid".
The reference counting issue happens in one normal path of
open_shroot(). When the cached root have been opened successfully in a
concurrent process, the function increases the refcount and jump to
"oshr_free" to return. However the current return value "rc" may not
equal to 0, thus the increased refcount will not be balanced outside the
function, causing a refcnt leak.
Fix this issue by setting the value of "rc" to 0 before jumping to
"oshr_free" label.
Signed-off-by: Xiyu Yang <xiyuyang19@fudan.edu.cn>
Signed-off-by: Xin Tan <tanxin.ctf@gmail.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
CC: Stable <stable@vger.kernel.org>
After recent changes, io_submit_sqes() always passes valid submit state,
so kill leftovers checking it for NULL.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
It's a good practice to modify fields of a struct after but not before
it was initialised. Even though io_init_poll_iocb() doesn't touch
poll->file, call it first.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
REQ_F_MUST_PUNT may seem looking good and clear, but it's the same
as not having REQ_F_NOWAIT set. That rather creates more confusion.
Moreover, it doesn't even affect any behaviour (e.g. see the patch
removing it from io_{read,write}).
Kill theg flag and update already outdated comments.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_{read,write}() {
...
copy_iov: // prep async
if (!(flags & REQ_F_NOWAIT) && !file_can_poll(file))
flags |= REQ_F_MUST_PUNT;
}
REQ_F_MUST_PUNT there is pointless, because if it happens then
REQ_F_NOWAIT is known to be _not_ set, and the request will go
async path in __io_queue_sqe() anyway. file_can_poll() check
is also repeated in arm_poll*(), so don't need it.
Remove the mentioned assignment REQ_F_MUST_PUNT in preparation
for killing the flag.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If the file is flagged with FMODE_BUF_RASYNC, then we don't have to punt
the buffered read to an io-wq worker. Instead we can rely on page
unlocking callbacks to support retry based async IO. This is a lot more
efficient than doing async thread offload.
The retry is done similarly to how we handle poll based retry. From
the unlock callback, we simply queue the retry to a task_work based
handler.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Mark the plug with nowait == true, which will cause requests to avoid
blocking on request allocation. If they do, we catch them and reissue
them from a task_work based handler.
Normally we can catch -EAGAIN directly, but the hard case is for split
requests. As an example, the application issues a 512KB request. The
block core will split this into 128KB if that's the max size for the
device. The first request issues just fine, but we run into -EAGAIN for
some latter splits for the same request. As the bio is split, we don't
get to see the -EAGAIN until one of the actual reads complete, and hence
we cannot handle it inline as part of submission.
This does potentially cause re-reads of parts of the range, as the whole
request is reissued. There's currently no better way to handle this.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-EIO bubbles up like -EAGAIN if we fail to allocate a request at the
lower level. Play it safe and treat it like -EAGAIN in terms of sync
retry, to avoid passing back an errant -EIO.
Catch some of these early for block based file, as non-mq devices
generally do not support NOWAIT. That saves us some overhead by
not first trying, then retrying from async context. We can go straight
to async punt instead.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently we only plug if we're doing more than two request. We're going
to be relying on always having the plug there to pass down information,
so plug unconditionally.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ring pages are not pinned so it is more appropriate to report them
as locked.
Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Applications can pass this flag in to avoid accept thundering herd.
Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
poll events should be 32-bits to cover EPOLLEXCLUSIVE.
Explicit word-swap the poll32_events for big endian to make sure the ABI
is not changed. We call this feature IORING_FEAT_POLL_32BITS,
applications who want to use EPOLLEXCLUSIVE should check the feature bit
first.
Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
- Have recordmcount work with > 64K sections (to support LTO)
- kprobe RCU fixes
- Correct a kprobe critical section with missing mutex
- Remove redundant arch_disarm_kprobe() call
- Fix lockup when kretprobe triggers within kprobe_flush_task()
- Fix memory leak in fetch_op_data operations
- Fix sleep in atomic in ftrace trace array sample code
- Free up memory on failure in sample trace array code
- Fix incorrect reporting of function_graph fields in format file
- Fix quote within quote parsing in bootconfig
- Fix return value of bootconfig tool
- Add testcases for bootconfig tool
- Fix maybe uninitialized warning in ftrace pid file code
- Remove unused variable in tracing_iter_reset()
- Fix some typos
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCXu1jrRQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qoCMAP91nOccE3X+Nvc3zET3isDWnl1tWJxk
icsBgN/JwBRuTAD/dnWTHIWM2/5lTiagvyVsmINdJHP6JLr8T7dpN9tlxAQ=
=Cuo7
-----END PGP SIGNATURE-----
Merge tag 'trace-v5.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
- Have recordmcount work with > 64K sections (to support LTO)
- kprobe RCU fixes
- Correct a kprobe critical section with missing mutex
- Remove redundant arch_disarm_kprobe() call
- Fix lockup when kretprobe triggers within kprobe_flush_task()
- Fix memory leak in fetch_op_data operations
- Fix sleep in atomic in ftrace trace array sample code
- Free up memory on failure in sample trace array code
- Fix incorrect reporting of function_graph fields in format file
- Fix quote within quote parsing in bootconfig
- Fix return value of bootconfig tool
- Add testcases for bootconfig tool
- Fix maybe uninitialized warning in ftrace pid file code
- Remove unused variable in tracing_iter_reset()
- Fix some typos
* tag 'trace-v5.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
ftrace: Fix maybe-uninitialized compiler warning
tools/bootconfig: Add testcase for show-command and quotes test
tools/bootconfig: Fix to return 0 if succeeded to show the bootconfig
tools/bootconfig: Fix to use correct quotes for value
proc/bootconfig: Fix to use correct quotes for value
tracing: Remove unused event variable in tracing_iter_reset
tracing/probe: Fix memleak in fetch_op_data operations
trace: Fix typo in allocate_ftrace_ops()'s comment
tracing: Make ftrace packed events have align of 1
sample-trace-array: Remove trace_array 'sample-instance'
sample-trace-array: Fix sleeping function called from invalid context
kretprobe: Prevent triggering kretprobe from within kprobe_flush_task
kprobes: Remove redundant arch_disarm_kprobe() call
kprobes: Fix to protect kick_kprobe_optimizer() by kprobe_mutex
kprobes: Use non RCU traversal APIs on kprobe_tables if possible
kprobes: Suppress the suspicious RCU warning on kprobes
recordmcount: support >64k sections
The fileserver probe timer, net->fs_probe_timer, isn't cancelled when
the kafs module is being removed and so the count it holds on
net->servers_outstanding doesn't get dropped..
This causes rmmod to wait forever. The hung process shows a stack like:
afs_purge_servers+0x1b5/0x23c [kafs]
afs_net_exit+0x44/0x6e [kafs]
ops_exit_list+0x72/0x93
unregister_pernet_operations+0x14c/0x1ba
unregister_pernet_subsys+0x1d/0x2a
afs_exit+0x29/0x6f [kafs]
__do_sys_delete_module.isra.0+0x1a2/0x24b
do_syscall_64+0x51/0x95
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Fix this by:
(1) Attempting to cancel the probe timer and, if successful, drop the
count that the timer was holding.
(2) Make the timer function just drop the count and not schedule the
prober if the afs portion of net namespace is being destroyed.
Also, whilst we're at it, make the following changes:
(3) Initialise net->servers_outstanding to 1 and decrement it before
waiting on it so that it doesn't generate wake up events by being
decremented to 0 until we're cleaning up.
(4) Switch the atomic_dec() on ->servers_outstanding for ->fs_timer in
afs_purge_servers() to use the helper function for that.
Fixes: f6cbb368bc ("afs: Actively poll fileservers to maintain NAT or firewall openings")
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix afs_do_lookup()'s fallback case for when FS.InlineBulkStatus isn't
supported by the server.
In the fallback, it calls FS.FetchStatus for the specific vnode it's
meant to be looking up. Commit b6489a49f7 broke this by renaming one
of the two identically-named afs_fetch_status_operation descriptors to
something else so that one of them could be made non-static. The site
that used the renamed one, however, wasn't renamed and didn't produce
any warning because the other was declared in a header.
Fix this by making afs_do_lookup() use the renamed variant.
Note that there are two variants of the success method because one is
called from ->lookup() where we may or may not have an inode, but can't
call iget until after we've talked to the server - whereas the other is
called from within iget where we have an inode, but it may or may not be
initialised.
The latter variant expects there to be an inode, but because it's being
called from there former case, there might not be - resulting in an oops
like the following:
BUG: kernel NULL pointer dereference, address: 00000000000000b0
...
RIP: 0010:afs_fetch_status_success+0x27/0x7e
...
Call Trace:
afs_wait_for_operation+0xda/0x234
afs_do_lookup+0x2fe/0x3c1
afs_lookup+0x3c5/0x4bd
__lookup_slow+0xcd/0x10f
walk_component+0xa2/0x10c
path_lookupat.isra.0+0x80/0x110
filename_lookup+0x81/0x104
vfs_statx+0x76/0x109
__do_sys_newlstat+0x39/0x6b
do_syscall_64+0x4c/0x78
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Fixes: b6489a49f7 ("afs: Fix silly rename")
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl7s0e0QHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgppsQD/9ZD5Rrr1fRqZw29UMvjoKu53Plvc14GA7R
Tv44p/qF8KD7mkSY+YB3OIRh+NP7fHPXIJLotjZU9CFpgQtTjkiYbVCxPH5TpwZJ
YuEODrLuwuy3o5MU+a4t5uCBoGqDq5Fnz0Kh5kFfOl1D8qBzqczNzL0Ygn1FyoLd
LcCchC+kjROX6F+Oo+0onQeOFipSjxkG6LOThSiFsLJAL8huVLDem7ihon/HvqYw
68lAj2X0QlaIMzk0yKJ2LFovQRhk+nlWGtW6XzVCPetbEkFdmGOcEl13orwh/say
tkzKGN8O0JLThIxWqEQn1MHK9MeaKlnS9j2tFI3suR65xvjlxE8+cxsmlg/wNzGo
UyQgh1M8QvPNvDAXCL4q1k2QmCH0YwTY+pHqCIFDp37LRE6ZPboNWEV55YVB8VpL
axLPf89Any8ta3YICFq/Zmm03A/GUmLsxWspgbtOZMT40loNgZ3YoDR7cPfE76jU
N2XEZEVOQrvodXW6fjqfx6AAYraxhDo2gh4SZhF0ydXDgGTmK6BcHHu7a3lSp7+e
eKgYDgkMsa7bP5Cm3+VaNuv7db84kPEBeXLO5zJ86N/nKk8JE97Tl6uiVC8maBC2
r9ftsQd3fXkwDYmUk81EOjp2+YXY2zEt3vs8cP3euGt39qwjy4mdN05UQa5Xm5BS
XLWmpO2teA==
=tmwG
-----END PGP SIGNATURE-----
Merge tag 'io_uring-5.8-2020-06-19' of git://git.kernel.dk/linux-block
Pull io_uring fixes from Jens Axboe:
- Catch a case where io_sq_thread() didn't do proper mm acquire
- Ensure poll completions are reaped on shutdown
- Async cancelation and run fixes (Pavel)
- io-poll race fixes (Xiaoguang)
- Request cleanup race fix (Xiaoguang)
* tag 'io_uring-5.8-2020-06-19' of git://git.kernel.dk/linux-block:
io_uring: fix possible race condition against REQ_F_NEED_CLEANUP
io_uring: reap poll completions while waiting for refs to drop on exit
io_uring: acquire 'mm' for task_work for SQPOLL
io_uring: add memory barrier to synchronize io_kiocb's result and iopoll_completed
io_uring: don't fail links for EAGAIN error in IOPOLL mode
io_uring: cancel by ->task not pid
io_uring: lazy get task
io_uring: batch cancel in io_uring_cancel_files()
io_uring: cancel all task's requests on exit
io-wq: add an option to cancel all matched reqs
io-wq: reorder cancellation pending -> running
io_uring: fix lazy work init
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl7s0SAQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpp+YEACVqFvsfzxKCqa61IzyuOaPfnj9awyP+MY2
7V6y9sDDHL8sp6aPDbHvqFnqz0O7E+7nHVZD2rf2qc6tKKMvJYNO/BFZSXPvWTZV
KQ4cBChf/LDwqAKOnI4ZhmF5UcSyyob1yMy4uJ+U0gQiXXrRMbwJ3N1K24a9dr4c
epkzGavR0Q+PJ9BbUgjACjbRdT+vrP4bOu0cuyCGkIpD9eCerKJ6mFaUAj0FDthD
bg4BJj+c8Ij6LO0V++Wga6OxccmL43KeP0ky8B3x07PfAl+tDWqsbHSlU2YPtdcq
5nKgMMTW16mVnZeO2/W0JB7tn89VubsmyvIFcm2KNeeRqSnEZyW9HI8n4kq994Ju
xMH24lgbsU4trNeYkgOmzPoJJZ+LShkn+rnldyI1U/fhpEYub7DqfVySuT7ti9in
uFpQdeRUmPsdw92F3+o6h8OYAflpcQQ7CblkzxPEeV4OyzOZasb+S9tMNPe59KBh
0MtHv9IfzgtDihR6HuXifitXaP+GtH4x3D2z0dzEdooHKHC/+P3WycS5daG+3WKQ
xV5lJruvpTuxhXKLFAH0wRrxnVlB0VUvhQ21T3WgHrwF0btbdmQMHFc83XOxBIB4
jHWJMHGc4xp1ZdpWFBC8Cj79OmJh1w/ao8+/cf8SUoTB0LzFce1B8LvwnxgpcpUk
VjIOrl7zhQ==
=LeLd
-----END PGP SIGNATURE-----
Merge tag 'block-5.8-2020-06-19' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
- Use import_uuid() where appropriate (Andy)
- bcache fixes (Coly, Mauricio, Zhiqiang)
- blktrace sparse warnings fix (Jan)
- blktrace concurrent setup fix (Luis)
- blkdev_get use-after-free fix (Jason)
- Ensure all blk-mq maps are updated (Weiping)
- Loop invalidate bdev fix (Zheng)
* tag 'block-5.8-2020-06-19' of git://git.kernel.dk/linux-block:
block: make function 'kill_bdev' static
loop: replace kill_bdev with invalidate_bdev
partitions/ldm: Replace uuid_copy() with import_uuid() where it makes sense
block: update hctx map when use multiple maps
blktrace: Avoid sparse warnings when assigning q->blk_trace
blktrace: break out of blktrace setup on concurrent calls
block: Fix use-after-free in blkdev_get()
trace/events/block.h: drop kernel-doc for dropped function parameter
blk-mq: Remove redundant 'return' statement
bcache: pr_info() format clean up in bcache_device_init()
bcache: use delayed kworker fo asynchronous devices registration
bcache: check and adjust logical block size for backing devices
bcache: fix potential deadlock problem in btree_gc_coalesce
Use kfree() instead of kvfree() to free variables allocated
by match_strdup(). Because the memory is allocated with kmalloc
inside match_strdup().
Signed-off-by: Wang Xiaojun <wangxiaojun11@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Merge non-faulting memory access cleanups from Christoph Hellwig:
"Andrew and I decided to drop the patches implementing your suggested
rename of the probe_kernel_* and probe_user_* helpers from -mm as
there were way to many conflicts.
After -rc1 might be a good time for this as all the conflicts are
resolved now"
This also adds a type safety checking patch on top of the renaming
series to make the subtle behavioral difference between 'get_user()' and
'get_kernel_nofault()' less potentially dangerous and surprising.
* emailed patches from Christoph Hellwig <hch@lst.de>:
maccess: make get_kernel_nofault() check for minimal type compatibility
maccess: rename probe_kernel_address to get_kernel_nofault
maccess: rename probe_user_{read,write} to copy_{from,to}_user_nofault
maccess: rename probe_kernel_{read,write} to copy_{from,to}_kernel_nofault
Assume each section has 4 segment:
.___________________________.
|_Segment0_|_..._|_Segment3_|
. .
. .
.__________.
|_section0_|
Segment 0~2 has 0 valid block, segment 3 has 512 valid blocks.
It will fail if we want to gc section0 in this scenes,
because all 4 segments in section0 is not dirty.
So we should use dirty section bitmap instead of dirty segment bitmap
to get right victim section.
Signed-off-by: Jack Qiu <jack.qiu@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Under some condition, the __write_node_page will submit a page which is not
f2fs_in_warm_node_list and will not call f2fs_add_fsync_node_entry.
f2fs_gc continue to run to invoke f2fs_iget -> do_read_inode to read the same node page
and set code node, which make f2fs_in_warm_node_list become true,
that will cause f2fs_bug_on in f2fs_del_fsync_node_entry when f2fs_write_end_io called.
- f2fs_write_end_io
- f2fs_iget
- do_read_inode
- set_cold_node
recover cold node flag
- f2fs_in_warm_node_list
- is_cold_node
if node is cold, assume we have added
node to fsync_node_list during writepages()
- f2fs_del_fsync_node_entry
- f2fs_bug_on() due to node page
is not in fsync_node_list
[ 34.966133] Call trace:
[ 34.969902] f2fs_del_fsync_node_entry+0x100/0x108
[ 34.976071] f2fs_write_end_io+0x1e0/0x288
[ 34.981539] bio_endio+0x248/0x270
[ 34.986289] blk_update_request+0x2b0/0x4d8
[ 34.991841] scsi_end_request+0x40/0x440
[ 34.997126] scsi_io_completion+0xa4/0x748
[ 35.002593] scsi_finish_command+0xdc/0x110
[ 35.008143] scsi_softirq_done+0x118/0x150
[ 35.013610] blk_done_softirq+0x8c/0xe8
[ 35.018811] __do_softirq+0x2e8/0x578
[ 35.023828] irq_exit+0xfc/0x120
[ 35.028398] handle_IPI+0x1d8/0x330
[ 35.033233] gic_handle_irq+0x110/0x1d4
[ 35.038433] el1_irq+0xb4/0x130
[ 35.042917] kmem_cache_alloc+0x3f0/0x418
[ 35.048288] radix_tree_node_alloc+0x50/0xf8
[ 35.053933] __radix_tree_create+0xf8/0x188
[ 35.059484] __radix_tree_insert+0x3c/0x128
[ 35.065035] add_gc_inode+0x90/0x118
[ 35.069967] f2fs_gc+0x1b80/0x2d70
[ 35.074718] f2fs_disable_checkpoint+0x94/0x1d0
[ 35.080621] f2fs_fill_super+0x10c4/0x1b88
[ 35.086088] mount_bdev+0x194/0x1e0
[ 35.090923] f2fs_mount+0x40/0x50
[ 35.095589] mount_fs+0xb4/0x190
[ 35.100159] vfs_kern_mount+0x80/0x1d8
[ 35.105260] do_mount+0x478/0xf18
[ 35.109926] ksys_mount+0x90/0xd0
[ 35.114592] __arm64_sys_mount+0x24/0x38
Signed-off-by: Wuyun Zhao <zhaowuyun@wingtech.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Since offset < new_size, no need to do truncate_pagecache() again
with new_size.
Signed-off-by: Wei Fang <fangwei1@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Use kfree() instead of kvfree() to free super in read_raw_super_block()
because the memory is allocated with kzalloc() in the function.
Use kfree() instead of kvfree() to free sbi, raw_super in
f2fs_fill_super() and f2fs_put_super() because the memory is allocated
with kzalloc().
Signed-off-by: Denis Efremov <efremov@linux.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
kill_bdev does not have any external user, so make it static.
Signed-off-by: Zheng Bin <zhengbin13@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In io_read() or io_write(), when io request is submitted successfully,
it'll go through the below sequence:
kfree(iovec);
req->flags &= ~REQ_F_NEED_CLEANUP;
return ret;
But clearing REQ_F_NEED_CLEANUP might be unsafe. The io request may
already have been completed, and then io_complete_rw_iopoll()
and io_complete_rw() will be called, both of which will also modify
req->flags if needed. This causes a race condition, with concurrent
non-atomic modification of req->flags.
To eliminate this race, in io_read() or io_write(), if io request is
submitted successfully, we don't remove REQ_F_NEED_CLEANUP flag. If
REQ_F_NEED_CLEANUP is set, we'll leave __io_req_aux_free() to the
iovec cleanup work correspondingly.
Cc: stable@vger.kernel.org
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If we're doing polled IO and end up having requests being submitted
async, then completions can come in while we're waiting for refs to
drop. We need to reap these manually, as nobody else will be looking
for them.
Break the wait into 1/20th of a second time waits, and check for done
poll completions if we time out. Otherwise we can have done poll
completions sitting in ctx->poll_list, which needs us to reap them but
we're just waiting for them.
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If we're unlucky with timing, we could be running task_work after
having dropped the memory context in the sq thread. Since dropping
the context requires a runnable task state, we cannot reliably drop
it as part of our check-for-work loop in io_sq_thread(). Instead,
abstract out the mm acquire for the sq thread into a helper, and call
it from the async task work handler.
Cc: stable@vger.kernel.org # v5.7
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In io_complete_rw_iopoll(), stores to io_kiocb's result and iopoll
completed are two independent store operations, to ensure that once
iopoll_completed is ture and then req->result must been perceived by
the cpu executing io_do_iopoll(), proper memory barrier should be used.
And in io_do_iopoll(), we check whether req->result is EAGAIN, if it is,
we'll need to issue this io request using io-wq again. In order to just
issue a single smp_rmb() on the completion side, move the re-submit work
to io_iopoll_complete().
Cc: stable@vger.kernel.org
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
[axboe: don't set ->iopoll_completed for -EAGAIN retry]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In IOPOLL mode, for EAGAIN error, we'll try to submit io request
again using io-wq, so don't fail rest of links if this io request
has links.
Cc: stable@vger.kernel.org
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The server is failing to apply the umask when creating new objects on
filesystems without ACL support.
To reproduce this, you need to use NFSv4.2 and a client and server
recent enough to support umask, and you need to export a filesystem that
lacks ACL support (for example, ext4 with the "noacl" mount option).
Filesystems with ACL support are expected to take care of the umask
themselves (usually by calling posix_acl_create).
For filesystems without ACL support, this is up to the caller of
vfs_create(), vfs_mknod(), or vfs_mkdir().
Reported-by: Elliott Mitchell <ehem+debian@m5p.com>
Reported-by: Salvatore Bonaccorso <carnil@debian.org>
Tested-by: Salvatore Bonaccorso <carnil@debian.org>
Fixes: 47057abde5 ("nfsd: add support for the umask attribute")
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Fix /proc/bootconfig to select double or single quotes
corrctly according to the value.
If a bootconfig value includes a double quote character,
we must use single-quotes to quote that value.
This modifies if() condition and blocks for avoiding
double-quote in value check in 2 places. Anyway, since
xbc_array_for_each_value() can handle the array which
has a single node correctly.
Thus,
if (vnode && xbc_node_is_array(vnode)) {
xbc_array_for_each_value(vnode) /* vnode->next != NULL */
...
} else {
snprintf(val); /* val is an empty string if !vnode */
}
is equivalent to
if (vnode) {
xbc_array_for_each_value(vnode) /* vnode->next can be NULL */
...
} else {
snprintf(""); /* value is always empty */
}
Link: http://lkml.kernel.org/r/159230244786.65555.3763894451251622488.stgit@devnote2
Cc: stable@vger.kernel.org
Fixes: c1a3c36017 ("proc: bootconfig: Add /proc/bootconfig to show boot config list")
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEqG5UsNXhtOCrfGQP+7dXa6fLC2sFAl7pMu8ACgkQ+7dXa6fL
C2sgNRAAnOCq281ojebwVSIkRDVGlxBODNeJtcgOOC4ib3jZM++vhdnnJgJIr8kc
UOQ+LF4E5hNgwELubCrLOx/AjIzVuzfrreFNOPh3P3TSjyxW/7AU+tFGkdnLkYun
NyadOXxI9Dk84UBN1LrmRm3ccAbF6nDf/KcPykS0oAEh12LVm6sDpVJz9+1uclnK
Xq0rgl+zrR0+SPplPYz4P/OEPTgNfpLV9DHVYfkvsvEhwb/TaUmiLj9SEgndp+fg
L3CT66QXoG9zds9hYFVODQM3devaXOpGNU0vsc9+Xg57BWuYvVed24eH5oBrcBQo
F5kon+mcZlHtmTG87UJ6vFUwfHGeYqKKRb9XTbKbATtIWvkB3XM4Jz/XUlaAIE+R
y0njNYEoIn4wHkleL/KeHmFPFSYG7pZpAN3wqhXZ9wVptXRDSB10OK3vpgLD/2rM
V68FmBin6eStE5qZ8Mu9qMQxXb1buknoef37FIXUozjc+VMPrg5dbG6GjcW/CqIC
LynaNUvrQOvF0ZFVzMt7ffZPrdDYlqqzyN0bReMdibys4BPKo24gSr5aVMLt7YXf
ZaJeApcSdsphs4uUmtHKlHYgUQrSEl9pSGmc4hcq9bNIKHo9S618LG9uuUplOjdP
j0L8N6uWBHQCjAvu6kDm8Wp5pRPPUnTgaXDsok7yP2GLRqBEm3Q=
=bYOZ
-----END PGP SIGNATURE-----
Merge tag 'afs-fixes-20200616' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
Pull AFS fixes from David Howells:
"I've managed to get xfstests kind of working with afs. Here are a set
of patches that fix most of the bugs found.
There are a number of primary issues:
- Incorrect handling of mtime and non-handling of ctime. It might be
argued, that the latter isn't a bug since the AFS protocol doesn't
support ctime, but I should probably still update it locally.
- Shared-write mmap, truncate and writeback bugs. This includes not
changing i_size under the callback lock, overwriting local i_size
with the reply from the server after a partial writeback, not
limiting the writeback from an mmapped page to EOF.
- Checks for an abort code indicating that the primary vnode in an
operation was deleted by a third-party are done in the wrong place.
- Silly rename bugs. This includes an incomplete conversion to the
new operation handling, duplicate nlink handling, nlink changing
not being done inside the callback lock and insufficient handling
of third-party conflicting directory changes.
And some secondary ones:
- The UAEOVERFLOW abort code should map to EOVERFLOW not EREMOTEIO.
- Remove a couple of unused or incompletely used bits.
- Remove a couple of redundant success checks.
These seem to fix all the data-corruption bugs found by
./check -afs -g quick
along with the obvious silly rename bugs and time bugs.
There are still some test failures, but they seem to fall into two
classes: firstly, the authentication/security model is different to
the standard UNIX model and permission is arbitrated by the server and
cached locally; and secondly, there are a number of features that AFS
does not support (such as mknod). But in these cases, the tests
themselves need to be adapted or skipped.
Using the in-kernel afs client with xfstests also found a bug in the
AuriStor AFS server that has been fixed for a future release"
* tag 'afs-fixes-20200616' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
afs: Fix silly rename
afs: afs_vnode_commit_status() doesn't need to check the RPC error
afs: Fix use of afs_check_for_remote_deletion()
afs: Remove afs_operation::abort_code
afs: Fix yfs_fs_fetch_status() to honour vnode selector
afs: Remove yfs_fs_fetch_file_status() as it's not used
afs: Fix the mapping of the UAEOVERFLOW abort code
afs: Fix truncation issues and mmap writeback size
afs: Concoct ctimes
afs: Fix EOF corruption
afs: afs_write_end() should change i_size under the right lock
afs: Fix non-setting of mtime when writing into mmap
Hi Linus,
Please, pull the following patches that replace zero-length arrays with
flexible-array members.
Notice that all of these patches have been baking in linux-next for
two development cycles now.
There is a regular need in the kernel to provide a way to declare having a
dynamically sized set of trailing elements in a structure. Kernel code should
always use “flexible array members”[1] for these cases. The older style of
one-element or zero-length arrays should no longer be used[2].
C99 introduced “flexible array members”, which lacks a numeric size for the
array declaration entirely:
struct something {
size_t count;
struct foo items[];
};
This is the way the kernel expects dynamically sized trailing elements to be
declared. It allows the compiler to generate errors when the flexible array
does not occur last in the structure, which helps to prevent some kind of
undefined behavior[3] bugs from being inadvertently introduced to the codebase.
It also allows the compiler to correctly analyze array sizes (via sizeof(),
CONFIG_FORTIFY_SOURCE, and CONFIG_UBSAN_BOUNDS). For instance, there is no
mechanism that warns us that the following application of the sizeof() operator
to a zero-length array always results in zero:
struct something {
size_t count;
struct foo items[0];
};
struct something *instance;
instance = kmalloc(struct_size(instance, items, count), GFP_KERNEL);
instance->count = count;
size = sizeof(instance->items) * instance->count;
memcpy(instance->items, source, size);
At the last line of code above, size turns out to be zero, when one might have
thought it represents the total size in bytes of the dynamic memory recently
allocated for the trailing array items. Here are a couple examples of this
issue[4][5]. Instead, flexible array members have incomplete type, and so the
sizeof() operator may not be applied[6], so any misuse of such operators will
be immediately noticed at build time.
The cleanest and least error-prone way to implement this is through the use of
a flexible array member:
struct something {
size_t count;
struct foo items[];
};
struct something *instance;
instance = kmalloc(struct_size(instance, items, count), GFP_KERNEL);
instance->count = count;
size = sizeof(instance->items[0]) * instance->count;
memcpy(instance->items, source, size);
Thanks
--
Gustavo
[1] https://en.wikipedia.org/wiki/Flexible_array_member
[2] https://github.com/KSPP/linux/issues/21
[3] https://git.kernel.org/linus/76497732932f15e7323dc805e8ea8dc11bb587cf
[4] https://git.kernel.org/linus/f2cd32a443da694ac4e28fbf4ac6f9d5cc63a539
[5] https://git.kernel.org/linus/ab91c2a89f86be2898cee208d492816ec238b2cf
[6] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEkmRahXBSurMIg1YvRwW0y0cG2zEFAl7oSmYACgkQRwW0y0cG
2zGEiw/9FiH3MBwMlPVJPcneY1wCH/N6ZSf+kr7SJiVwV/YbBe9EWuaKZ0D4vAWm
kTACkOfsZ1me1OKz9wNrOxn0zezTMFQK2PLPgzKIPuK0Hg8MW1EU63RIRsnr0bPc
b90wZwyBQtLbGRC3/9yAACKwFZe/SeYoV5rr8uylffA35HZW3SZbTex6XnGCF9Q5
UYwnz7vNg+9VH1GRQeB5jlqL7mAoRzJ49I/TL3zJr04Mn+xC+vVBS7XwipDd03p+
foC6/KmGhlCO9HMPASReGrOYNPydDAMKLNPdIfUlcTKHWsoTjGOcW/dzfT4rUu6n
nKr5rIqJ4FdlIvXZL5P5w7Uhkwbd3mus5G0HBk+V/cUScckCpBou+yuGzjxXSitQ
o0qPsGjWr3v+gxRWHj8YO/9MhKKKW0Iy+QmAC9+uLnbfJdbUwYbLIXbsOKnokCA8
jkDEr64F5hFTKtajIK4VToJK1CsM3D9dwTub27lwZysHn3RYSQdcyN+9OiZgdzpc
GlI6QoaqKR9AT4b/eBmqlQAKgA07zSQ5RsIjRm6hN3d7u/77x2kyrreo+trJyVY2
F17uEOzfTqZyxtkPayE8DVjTtbByoCuBR0Vm1oMAFxjyqZQY5daalB0DKd1mdYqi
khIXqNAuYqHOb898fEuzidjV38hxZ9y8SAym3P7WnYl+Hxz+8Jo=
=8HUQ
-----END PGP SIGNATURE-----
Merge tag 'flex-array-conversions-5.8-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux
Pull flexible-array member conversions from Gustavo A. R. Silva:
"Replace zero-length arrays with flexible-array members.
Notice that all of these patches have been baking in linux-next for
two development cycles now.
There is a regular need in the kernel to provide a way to declare
having a dynamically sized set of trailing elements in a structure.
Kernel code should always use “flexible array members”[1] for these
cases. The older style of one-element or zero-length arrays should no
longer be used[2].
C99 introduced “flexible array members”, which lacks a numeric size
for the array declaration entirely:
struct something {
size_t count;
struct foo items[];
};
This is the way the kernel expects dynamically sized trailing elements
to be declared. It allows the compiler to generate errors when the
flexible array does not occur last in the structure, which helps to
prevent some kind of undefined behavior[3] bugs from being
inadvertently introduced to the codebase.
It also allows the compiler to correctly analyze array sizes (via
sizeof(), CONFIG_FORTIFY_SOURCE, and CONFIG_UBSAN_BOUNDS). For
instance, there is no mechanism that warns us that the following
application of the sizeof() operator to a zero-length array always
results in zero:
struct something {
size_t count;
struct foo items[0];
};
struct something *instance;
instance = kmalloc(struct_size(instance, items, count), GFP_KERNEL);
instance->count = count;
size = sizeof(instance->items) * instance->count;
memcpy(instance->items, source, size);
At the last line of code above, size turns out to be zero, when one
might have thought it represents the total size in bytes of the
dynamic memory recently allocated for the trailing array items. Here
are a couple examples of this issue[4][5].
Instead, flexible array members have incomplete type, and so the
sizeof() operator may not be applied[6], so any misuse of such
operators will be immediately noticed at build time.
The cleanest and least error-prone way to implement this is through
the use of a flexible array member:
struct something {
size_t count;
struct foo items[];
};
struct something *instance;
instance = kmalloc(struct_size(instance, items, count), GFP_KERNEL);
instance->count = count;
size = sizeof(instance->items[0]) * instance->count;
memcpy(instance->items, source, size);
instead"
[1] https://en.wikipedia.org/wiki/Flexible_array_member
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")
[4] commit f2cd32a443 ("rndis_wlan: Remove logically dead code")
[5] commit ab91c2a89f ("tpm: eventlog: Replace zero-length array with flexible-array member")
[6] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
* tag 'flex-array-conversions-5.8-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux: (41 commits)
w1: Replace zero-length array with flexible-array
tracing/probe: Replace zero-length array with flexible-array
soc: ti: Replace zero-length array with flexible-array
tifm: Replace zero-length array with flexible-array
dmaengine: tegra-apb: Replace zero-length array with flexible-array
stm class: Replace zero-length array with flexible-array
Squashfs: Replace zero-length array with flexible-array
ASoC: SOF: Replace zero-length array with flexible-array
ima: Replace zero-length array with flexible-array
sctp: Replace zero-length array with flexible-array
phy: samsung: Replace zero-length array with flexible-array
RxRPC: Replace zero-length array with flexible-array
rapidio: Replace zero-length array with flexible-array
media: pwc: Replace zero-length array with flexible-array
firmware: pcdp: Replace zero-length array with flexible-array
oprofile: Replace zero-length array with flexible-array
block: Replace zero-length array with flexible-array
tools/testing/nvdimm: Replace zero-length array with flexible-array
libata: Replace zero-length array with flexible-array
kprobes: Replace zero-length array with flexible-array
...
One of the use-cases of close_range() is to drop file descriptors just before
execve(). This would usually be expressed in the sequence:
unshare(CLONE_FILES);
close_range(3, ~0U);
as pointed out by Linus it might be desirable to have this be a part of
close_range() itself under a new flag CLOSE_RANGE_UNSHARE.
This expands {dup,unshare)_fd() to take a max_fds argument that indicates the
maximum number of file descriptors to copy from the old struct files. When the
user requests that all file descriptors are supposed to be closed via
close_range(min, max) then we can cap via unshare_fd(min) and hence don't need
to do any of the heavy fput() work for everything above min.
The patch makes it so that if CLOSE_RANGE_UNSHARE is requested and we do in
fact currently share our file descriptor table we create a new private copy.
We then close all fds in the requested range and finally after we're done we
install the new fd table.
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
This adds the close_range() syscall. It allows to efficiently close a range
of file descriptors up to all file descriptors of a calling task.
I was contacted by FreeBSD as they wanted to have the same close_range()
syscall as we proposed here. We've coordinated this and in the meantime, Kyle
was fast enough to merge close_range() into FreeBSD already in April:
https://reviews.freebsd.org/D21627https://svnweb.freebsd.org/base?view=revision&revision=359836
and the current plan is to backport close_range() to FreeBSD 12.2 (cf. [2])
once its merged in Linux too. Python is in the process of switching to
close_range() on FreeBSD and they are waiting on us to merge this to switch on
Linux as well: https://bugs.python.org/issue38061
The syscall came up in a recent discussion around the new mount API and
making new file descriptor types cloexec by default. During this
discussion, Al suggested the close_range() syscall (cf. [1]). Note, a
syscall in this manner has been requested by various people over time.
First, it helps to close all file descriptors of an exec()ing task. This
can be done safely via (quoting Al's example from [1] verbatim):
/* that exec is sensitive */
unshare(CLONE_FILES);
/* we don't want anything past stderr here */
close_range(3, ~0U);
execve(....);
The code snippet above is one way of working around the problem that file
descriptors are not cloexec by default. This is aggravated by the fact that
we can't just switch them over without massively regressing userspace. For
a whole class of programs having an in-kernel method of closing all file
descriptors is very helpful (e.g. demons, service managers, programming
language standard libraries, container managers etc.).
(Please note, unshare(CLONE_FILES) should only be needed if the calling
task is multi-threaded and shares the file descriptor table with another
thread in which case two threads could race with one thread allocating file
descriptors and the other one closing them via close_range(). For the
general case close_range() before the execve() is sufficient.)
Second, it allows userspace to avoid implementing closing all file
descriptors by parsing through /proc/<pid>/fd/* and calling close() on each
file descriptor. From looking at various large(ish) userspace code bases
this or similar patterns are very common in:
- service managers (cf. [4])
- libcs (cf. [6])
- container runtimes (cf. [5])
- programming language runtimes/standard libraries
- Python (cf. [2])
- Rust (cf. [7], [8])
As Dmitry pointed out there's even a long-standing glibc bug about missing
kernel support for this task (cf. [3]).
In addition, the syscall will also work for tasks that do not have procfs
mounted and on kernels that do not have procfs support compiled in. In such
situations the only way to make sure that all file descriptors are closed
is to call close() on each file descriptor up to UINT_MAX or RLIMIT_NOFILE,
OPEN_MAX trickery (cf. comment [8] on Rust).
The performance is striking. For good measure, comparing the following
simple close_all_fds() userspace implementation that is essentially just
glibc's version in [6]:
static int close_all_fds(void)
{
int dir_fd;
DIR *dir;
struct dirent *direntp;
dir = opendir("/proc/self/fd");
if (!dir)
return -1;
dir_fd = dirfd(dir);
while ((direntp = readdir(dir))) {
int fd;
if (strcmp(direntp->d_name, ".") == 0)
continue;
if (strcmp(direntp->d_name, "..") == 0)
continue;
fd = atoi(direntp->d_name);
if (fd == dir_fd || fd == 0 || fd == 1 || fd == 2)
continue;
close(fd);
}
closedir(dir);
return 0;
}
to close_range() yields:
1. closing 4 open files:
- close_all_fds(): ~280 us
- close_range(): ~24 us
2. closing 1000 open files:
- close_all_fds(): ~5000 us
- close_range(): ~800 us
close_range() is designed to allow for some flexibility. Specifically, it
does not simply always close all open file descriptors of a task. Instead,
callers can specify an upper bound.
This is e.g. useful for scenarios where specific file descriptors are
created with well-known numbers that are supposed to be excluded from
getting closed.
For extra paranoia close_range() comes with a flags argument. This can e.g.
be used to implement extension. Once can imagine userspace wanting to stop
at the first error instead of ignoring errors under certain circumstances.
There might be other valid ideas in the future. In any case, a flag
argument doesn't hurt and keeps us on the safe side.
From an implementation side this is kept rather dumb. It saw some input
from David and Jann but all nonsense is obviously my own!
- Errors to close file descriptors are currently ignored. (Could be changed
by setting a flag in the future if needed.)
- __close_range() is a rather simplistic wrapper around __close_fd().
My reasoning behind this is based on the nature of how __close_fd() needs
to release an fd. But maybe I misunderstood specifics:
We take the files_lock and rcu-dereference the fdtable of the calling
task, we find the entry in the fdtable, get the file and need to release
files_lock before calling filp_close().
In the meantime the fdtable might have been altered so we can't just
retake the spinlock and keep the old rcu-reference of the fdtable
around. Instead we need to grab a fresh reference to the fdtable.
If my reasoning is correct then there's really no point in fancyfying
__close_range(): We just need to rcu-dereference the fdtable of the
calling task once to cap the max_fd value correctly and then go on
calling __close_fd() in a loop.
/* References */
[1]: https://lore.kernel.org/lkml/20190516165021.GD17978@ZenIV.linux.org.uk/
[2]: 9e4f2f3a6b/Modules/_posixsubprocess.c (L220)
[3]: https://sourceware.org/bugzilla/show_bug.cgi?id=10353#c7
[4]: 5238e95759/src/basic/fd-util.c (L217)
[5]: ddf4b77e11/src/lxc/start.c (L236)
[6]: https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/grantpt.c;h=2030e07fa6e652aac32c775b8c6e005844c3c4eb;hb=HEAD#l17
Note that this is an internal implementation that is not exported.
Currently, libc seems to not provide an exported version of this
because of missing kernel support to do this.
Note, in a recent patch series Florian made grantpt() a nop thereby
removing the code referenced here.
[7]: https://github.com/rust-lang/rust/issues/12148
[8]: 5f47c0613e/src/libstd/sys/unix/process2.rs (L303-L308)
Rust's solution is slightly different but is equally unperformant.
Rust calls getdtablesize() which is a glibc library function that
simply returns the current RLIMIT_NOFILE or OPEN_MAX values. Rust then
goes on to call close() on each fd. That's obviously overkill for most
tasks. Rarely, tasks - especially non-demons - hit RLIMIT_NOFILE or
OPEN_MAX.
Let's be nice and assume an unprivileged user with RLIMIT_NOFILE set
to 1024. Even in this case, there's a very high chance that in the
common case Rust is calling the close() syscall 1021 times pointlessly
if the task just has 0, 1, and 2 open.
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Kyle Evans <self@kyle-evans.net>
Cc: Jann Horn <jannh@google.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Dmitry V. Levin <ldv@altlinux.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: linux-api@vger.kernel.org
Fix AFS's silly rename by the following means:
(1) Set the destination directory in afs_do_silly_rename() so as to avoid
misbehaviour and indicate that the directory data version will
increment by 1 so as to avoid warnings about unexpected changes in the
DV. Also indicate that the ctime should be updated to avoid xfstest
grumbling.
(2) Note when the server indicates that a directory changed more than we
expected (AFS_OPERATION_DIR_CONFLICT), indicating a conflict with a
third party change, checking on successful completion of unlink and
rename.
The problem is that the FS.RemoveFile RPC op doesn't report the status
of the unlinked file, though YFS.RemoveFile2 does. This can be
mitigated by the assumption that if the directory DV cranked by
exactly 1, we can be sure we removed one link from the file; further,
ordinarily in AFS, files cannot be hardlinked across directories, so
if we reduce nlink to 0, the file is deleted.
However, if the directory DV jumps by more than 1, we cannot know if a
third party intervened by adding or removing a link on the file we
just removed a link from.
The same also goes for any vnode that is at the destination of the
FS.Rename RPC op.
(3) Make afs_vnode_commit_status() apply the nlink drop inside the cb_lock
section along with the other attribute updates if ->op_unlinked is set
on the descriptor for the appropriate vnode.
(4) Issue a follow up status fetch to the unlinked file in the event of a
third party conflict that makes it impossible for us to know if we
actually deleted the file or not.
(5) Provide a flag, AFS_VNODE_SILLY_DELETED, to make afs_getattr() lie to
the user about the nlink of a silly deleted file so that it appears as
0, not 1.
Found with the generic/035 and generic/084 xfstests.
Fixes: e49c7b2f6d ("afs: Build an abstraction around an "operation" concept")
Reported-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
In btrfs_ioctl_get_subvol_info(), there is a classic case where kzalloc()
was incorrectly paired with kzfree(). According to David Sterba, there
isn't any sensitive information in the subvol_info that needs to be
cleared before freeing. So kzfree() isn't really needed, use kfree()
instead.
Signed-off-by: Waiman Long <longman@redhat.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
A RWF_NOWAIT write is not supposed to wait on filesystem locks that can be
held for a long time or for ongoing IO to complete.
However when calling check_can_nocow(), if the inode has prealloc extents
or has the NOCOW flag set, we can block on extent (file range) locks
through the call to btrfs_lock_and_flush_ordered_range(). Such lock can
take a significant amount of time to be available. For example, a fiemap
task may be running, and iterating through the entire file range checking
all extents and doing backref walking to determine if they are shared,
or a readpage operation may be in progress.
Also at btrfs_lock_and_flush_ordered_range(), called by check_can_nocow(),
after locking the file range we wait for any existing ordered extent that
is in progress to complete. Another operation that can take a significant
amount of time and defeat the purpose of RWF_NOWAIT.
So fix this by trying to lock the file range and if it's currently locked
return -EAGAIN to user space. If we are able to lock the file range without
waiting and there is an ordered extent in the range, return -EAGAIN as
well, instead of waiting for it to complete. Finally, don't bother trying
to lock the snapshot lock of the root when attempting a RWF_NOWAIT write,
as that is only important for buffered writes.
Fixes: edf064e7c6 ("btrfs: nowait aio support")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we attempt to do a RWF_NOWAIT write against a file range for which we
can only do NOCOW for a part of it, due to the existence of holes or
shared extents for example, we proceed with the write as if it were
possible to NOCOW the whole range.
Example:
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ touch /mnt/sdj/bar
$ chattr +C /mnt/sdj/bar
$ xfs_io -d -c "pwrite -S 0xab -b 256K 0 256K" /mnt/bar
wrote 262144/262144 bytes at offset 0
256 KiB, 1 ops; 0.0003 sec (694.444 MiB/sec and 2777.7778 ops/sec)
$ xfs_io -c "fpunch 64K 64K" /mnt/bar
$ sync
$ xfs_io -d -c "pwrite -N -V 1 -b 128K -S 0xfe 0 128K" /mnt/bar
wrote 131072/131072 bytes at offset 0
128 KiB, 1 ops; 0.0007 sec (160.051 MiB/sec and 1280.4097 ops/sec)
This last write should fail with -EAGAIN since the file range from 64K to
128K is a hole. On xfs it fails, as expected, but on ext4 it currently
succeeds because apparently it is expensive to check if there are extents
allocated for the whole range, but I'll check with the ext4 people.
Fix the issue by checking if check_can_nocow() returns a number of
NOCOW'able bytes smaller then the requested number of bytes, and if it
does return -EAGAIN.
Fixes: edf064e7c6 ("btrfs: nowait aio support")
CC: stable@vger.kernel.org # 4.14+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we attempt to write to prealloc extent located after eof using a
RWF_NOWAIT write, we always fail with -EAGAIN.
We do actually check if we have an allocated extent for the write at
the start of btrfs_file_write_iter() through a call to check_can_nocow(),
but later when we go into the actual direct IO write path we simply
return -EAGAIN if the write starts at or beyond EOF.
Trivial to reproduce:
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ touch /mnt/foo
$ chattr +C /mnt/foo
$ xfs_io -d -c "pwrite -S 0xab 0 64K" /mnt/foo
wrote 65536/65536 bytes at offset 0
64 KiB, 16 ops; 0.0004 sec (135.575 MiB/sec and 34707.1584 ops/sec)
$ xfs_io -c "falloc -k 64K 1M" /mnt/foo
$ xfs_io -d -c "pwrite -N -V 1 -S 0xfe -b 64K 64K 64K" /mnt/foo
pwrite: Resource temporarily unavailable
On xfs and ext4 the write succeeds, as expected.
Fix this by removing the wrong check at btrfs_direct_IO().
Fixes: edf064e7c6 ("btrfs: nowait aio support")
CC: stable@vger.kernel.org # 4.14+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we do a successful RWF_NOWAIT write we end up locking the snapshot lock
of the inode, through a call to check_can_nocow(), but we never unlock it.
This means the next attempt to create a snapshot on the subvolume will
hang forever.
Trivial reproducer:
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ touch /mnt/foobar
$ chattr +C /mnt/foobar
$ xfs_io -d -c "pwrite -S 0xab 0 64K" /mnt/foobar
$ xfs_io -d -c "pwrite -N -V 1 -S 0xfe 0 64K" /mnt/foobar
$ btrfs subvolume snapshot -r /mnt /mnt/snap
--> hangs
Fix this by unlocking the snapshot lock if check_can_nocow() returned
success.
Fixes: edf064e7c6 ("btrfs: nowait aio support")
CC: stable@vger.kernel.org # 4.14+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This brings back an optimization that commit e678934cbe ("btrfs:
Remove unnecessary check from join_running_log_trans") removed, but in
a different form. So it's almost equivalent to a revert.
That commit removed an optimization where we avoid locking a root's
log_mutex when there is no log tree created in the current transaction.
The affected code path is triggered through unlink operations.
That commit was based on the assumption that the optimization was not
necessary because we used to have the following checks when the patch
was authored:
int btrfs_del_dir_entries_in_log(...)
{
(...)
if (dir->logged_trans < trans->transid)
return 0;
ret = join_running_log_trans(root);
(...)
}
int btrfs_del_inode_ref_in_log(...)
{
(...)
if (inode->logged_trans < trans->transid)
return 0;
ret = join_running_log_trans(root);
(...)
}
However before that patch was merged, another patch was merged first which
replaced those checks because they were buggy.
That other patch corresponds to commit 803f0f64d1 ("Btrfs: fix fsync
not persisting dentry deletions due to inode evictions"). The assumption
that if the logged_trans field of an inode had a smaller value then the
current transaction's generation (transid) meant that the inode was not
logged in the current transaction was only correct if the inode was not
evicted and reloaded in the current transaction. So the corresponding bug
fix changed those checks and replaced them with the following helper
function:
static bool inode_logged(struct btrfs_trans_handle *trans,
struct btrfs_inode *inode)
{
if (inode->logged_trans == trans->transid)
return true;
if (inode->last_trans == trans->transid &&
test_bit(BTRFS_INODE_NEEDS_FULL_SYNC, &inode->runtime_flags) &&
!test_bit(BTRFS_FS_LOG_RECOVERING, &trans->fs_info->flags))
return true;
return false;
}
So if we have a subvolume without a log tree in the current transaction
(because we had no fsyncs), every time we unlink an inode we can end up
trying to lock the log_mutex of the root through join_running_log_trans()
twice, once for the inode being unlinked (by btrfs_del_inode_ref_in_log())
and once for the parent directory (with btrfs_del_dir_entries_in_log()).
This means if we have several unlink operations happening in parallel for
inodes in the same subvolume, and the those inodes and/or their parent
inode were changed in the current transaction, we end up having a lot of
contention on the log_mutex.
The test robots from intel reported a -30.7% performance regression for
a REAIM test after commit e678934cbe ("btrfs: Remove unnecessary check
from join_running_log_trans").
So just bring back the optimization to join_running_log_trans() where we
check first if a log root exists before trying to lock the log_mutex. This
is done by checking for a bit that is set on the root when a log tree is
created and removed when a log tree is freed (at transaction commit time).
Commit e678934cbe ("btrfs: Remove unnecessary check from
join_running_log_trans") was merged in the 5.4 merge window while commit
803f0f64d1 ("Btrfs: fix fsync not persisting dentry deletions due to
inode evictions") was merged in the 5.3 merge window. But the first
commit was actually authored before the second commit (May 23 2019 vs
June 19 2019).
Reported-by: kernel test robot <rong.a.chen@intel.com>
Link: https://lore.kernel.org/lkml/20200611090233.GL12456@shao2-debian/
Fixes: e678934cbe ("btrfs: Remove unnecessary check from join_running_log_trans")
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When balance and scrub are running in parallel it is possible to end up
with an underflow of the bytes_may_use counter of the data space_info
object, which triggers a warning like the following:
[134243.793196] BTRFS info (device sdc): relocating block group 1104150528 flags data
[134243.806891] ------------[ cut here ]------------
[134243.807561] WARNING: CPU: 1 PID: 26884 at fs/btrfs/space-info.h:125 btrfs_add_reserved_bytes+0x1da/0x280 [btrfs]
[134243.808819] Modules linked in: btrfs blake2b_generic xor (...)
[134243.815779] CPU: 1 PID: 26884 Comm: kworker/u8:8 Tainted: G W 5.6.0-rc7-btrfs-next-58 #5
[134243.816944] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
[134243.818389] Workqueue: writeback wb_workfn (flush-btrfs-108483)
[134243.819186] RIP: 0010:btrfs_add_reserved_bytes+0x1da/0x280 [btrfs]
[134243.819963] Code: 0b f2 85 (...)
[134243.822271] RSP: 0018:ffffa4160aae7510 EFLAGS: 00010287
[134243.822929] RAX: 000000000000c000 RBX: ffff96159a8c1000 RCX: 0000000000000000
[134243.823816] RDX: 0000000000008000 RSI: 0000000000000000 RDI: ffff96158067a810
[134243.824742] RBP: ffff96158067a800 R08: 0000000000000001 R09: 0000000000000000
[134243.825636] R10: ffff961501432a40 R11: 0000000000000000 R12: 000000000000c000
[134243.826532] R13: 0000000000000001 R14: ffffffffffff4000 R15: ffff96158067a810
[134243.827432] FS: 0000000000000000(0000) GS:ffff9615baa00000(0000) knlGS:0000000000000000
[134243.828451] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[134243.829184] CR2: 000055bd7e414000 CR3: 00000001077be004 CR4: 00000000003606e0
[134243.830083] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[134243.830975] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[134243.831867] Call Trace:
[134243.832211] find_free_extent+0x4a0/0x16c0 [btrfs]
[134243.832846] btrfs_reserve_extent+0x91/0x180 [btrfs]
[134243.833487] cow_file_range+0x12d/0x490 [btrfs]
[134243.834080] fallback_to_cow+0x82/0x1b0 [btrfs]
[134243.834689] ? release_extent_buffer+0x121/0x170 [btrfs]
[134243.835370] run_delalloc_nocow+0x33f/0xa30 [btrfs]
[134243.836032] btrfs_run_delalloc_range+0x1ea/0x6d0 [btrfs]
[134243.836725] ? find_lock_delalloc_range+0x221/0x250 [btrfs]
[134243.837450] writepage_delalloc+0xe8/0x150 [btrfs]
[134243.838059] __extent_writepage+0xe8/0x4c0 [btrfs]
[134243.838674] extent_write_cache_pages+0x237/0x530 [btrfs]
[134243.839364] extent_writepages+0x44/0xa0 [btrfs]
[134243.839946] do_writepages+0x23/0x80
[134243.840401] __writeback_single_inode+0x59/0x700
[134243.841006] writeback_sb_inodes+0x267/0x5f0
[134243.841548] __writeback_inodes_wb+0x87/0xe0
[134243.842091] wb_writeback+0x382/0x590
[134243.842574] ? wb_workfn+0x4a2/0x6c0
[134243.843030] wb_workfn+0x4a2/0x6c0
[134243.843468] process_one_work+0x26d/0x6a0
[134243.843978] worker_thread+0x4f/0x3e0
[134243.844452] ? process_one_work+0x6a0/0x6a0
[134243.844981] kthread+0x103/0x140
[134243.845400] ? kthread_create_worker_on_cpu+0x70/0x70
[134243.846030] ret_from_fork+0x3a/0x50
[134243.846494] irq event stamp: 0
[134243.846892] hardirqs last enabled at (0): [<0000000000000000>] 0x0
[134243.847682] hardirqs last disabled at (0): [<ffffffffb2abdedf>] copy_process+0x74f/0x2020
[134243.848687] softirqs last enabled at (0): [<ffffffffb2abdedf>] copy_process+0x74f/0x2020
[134243.849913] softirqs last disabled at (0): [<0000000000000000>] 0x0
[134243.850698] ---[ end trace bd7c03622e0b0a96 ]---
[134243.851335] ------------[ cut here ]------------
When relocating a data block group, for each extent allocated in the
block group we preallocate another extent with the same size for the
data relocation inode (we do it at prealloc_file_extent_cluster()).
We reserve space by calling btrfs_check_data_free_space(), which ends
up incrementing the data space_info's bytes_may_use counter, and
then call btrfs_prealloc_file_range() to allocate the extent, which
always decrements the bytes_may_use counter by the same amount.
The expectation is that writeback of the data relocation inode always
follows a NOCOW path, by writing into the preallocated extents. However,
when starting writeback we might end up falling back into the COW path,
because the block group that contains the preallocated extent was turned
into RO mode by a scrub running in parallel. The COW path then calls the
extent allocator which ends up calling btrfs_add_reserved_bytes(), and
this function decrements the bytes_may_use counter of the data space_info
object by an amount corresponding to the size of the allocated extent,
despite we haven't previously incremented it. When the counter currently
has a value smaller then the allocated extent we reset the counter to 0
and emit a warning, otherwise we just decrement it and slowly mess up
with this counter which is crucial for space reservation, the end result
can be granting reserved space to tasks when there isn't really enough
free space, and having the tasks fail later in critical places where
error handling consists of a transaction abort or hitting a BUG_ON().
Fix this by making sure that if we fallback to the COW path for a data
relocation inode, we increment the bytes_may_use counter of the data
space_info object. The COW path will then decrement it at
btrfs_add_reserved_bytes() on success or through its error handling part
by a call to extent_clear_unlock_delalloc() (which ends up calling
btrfs_clear_delalloc_extent() that does the decrement operation) in case
of an error.
Test case btrfs/061 from fstests could sporadically trigger this.
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When running relocation of a data block group while scrub is running in
parallel, it is possible that the relocation will fail and abort the
current transaction with an -EINVAL error:
[134243.988595] BTRFS info (device sdc): found 14 extents, stage: move data extents
[134243.999871] ------------[ cut here ]------------
[134244.000741] BTRFS: Transaction aborted (error -22)
[134244.001692] WARNING: CPU: 0 PID: 26954 at fs/btrfs/ctree.c:1071 __btrfs_cow_block+0x6a7/0x790 [btrfs]
[134244.003380] Modules linked in: btrfs blake2b_generic xor raid6_pq (...)
[134244.012577] CPU: 0 PID: 26954 Comm: btrfs Tainted: G W 5.6.0-rc7-btrfs-next-58 #5
[134244.014162] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
[134244.016184] RIP: 0010:__btrfs_cow_block+0x6a7/0x790 [btrfs]
[134244.017151] Code: 48 c7 c7 (...)
[134244.020549] RSP: 0018:ffffa41607863888 EFLAGS: 00010286
[134244.021515] RAX: 0000000000000000 RBX: ffff9614bdfe09c8 RCX: 0000000000000000
[134244.022822] RDX: 0000000000000001 RSI: ffffffffb3d63980 RDI: 0000000000000001
[134244.024124] RBP: ffff961589e8c000 R08: 0000000000000000 R09: 0000000000000001
[134244.025424] R10: ffffffffc0ae5955 R11: 0000000000000000 R12: ffff9614bd530d08
[134244.026725] R13: ffff9614ced41b88 R14: ffff9614bdfe2a48 R15: 0000000000000000
[134244.028024] FS: 00007f29b63c08c0(0000) GS:ffff9615ba600000(0000) knlGS:0000000000000000
[134244.029491] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[134244.030560] CR2: 00007f4eb339b000 CR3: 0000000130d6e006 CR4: 00000000003606f0
[134244.031997] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[134244.033153] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[134244.034484] Call Trace:
[134244.034984] btrfs_cow_block+0x12b/0x2b0 [btrfs]
[134244.035859] do_relocation+0x30b/0x790 [btrfs]
[134244.036681] ? do_raw_spin_unlock+0x49/0xc0
[134244.037460] ? _raw_spin_unlock+0x29/0x40
[134244.038235] relocate_tree_blocks+0x37b/0x730 [btrfs]
[134244.039245] relocate_block_group+0x388/0x770 [btrfs]
[134244.040228] btrfs_relocate_block_group+0x161/0x2e0 [btrfs]
[134244.041323] btrfs_relocate_chunk+0x36/0x110 [btrfs]
[134244.041345] btrfs_balance+0xc06/0x1860 [btrfs]
[134244.043382] ? btrfs_ioctl_balance+0x27c/0x310 [btrfs]
[134244.045586] btrfs_ioctl_balance+0x1ed/0x310 [btrfs]
[134244.045611] btrfs_ioctl+0x1880/0x3760 [btrfs]
[134244.049043] ? do_raw_spin_unlock+0x49/0xc0
[134244.049838] ? _raw_spin_unlock+0x29/0x40
[134244.050587] ? __handle_mm_fault+0x11b3/0x14b0
[134244.051417] ? ksys_ioctl+0x92/0xb0
[134244.052070] ksys_ioctl+0x92/0xb0
[134244.052701] ? trace_hardirqs_off_thunk+0x1a/0x1c
[134244.053511] __x64_sys_ioctl+0x16/0x20
[134244.054206] do_syscall_64+0x5c/0x280
[134244.054891] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[134244.055819] RIP: 0033:0x7f29b51c9dd7
[134244.056491] Code: 00 00 00 (...)
[134244.059767] RSP: 002b:00007ffcccc1dd08 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
[134244.061168] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f29b51c9dd7
[134244.062474] RDX: 00007ffcccc1dda0 RSI: 00000000c4009420 RDI: 0000000000000003
[134244.063771] RBP: 0000000000000003 R08: 00005565cea4b000 R09: 0000000000000000
[134244.065032] R10: 0000000000000541 R11: 0000000000000202 R12: 00007ffcccc2060a
[134244.066327] R13: 00007ffcccc1dda0 R14: 0000000000000002 R15: 00007ffcccc1dec0
[134244.067626] irq event stamp: 0
[134244.068202] hardirqs last enabled at (0): [<0000000000000000>] 0x0
[134244.069351] hardirqs last disabled at (0): [<ffffffffb2abdedf>] copy_process+0x74f/0x2020
[134244.070909] softirqs last enabled at (0): [<ffffffffb2abdedf>] copy_process+0x74f/0x2020
[134244.072392] softirqs last disabled at (0): [<0000000000000000>] 0x0
[134244.073432] ---[ end trace bd7c03622e0b0a99 ]---
The -EINVAL error comes from the following chain of function calls:
__btrfs_cow_block() <-- aborts the transaction
btrfs_reloc_cow_block()
replace_file_extents()
get_new_location() <-- returns -EINVAL
When relocating a data block group, for each allocated extent of the block
group, we preallocate another extent (at prealloc_file_extent_cluster()),
associated with the data relocation inode, and then dirty all its pages.
These preallocated extents have, and must have, the same size that extents
from the data block group being relocated have.
Later before we start the relocation stage that updates pointers (bytenr
field of file extent items) to point to the the new extents, we trigger
writeback for the data relocation inode. The expectation is that writeback
will write the pages to the previously preallocated extents, that it
follows the NOCOW path. That is generally the case, however, if a scrub
is running it may have turned the block group that contains those extents
into RO mode, in which case writeback falls back to the COW path.
However in the COW path instead of allocating exactly one extent with the
expected size, the allocator may end up allocating several smaller extents
due to free space fragmentation - because we tell it at cow_file_range()
that the minimum allocation size can match the filesystem's sector size.
This later breaks the relocation's expectation that an extent associated
to a file extent item in the data relocation inode has the same size as
the respective extent pointed by a file extent item in another tree - in
this case the extent to which the relocation inode poins to is smaller,
causing relocation.c:get_new_location() to return -EINVAL.
For example, if we are relocating a data block group X that has a logical
address of X and the block group has an extent allocated at the logical
address X + 128KiB with a size of 64KiB:
1) At prealloc_file_extent_cluster() we allocate an extent for the data
relocation inode with a size of 64KiB and associate it to the file
offset 128KiB (X + 128KiB - X) of the data relocation inode. This
preallocated extent was allocated at block group Z;
2) A scrub running in parallel turns block group Z into RO mode and
starts scrubing its extents;
3) Relocation triggers writeback for the data relocation inode;
4) When running delalloc (btrfs_run_delalloc_range()), we try first the
NOCOW path because the data relocation inode has BTRFS_INODE_PREALLOC
set in its flags. However, because block group Z is in RO mode, the
NOCOW path (run_delalloc_nocow()) falls back into the COW path, by
calling cow_file_range();
5) At cow_file_range(), in the first iteration of the while loop we call
btrfs_reserve_extent() to allocate a 64KiB extent and pass it a minimum
allocation size of 4KiB (fs_info->sectorsize). Due to free space
fragmentation, btrfs_reserve_extent() ends up allocating two extents
of 32KiB each, each one on a different iteration of that while loop;
6) Writeback of the data relocation inode completes;
7) Relocation proceeds and ends up at relocation.c:replace_file_extents(),
with a leaf which has a file extent item that points to the data extent
from block group X, that has a logical address (bytenr) of X + 128KiB
and a size of 64KiB. Then it calls get_new_location(), which does a
lookup in the data relocation tree for a file extent item starting at
offset 128KiB (X + 128KiB - X) and belonging to the data relocation
inode. It finds a corresponding file extent item, however that item
points to an extent that has a size of 32KiB, which doesn't match the
expected size of 64KiB, resuling in -EINVAL being returned from this
function and propagated up to __btrfs_cow_block(), which aborts the
current transaction.
To fix this make sure that at cow_file_range() when we call the allocator
we pass it a minimum allocation size corresponding the desired extent size
if the inode belongs to the data relocation tree, otherwise pass it the
filesystem's sector size as the minimum allocation size.
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is a race between block group removal and block group creation
when the removal is completed by a task running fitrim or scrub. When
this happens we end up failing the block group creation with an error
-EEXIST since we attempt to insert a duplicate block group item key
in the extent tree. That results in a transaction abort.
The race happens like this:
1) Task A is doing a fitrim, and at btrfs_trim_block_group() it freezes
block group X with btrfs_freeze_block_group() (until very recently
that was named btrfs_get_block_group_trimming());
2) Task B starts removing block group X, either because it's now unused
or due to relocation for example. So at btrfs_remove_block_group(),
while holding the chunk mutex and the block group's lock, it sets
the 'removed' flag of the block group and it sets the local variable
'remove_em' to false, because the block group is currently frozen
(its 'frozen' counter is > 0, until very recently this counter was
named 'trimming');
3) Task B unlocks the block group and the chunk mutex;
4) Task A is done trimming the block group and unfreezes the block group
by calling btrfs_unfreeze_block_group() (until very recently this was
named btrfs_put_block_group_trimming()). In this function we lock the
block group and set the local variable 'cleanup' to true because we
were able to decrement the block group's 'frozen' counter down to 0 and
the flag 'removed' is set in the block group.
Since 'cleanup' is set to true, it locks the chunk mutex and removes
the extent mapping representing the block group from the mapping tree;
5) Task C allocates a new block group Y and it picks up the logical address
that block group X had as the logical address for Y, because X was the
block group with the highest logical address and now the second block
group with the highest logical address, the last in the fs mapping tree,
ends at an offset corresponding to block group X's logical address (this
logical address selection is done at volumes.c:find_next_chunk()).
At this point the new block group Y does not have yet its item added
to the extent tree (nor the corresponding device extent items and
chunk item in the device and chunk trees). The new group Y is added to
the list of pending block groups in the transaction handle;
6) Before task B proceeds to removing the block group item for block
group X from the extent tree, which has a key matching:
(X logical offset, BTRFS_BLOCK_GROUP_ITEM_KEY, length)
task C while ending its transaction handle calls
btrfs_create_pending_block_groups(), which finds block group Y and
tries to insert the block group item for Y into the exten tree, which
fails with -EEXIST since logical offset is the same that X had and
task B hasn't yet deleted the key from the extent tree.
This failure results in a transaction abort, producing a stack like
the following:
------------[ cut here ]------------
BTRFS: Transaction aborted (error -17)
WARNING: CPU: 2 PID: 19736 at fs/btrfs/block-group.c:2074 btrfs_create_pending_block_groups+0x1eb/0x260 [btrfs]
Modules linked in: btrfs blake2b_generic xor raid6_pq (...)
CPU: 2 PID: 19736 Comm: fsstress Tainted: G W 5.6.0-rc7-btrfs-next-58 #5
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
RIP: 0010:btrfs_create_pending_block_groups+0x1eb/0x260 [btrfs]
Code: ff ff ff 48 8b 55 50 f0 48 (...)
RSP: 0018:ffffa4160a1c7d58 EFLAGS: 00010286
RAX: 0000000000000000 RBX: ffff961581909d98 RCX: 0000000000000000
RDX: 0000000000000001 RSI: ffffffffb3d63990 RDI: 0000000000000001
RBP: ffff9614f3356a58 R08: 0000000000000000 R09: 0000000000000001
R10: ffff9615b65b0040 R11: 0000000000000000 R12: ffff961581909c10
R13: ffff9615b0c32000 R14: ffff9614f3356ab0 R15: ffff9614be779000
FS: 00007f2ce2841e80(0000) GS:ffff9615bae00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000555f18780000 CR3: 0000000131d34005 CR4: 00000000003606e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
btrfs_start_dirty_block_groups+0x398/0x4e0 [btrfs]
btrfs_commit_transaction+0xd0/0xc50 [btrfs]
? btrfs_attach_transaction_barrier+0x1e/0x50 [btrfs]
? __ia32_sys_fdatasync+0x20/0x20
iterate_supers+0xdb/0x180
ksys_sync+0x60/0xb0
__ia32_sys_sync+0xa/0x10
do_syscall_64+0x5c/0x280
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7f2ce1d4d5b7
Code: 83 c4 08 48 3d 01 (...)
RSP: 002b:00007ffd8b558c58 EFLAGS: 00000202 ORIG_RAX: 00000000000000a2
RAX: ffffffffffffffda RBX: 000000000000002c RCX: 00007f2ce1d4d5b7
RDX: 00000000ffffffff RSI: 00000000186ba07b RDI: 000000000000002c
RBP: 0000555f17b9e520 R08: 0000000000000012 R09: 000000000000ce00
R10: 0000000000000078 R11: 0000000000000202 R12: 0000000000000032
R13: 0000000051eb851f R14: 00007ffd8b558cd0 R15: 0000555f1798ec20
irq event stamp: 0
hardirqs last enabled at (0): [<0000000000000000>] 0x0
hardirqs last disabled at (0): [<ffffffffb2abdedf>] copy_process+0x74f/0x2020
softirqs last enabled at (0): [<ffffffffb2abdedf>] copy_process+0x74f/0x2020
softirqs last disabled at (0): [<0000000000000000>] 0x0
---[ end trace bd7c03622e0b0a9c ]---
Fix this simply by making btrfs_remove_block_group() remove the block
group's item from the extent tree before it flags the block group as
removed. Also make the free space deletion from the free space tree
before flagging the block group as removed, to avoid a similar race
with adding and removing free space entries for the free space tree.
Fixes: 04216820fe ("Btrfs: fix race between fs trimming and block group remove/allocation")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When removing a block group, if we fail to delete the block group's item
from the extent tree, we jump to the 'out' label and end up decrementing
the block group's reference count once only (by 1), resulting in a counter
leak because the block group at that point was already removed from the
block group cache rbtree - so we have to decrement the reference count
twice, once for the rbtree and once for our lookup at the start of the
function.
There is a second bug where if removing the free space tree entries (the
call to remove_block_group_free_space()) fails we end up jumping to the
'out_put_group' label but end up decrementing the reference count only
once, when we should have done it twice, since we have already removed
the block group from the block group cache rbtree. This happens because
the reference count decrement for the rbtree reference happens after
attempting to remove the free space tree entries, which is far away from
the place where we remove the block group from the rbtree.
To make things less error prone, decrement the reference count for the
rbtree immediately after removing the block group from it. This also
eleminates the need for two different exit labels on error, renaming
'out_put_label' to just 'out' and removing the old 'out'.
Fixes: f6033c5e33 ("btrfs: fix block group leak when removing fails")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
afs_vnode_commit_status() is only ever called if op->error is 0, so remove
the op->error checks from the function.
Fixes: e49c7b2f6d ("afs: Build an abstraction around an "operation" concept")
Signed-off-by: David Howells <dhowells@redhat.com>
afs_check_for_remote_deletion() checks to see if error ENOENT is returned
by the server in response to an operation and, if so, marks the primary
vnode as having been deleted as the FID is no longer valid.
However, it's being called from the operation success functions, where no
abort has happened - and if an inline abort is recorded, it's handled by
afs_vnode_commit_status().
Fix this by actually calling the operation aborted method if provided and
having that point to afs_check_for_remote_deletion().
Fixes: e49c7b2f6d ("afs: Build an abstraction around an "operation" concept")
Signed-off-by: David Howells <dhowells@redhat.com>
Fix yfs_fs_fetch_status() to honour the vnode selector in
op->fetch_status.which as does afs_fs_fetch_status() that allows
afs_do_lookup() to use this as an alternative to the InlineBulkStatus RPC
call if not implemented by the server.
This doesn't matter in the current code as YFS servers always implement
InlineBulkStatus, but a subsequent will call it on YFS servers too in some
circumstances.
Fixes: e49c7b2f6d ("afs: Build an abstraction around an "operation" concept")
Signed-off-by: David Howells <dhowells@redhat.com>
The kernel uses internal mounts created by kern_mount() and populated
with files with no lookup path by alloc_file_pseudo() for a variety of
reasons. An example of such a mount is for anonymous pipes. For pipes,
every vfs_write() regardless of filesystem, calls fsnotify_modify()
to notify of any changes which incurs a small amount of overhead in
fsnotify even when there are no watchers. It can also trigger for reads
and readv and writev, it was simply vfs_write() that was noticed first.
A patch is pending that reduces, but does not eliminate, the overhead of
fsnotify but for files that cannot be looked up via a path, even that
small overhead is unnecessary. The user API for all notification
subsystems (inotify, fanotify, ...) is based on the pathname and a dirfd
and proc entries appear to be the only visible representation of the
files. Proc does not have the same pathname as the internal entry and
the proc inode is not the same as the internal inode so even if fanotify
is used on a file under /proc/XX/fd, no useful events are notified.
This patch changes alloc_file_pseudo() to always opt out of fsnotify by
setting FMODE_NONOTIFY flag so that no check is made for fsnotify
watchers on pseudo files. This should be safe as the underlying helper
for the dentry is d_alloc_pseudo() which explicitly states that no
lookups are ever performed meaning that fanotify should have nothing
useful to attach to.
The test motivating this was "perf bench sched messaging --pipe". On
a single-socket machine using threads the difference of the patch was
as follows.
5.7.0 5.7.0
vanilla nofsnotify-v1r1
Amean 1 1.3837 ( 0.00%) 1.3547 ( 2.10%)
Amean 3 3.7360 ( 0.00%) 3.6543 ( 2.19%)
Amean 5 5.8130 ( 0.00%) 5.7233 * 1.54%*
Amean 7 8.1490 ( 0.00%) 7.9730 * 2.16%*
Amean 12 14.6843 ( 0.00%) 14.1820 ( 3.42%)
Amean 18 21.8840 ( 0.00%) 21.7460 ( 0.63%)
Amean 24 28.8697 ( 0.00%) 29.1680 ( -1.03%)
Amean 30 36.0787 ( 0.00%) 35.2640 * 2.26%*
Amean 32 38.0527 ( 0.00%) 38.1223 ( -0.18%)
The difference is small but in some cases it's outside the noise so
while marginal, there is still some small benefit to ignoring fsnotify
for files allocated via alloc_file_pseudo() in some cases.
Link: https://lore.kernel.org/r/20200615121358.GF3183@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
There is a regular need in the kernel to provide a way to declare having a
dynamically sized set of trailing elements in a structure. Kernel code should
always use “flexible array members”[1] for these cases. The older style of
one-element or zero-length arrays should no longer be used[2].
[1] https://en.wikipedia.org/wiki/Flexible_array_member
[2] https://github.com/KSPP/linux/issues/21
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
There is a regular need in the kernel to provide a way to declare having a
dynamically sized set of trailing elements in a structure. Kernel code should
always use “flexible array members”[1] for these cases. The older style of
one-element or zero-length arrays should no longer be used[2].
[1] https://en.wikipedia.org/wiki/Flexible_array_member
[2] https://github.com/KSPP/linux/issues/21
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
There is a regular need in the kernel to provide a way to declare having a
dynamically sized set of trailing elements in a structure. Kernel code should
always use “flexible array members”[1] for these cases. The older style of
one-element or zero-length arrays should no longer be used[2].
[1] https://en.wikipedia.org/wiki/Flexible_array_member
[2] https://github.com/KSPP/linux/issues/21
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
includes the per-inode DAX support, which was dependant on the DAX
infrastructure which came in via the XFS tree, and a number of
regression and bug fixes; most notably the "BUG: using
smp_processor_id() in preemptible code in ext4_mb_new_blocks" reported
by syzkaller.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAl7mgCcACgkQ8vlZVpUN
gaPftwf8C4w/7SG+CYLdwg0d2u9TKk77yDuWaioFHOcMSjZvG4TCSgtMhZxQnyty
9t4yqacILx12pCj/mZnrZp5BOSn9O2ZbuDoXNKNrFXU0BF+CsbnhvJvrrh1j/MUa
PPtcqyGFdOLSDvHSD9xPVT76juwh79aR8vB7qnQXaEO5wcLodZWoqBEFSKCl6Bo8
hjXs1EXidusKsoarQxW6mEITmnhU2S2fuCVDgVcoM/LmKwzbgqvlWrentq9u8qLH
W+XbjWgUtCM1byeDZWqe5FYyyJ8x+dTv7H5an3KR92EN6hKo5AOvzA0I41pZscq/
bJ9p2THDxJQX4rJBevGAS5mZ6hTkRw==
=z6eO
-----END PGP SIGNATURE-----
Merge tag 'ext4-for-linus-5.8-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull more ext4 updates from Ted Ts'o:
"This is the second round of ext4 commits for 5.8 merge window [1].
It includes the per-inode DAX support, which was dependant on the DAX
infrastructure which came in via the XFS tree, and a number of
regression and bug fixes; most notably the "BUG: using
smp_processor_id() in preemptible code in ext4_mb_new_blocks" reported
by syzkaller"
[1] The pull request actually came in 15 minutes after I had tagged the
rc1 release. Tssk, tssk, late.. - Linus
* tag 'ext4-for-linus-5.8-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4, jbd2: ensure panic by fix a race between jbd2 abort and ext4 error handlers
ext4: support xattr gnu.* namespace for the Hurd
ext4: mballoc: Use this_cpu_read instead of this_cpu_ptr
ext4: avoid utf8_strncasecmp() with unstable name
ext4: stop overwrite the errcode in ext4_setup_super
ext4: fix partial cluster initialization when splitting extent
ext4: avoid race conditions when remounting with options that change dax
Documentation/dax: Update DAX enablement for ext4
fs/ext4: Introduce DAX inode flag
fs/ext4: Remove jflag variable
fs/ext4: Make DAX mount option a tri-state
fs/ext4: Only change S_DAX on inode load
fs/ext4: Update ext4_should_use_dax()
fs/ext4: Change EXT4_MOUNT_DAX to EXT4_MOUNT_DAX_ALWAYS
fs/ext4: Disallow verity if inode is DAX
fs/ext4: Narrow scope of DAX check in setflags
For an exiting process it tries to cancel all its inflight requests. Use
req->task to match such instead of work.pid. We always have req->task
set, and it will be valid because we're matching only current exiting
task.
Also, remove work.pid and everything related, it's useless now.
Reported-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There will be multiple places where req->task is used, so refcount-pin
it lazily with introduced *io_{get,put}_req_task(). We need to always
have valid ->task for cancellation reasons, but don't care about pinning
it in some cases. That's why it sets req->task in io_req_init() and
implements get/put laziness with a flag.
This also removes using @current from polling io_arm_poll_handler(),
etc., but doesn't change observable behaviour.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Instead of waiting for each request one by one, first try to cancel all
of them in a batched manner, and then go over inflight_list/etc to reap
leftovers.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If a process is going away, io_uring_flush() will cancel only 1
request with a matching pid. Cancel all of them
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This adds support for cancelling all io-wq works matching a predicate.
It isn't used yet, so no change in observable behaviour.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Go all over all pending lists and cancel works there, and only then
try to match running requests. No functional changes here, just a
preparation for bulk cancellation.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Abort code UAEOVERFLOW is returned when we try and set a time that's out of
range, but it's currently mapped to EREMOTEIO by the default case.
Fix UAEOVERFLOW to map instead to EOVERFLOW.
Found with the generic/258 xfstest. Note that the test is wrong as it
assumes that the filesystem will support a pre-UNIX-epoch date.
Fixes: 1eda8bab70 ("afs: Add support for the UAE error table")
Signed-off-by: David Howells <dhowells@redhat.com>
Fix the following issues:
(1) Fix writeback to reduce the size of a store operation to i_size,
effectively discarding the extra data.
The problem comes when afs_page_mkwrite() records that a page is about
to be modified by mmap(). It doesn't know what bits of the page are
going to be modified, so it records the whole page as being dirty
(this is stored in page->private as start and end offsets).
Without this, the marshalling for the store to the server extends the
size of the file to the end of the page (in afs_fs_store_data() and
yfs_fs_store_data()).
(2) Fix setattr to actually truncate the pagecache, thereby clearing
the discarded part of a file.
(3) Fix setattr to check that the new size is okay and to disable
ATTR_SIZE if i_size wouldn't change.
(4) Force i_size to be updated as the result of a truncate.
(5) Don't truncate if ATTR_SIZE is not set.
(6) Call pagecache_isize_extended() if the file was enlarged.
Note that truncate_set_size() isn't used because the setting of i_size is
done inside afs_vnode_commit_status() under the vnode->cb_lock.
Found with the generic/029 and generic/393 xfstests.
Fixes: 31143d5d51 ("AFS: implement basic file write support")
Fixes: 4343d00872 ("afs: Get rid of the afs_writeback record")
Signed-off-by: David Howells <dhowells@redhat.com>
The in-kernel afs filesystem ignores ctime because the AFS fileserver
protocol doesn't support ctimes. This, however, causes various xfstests to
fail.
Work around this by:
(1) Setting ctime to attr->ia_ctime in afs_setattr().
(2) Not ignoring ATTR_MTIME_SET, ATTR_TIMES_SET and ATTR_TOUCH settings.
(3) Setting the ctime from the server mtime when on the target file when
creating a hard link to it.
(4) Setting the ctime on directories from their revised mtimes when
renaming/moving a file.
Found by the generic/221 and generic/309 xfstests.
Signed-off-by: David Howells <dhowells@redhat.com>
When doing a partial writeback, afs_write_back_from_locked_page() may
generate an FS.StoreData RPC request that writes out part of a file when a
file has been constructed from pieces by doing seek, write, seek, write,
... as is done by ld.
The FS.StoreData RPC is given the current i_size as the file length, but
the server basically ignores it unless the data length is 0 (in which case
it's just a truncate operation). The revised file length returned in the
result of the RPC may then not reflect what we suggested - and this leads
to i_size getting moved backwards - which causes issues later.
Fix the client to take account of this by ignoring the returned file size
unless the data version number jumped unexpectedly - in which case we're
going to have to clear the pagecache and reload anyway.
This can be observed when doing a kernel build on an AFS mount. The
following pair of commands produce the issue:
ld -m elf_x86_64 -z max-page-size=0x200000 --emit-relocs \
-T arch/x86/realmode/rm/realmode.lds \
arch/x86/realmode/rm/header.o \
arch/x86/realmode/rm/trampoline_64.o \
arch/x86/realmode/rm/stack.o \
arch/x86/realmode/rm/reboot.o \
-o arch/x86/realmode/rm/realmode.elf
arch/x86/tools/relocs --realmode \
arch/x86/realmode/rm/realmode.elf \
>arch/x86/realmode/rm/realmode.relocs
This results in the latter giving:
Cannot read ELF section headers 0/18: Success
as the realmode.elf file got corrupted.
The sequence of events can also be driven with:
xfs_io -t -f \
-c "pwrite -S 0x58 0 0x58" \
-c "pwrite -S 0x59 10000 1000" \
-c "close" \
/afs/example.com/scratch/a
Fixes: 31143d5d51 ("AFS: implement basic file write support")
Signed-off-by: David Howells <dhowells@redhat.com>
Fix afs_write_end() to change i_size under vnode->cb_lock rather than
->wb_lock so that it doesn't race with afs_vnode_commit_status() and
afs_getattr().
The ->wb_lock is only meant to guard access to ->wb_keys which isn't
accessed by that piece of code.
Fixes: 4343d00872 ("afs: Get rid of the afs_writeback record")
Signed-off-by: David Howells <dhowells@redhat.com>
The mtime on an inode needs to be updated when a write is made into an
mmap'ed section. There are three ways in which this could be done: update
it when page_mkwrite is called, update it when a page is changed from dirty
to writeback or leave it to the server and fix the mtime up from the reply
to the StoreData RPC.
Found with the generic/215 xfstest.
Fixes: 1cf7a1518a ("afs: Implement shared-writeable mmap")
Signed-off-by: David Howells <dhowells@redhat.com>
Applications that read EFI variables may see a return
value of -EINTR if they exceed the rate limit and a
signal delivery is attempted while the process is sleeping.
This is quite surprising to the application, which probably
doesn't have code to handle it.
Change the interruptible sleep to a non-interruptible one.
Reported-by: Lennart Poettering <mzxreary@0pointer.de>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20200528194905.690-3-tony.luck@intel.com
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Some applications want to be able to see when EFI variables
have been updated.
Update the modification time for successful writes.
Reported-by: Lennart Poettering <mzxreary@0pointer.de>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20200528194905.690-2-tony.luck@intel.com
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
The only use of I_DIRTY_TIME_EXPIRE is to detect in
__writeback_single_inode() that inode got there because flush worker
decided it's time to writeback the dirty inode time stamps (either
because we are syncing or because of age). However we can detect this
directly in __writeback_single_inode() and there's no need for the
strange propagation with I_DIRTY_TIME_EXPIRE flag.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
When we are processing writeback for sync(2), move_expired_inodes()
didn't set any inode expiry value (older_than_this). This can result in
writeback never completing if there's steady stream of inodes added to
b_dirty_time list as writeback rechecks dirty lists after each writeback
round whether there's more work to be done. Fix the problem by using
sync(2) start time is inode expiry value when processing b_dirty_time
list similarly as for ordinarily dirtied inodes. This requires some
refactoring of older_than_this handling which simplifies the code
noticeably as a bonus.
Fixes: 0ae45f63d4 ("vfs: add support for a lazytime mount option")
CC: stable@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Inode's i_io_list list head is used to attach inode to several different
lists - wb->{b_dirty, b_dirty_time, b_io, b_more_io}. When flush worker
prepares a list of inodes to writeback e.g. for sync(2), it moves inodes
to b_io list. Thus it is critical for sync(2) data integrity guarantees
that inode is not requeued to any other writeback list when inode is
queued for processing by flush worker. That's the reason why
writeback_single_inode() does not touch i_io_list (unless the inode is
completely clean) and why __mark_inode_dirty() does not touch i_io_list
if I_SYNC flag is set.
However there are two flaws in the current logic:
1) When inode has only I_DIRTY_TIME set but it is already queued in b_io
list due to sync(2), concurrent __mark_inode_dirty(inode, I_DIRTY_SYNC)
can still move inode back to b_dirty list resulting in skipping
writeback of inode time stamps during sync(2).
2) When inode is on b_dirty_time list and writeback_single_inode() races
with __mark_inode_dirty() like:
writeback_single_inode() __mark_inode_dirty(inode, I_DIRTY_PAGES)
inode->i_state |= I_SYNC
__writeback_single_inode()
inode->i_state |= I_DIRTY_PAGES;
if (inode->i_state & I_SYNC)
bail
if (!(inode->i_state & I_DIRTY_ALL))
- not true so nothing done
We end up with I_DIRTY_PAGES inode on b_dirty_time list and thus
standard background writeback will not writeback this inode leading to
possible dirty throttling stalls etc. (thanks to Martijn Coenen for this
analysis).
Fix these problems by tracking whether inode is queued in b_io or
b_more_io lists in a new I_SYNC_QUEUED flag. When this flag is set, we
know flush worker has queued inode and we should not touch i_io_list.
On the other hand we also know that once flush worker is done with the
inode it will requeue the inode to appropriate dirty list. When
I_SYNC_QUEUED is not set, __mark_inode_dirty() can (and must) move inode
to appropriate dirty list.
Reported-by: Martijn Coenen <maco@android.com>
Reviewed-by: Martijn Coenen <maco@android.com>
Tested-by: Martijn Coenen <maco@android.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Fixes: 0ae45f63d4 ("vfs: add support for a lazytime mount option")
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Currently, operations on inode->i_io_list are protected by
wb->list_lock. In the following patches we'll need to maintain
consistency between inode->i_state and inode->i_io_list so change the
code so that inode->i_lock protects also all inode's i_io_list handling.
Reviewed-by: Martijn Coenen <maco@android.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
CC: stable@vger.kernel.org # Prerequisite for "writeback: Avoid skipping inode writeback"
Signed-off-by: Jan Kara <jack@suse.cz>
The damn file is constant-sized - 64 bytes. IOW,
* i_size_read() is pointless
* so's dynamic allocation
* so's the 'size' argument of user_dlm_read_lvb()
* ... and so's open-coding simple_read_from_buffer(), while we are at it.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl7lZwgACgkQxWXV+ddt
WDuj6g/9E2JtqeO8zRMLb+Do/n5YX0dFHt+dM1AGY+nw8hb3U9Vlgc8KJa7UpZFX
opl1i9QL+cJLoZMZL5xZhDouMQlum5cGVV3hLwqEPYetRF/ytw/kunWAg5o8OW1R
sJxGcjyiiKpZLVx6nMjGnYjsrbOJv0HlaWfY3NCon4oQ8yQTzTPMPBevPWRM7Iqw
Ssi8pA8zXCc2QoLgyk6Pe/IGeox8+z9RA2akHkJIdMWiPHm43RDF4Yx3Yl9NHHZA
M+pLVKjZoejqwVaai8osBqWVw4Ypax1+CJit6iHGwJDkQyFPcMXMsOc5ZYBnT5or
k/ceVMCs+ejvCK1+L30u7FQRiDqf5Fwhf/SGfq7+y83KbEjMfWOya3Lyk47fbDD4
776rSaS6ejqVklWppbaPhntSrBtPR1NaDOfi55bc9TOe+yW7Du+AsQMlEE0bTJaW
eHl+A4AP/nDlo8Etn1jTWd023bzzO+iySMn3YZfK0vw3vkj3JfrCGXx6DEYipOou
uEUj0jDo/rdiB5S3GdUCujjaPgm/f0wkPudTRB9lpxJas2qFU+qo2TLJhEleELwj
m4laz7W7S+nUFP0LRl8O82AzBfjm+oHjWTpfdloT6JW9Da8/iuZ/x9VBWQ8mFJwX
U0cR3zVqUuWcK78fZa/FFgGPBxlwUv2j+OhRGsS0/orDRlrwcXo=
=5S0s
-----END PGP SIGNATURE-----
Merge tag 'for-5.8-part2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"This reverts the direct io port to iomap infrastructure of btrfs
merged in the first pull request. We found problems in invalidate page
that don't seem to be fixable as regressions or without changing iomap
code that would not affect other filesystems.
There are four reverts in total, but three of them are followup
cleanups needed to revert a43a67a2d7 cleanly. The result is the
buffer head based implementation of direct io.
Reverts are not great, but under current circumstances I don't see
better options"
* tag 'for-5.8-part2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
Revert "btrfs: switch to iomap_dio_rw() for dio"
Revert "fs: remove dio_end_io()"
Revert "btrfs: remove BTRFS_INODE_READDIO_NEED_LOCK"
Revert "btrfs: split btrfs_direct_IO to read and write part"
This reverts commit a43a67a2d7.
This patch reverts the main part of switching direct io implementation
to iomap infrastructure. There's a problem in invalidate page that
couldn't be solved as regression in this development cycle.
The problem occurs when buffered and direct io are mixed, and the ranges
overlap. Although this is not recommended, filesystems implement
measures or fallbacks to make it somehow work. In this case, fallback to
buffered IO would be an option for btrfs (this already happens when
direct io is done on compressed data), but the change would be needed in
the iomap code, bringing new semantics to other filesystems.
Another problem arises when again the buffered and direct ios are mixed,
invalidation fails, then -EIO is set on the mapping and fsync will fail,
though there's no real error.
There have been discussions how to fix that, but revert seems to be the
least intrusive option.
Link: https://lore.kernel.org/linux-btrfs/20200528192103.xm45qoxqmkw7i5yl@fiona/
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAl7kX6EACgkQiiy9cAdy
T1GdiQwAqMaDRVLZWeV5Uc0EM9AGkrWVu6F5n9nBzuKDTXCAf8aKCyiyYdMz/P20
belQA3bPG4jkLa/4Or1XfTY2OSSV4eGBlTfjHNeW2ZJ5pJWGInqCHuVco/M98om8
57JMTMZDTxN6884U+v3bBl4jDE6MqK3QS0WfA63ufd0T8ZnFOGDBn1DieJKbViyy
ZckpDH0etaAxO171SV5VwzbFe9U7OeTXupD8LYEHngR7vfaFCkX6ZftYYN0aWsvs
uL3p6K1kiNNxTXm0M3Hw6Gpk1nEAM9/6nOR6+TUppor+rQVJCH5F7NKQVrR92MDq
Qgwldt16DP1NjOb0q5L37HIg+9kD2kshKs9CErneen6eWtcfiN0HYT35hBxVi7RT
XT/dMt17wq3waoq92+RY3U4vb47QVWS6asH4/sqsTqUMWrlEYNGkEuCfeniZzJfO
bxglNPVafQ5qy2DWBzsAUX/isaR06FihEKqODK+K78KGTptim/+ip9+yXGjM6ne2
lhdWspC5
=Iwqj
-----END PGP SIGNATURE-----
Merge tag '5.8-rc-smb3-fixes-part2' of git://git.samba.org/sfrench/cifs-2.6
Pull more cifs updates from Steve French:
"12 cifs/smb3 fixes, 2 for stable.
- add support for idsfromsid on create and chgrp/chown allowing
ability to save owner information more naturally for some workloads
- improve query info (getattr) when SMB3.1.1 posix extensions are
negotiated by using new query info level"
* tag '5.8-rc-smb3-fixes-part2' of git://git.samba.org/sfrench/cifs-2.6:
smb3: Add debug message for new file creation with idsfromsid mount option
cifs: fix chown and chgrp when idsfromsid mount option enabled
smb3: allow uid and gid owners to be set on create with idsfromsid mount option
smb311: Add tracepoints for new compound posix query info
smb311: add support for using info level for posix extensions query
smb311: Add support for lookup with posix extensions query info
smb311: Add support for SMB311 query info (non-compounded)
SMB311: Add support for query info using posix extensions (level 100)
smb3: add indatalen that can be a non-zero value to calculation of credit charge in smb2 ioctl
smb3: fix typo in mount options displayed in /proc/mounts
cifs: Add get_security_type_str function to return sec type.
smb3: extend fscache mount volume coherency check
- fix build rules in binderfs sample
- fix build errors when Kbuild recurses to the top Makefile
- covert '---help---' in Kconfig to 'help'
-----BEGIN PGP SIGNATURE-----
iQJJBAABCgAzFiEEbmPs18K1szRHjPqEPYsBB53g2wYFAl7lBuYVHG1hc2FoaXJv
eUBrZXJuZWwub3JnAAoJED2LAQed4NsGHvIP/3iErjPshpg/phwH8NTCS4SFkiti
BZRM+2lupSn7Qs53BTpVzIkXoHBJQZlJxlQ5HY8ScO+fiz28rKZr+b40us+je1Q+
SkvSPfwZzxjEg7lAZutznG4KgItJLWJKmDyh9T8Y8TAuG4f8WO0hKnXoAp3YorS2
zppEIxso8O5spZPjp+fF/fPbxPjIsabGK7Jp2LpSVFR5pVDHI/ycTlKQS+MFpMEx
6JIpdFRw7TkvKew1dr5uAWT5btWHatEqjSR3JeyVHv3EICTGQwHmcHK67cJzGInK
T51+DT7/CpKtmRgGMiTEu/INfMzzoQAKl6Fcu+vMaShTN97Hk9DpdtQyvA6P/h3L
8GA4UBct05J7fjjIB7iUD+GYQ0EZbaFujzRXLYk+dQqEJRbhcCwvdzggGp0WvGRs
1f8/AIpgnQv8JSL/bOMgGMS5uL2dSLsgbzTdr6RzWf1jlYdI1i4u7AZ/nBrwWP+Z
iOBkKsVceEoJrTbaynl3eoYqFLtWyDau+//oBc2gUvmhn8ioM5dfqBRiJjxJnPG9
/giRj6xRIqMMEw8Gg8PCG7WebfWxWyaIQwlWBbPok7DwISURK5mvOyakZL+Q25/y
6MBr2H8NEJsf35q0GTINpfZnot7NX4JXrrndJH8NIRC7HEhwd29S041xlQJdP0rs
E76xsOr3hrAmBu4P
=1NIT
-----END PGP SIGNATURE-----
Merge tag 'kbuild-v5.8-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
Pull more Kbuild updates from Masahiro Yamada:
- fix build rules in binderfs sample
- fix build errors when Kbuild recurses to the top Makefile
- covert '---help---' in Kconfig to 'help'
* tag 'kbuild-v5.8-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
treewide: replace '---help---' in Kconfig files with 'help'
kbuild: fix broken builds because of GZIP,BZIP2,LZOP variables
samples: binderfs: really compile this sample and fix build issues
- Fix an integer overflow problem in the unshare actor.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAl7fCTEACgkQ+H93GTRK
tOuQxQ//Ya/xLx9UPoZepTzjHQKl2MlYVYRfKCL60NrH6kNpvq9jyGiPg6xOXc3g
KGTe23YDiuP80L3hpIZ9yj/SbJAItI8LsqHHrvVDbAdVSQdK56ajZqq3xwyvOC9u
RqCkGkVzRE+nmToJQbYCSmPA446aqMWuCpmlsTbuGmjvkRKAMgBBG/66nbcplQnC
eeflcVW7IdbbQ45K8QpyP4AeNMobc26B7zmWqXYeZuMxHcFsrnvld3pgke39i8Hk
k0SzMenGddYfb6/FknnxHASMnqnhE7lA1YyWe7F3uDM8OwmpNIseBysqm+6tETkn
DBlcpVeENNJB7ygPhqOJXmmDGnap5Y7vwhAc8jX84yuXRkd0gx5aTRIyH8cNp9lQ
TRwoVY9DTUkUlMkSLpgeCFIOR5SyOW3H4xZV4PC0sJxAWtM0J3B8A5zvAjQ5kVRP
79gVRpl2OUj648nbrPRwhDBwnNZAhflRVvBh9kasteA7SAtuGJFJKZZ162Smltz2
1E9i/2CvUUartNOjKkT3qPzAF6B1Je3AGTMwuDPhcYX9bdW+9pCD09yi1CiGOn7S
QuuwyHTAcLRtZiShNCG6zQhqq++zQCZ58J1IBHYajE73YM1+8r/5wCfTIhB+CPuf
J0rjqS+d151d2qMBnK6oag0t2u5Hj+xlcJw9QnQGqPKs6yIktA0=
=s+Pr
-----END PGP SIGNATURE-----
Merge tag 'iomap-5.8-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull iomap fix from Darrick Wong:
"A single iomap bug fix for a variable type mistake on 32-bit
architectures, fixing an integer overflow problem in the unshare
actor"
* tag 'iomap-5.8-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
iomap: Fix unsharing of an extent >2GB on a 32-bit machine
- Fix a resource leak on an error bailout.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAl7fCpAACgkQ+H93GTRK
tOtlog//ZkKRzp72HXCTgGpQj0IjkCjuZlz0F8FpdVhl9lOANaZPoXDbCIAax8q1
67wfDG7p8wl109KZnMuaPPXSC5KlynaWphSs7XMXqgLFXViha31c6U7PSMyxZmBB
674hE9eKnVNjhkMk98MtVV3ShWge9T5yGVXYhQbXMWDx8GCdNd9NEP3qnMcBEaLt
EPl6yoOfdNnKo37ptrt1Qb2NgORDBDDHYPr6SX/xEYDsppDLp8u+k/YGhuoJVtdc
HGR08ryIn6lctvkLbqDxtFzFxIL8Za7AHrBXilgioJYRJ78v7VyCnj1u8eT/axsa
ZUis/sQXjgvSvlsGZQZkyPdtnfhFbzXCeulyQvrMnEheMuz691dljMid3fEBkfmq
SubqE+HDP8aC6Zs9EkV/lEtdTH+EQ2ojZHH9s5oi6qbvilfFxyoPUfIxog+bhqPO
fwl1sL2nb/eQuBF+DeHg4UxP9WzA06Z1q9nZpDjrY224aMOWnrN8TBOKv4FZiRDt
M1l3VXcVsaDbCmbOsCTXdLh0Ap3przjk4hFPOjPJxlTzTNO9rPLhopvuLd+J3quA
fzNNBA4bMSq3IFSg3VEC2U3YgF3anGrt8PuopIwCH8muc9agCs//fI3Y/eI4k9oT
VOUPSxKckZ6SAEhIr7uTyKFzS+yNFBaYN/Y0FqDGnzbf5Bqr9NM=
=C0rZ
-----END PGP SIGNATURE-----
Merge tag 'xfs-5.8-merge-9' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs fix from Darrick Wong:
"We've settled down into the bugfix phase; this one fixes a resource
leak on an error bailout path"
* tag 'xfs-5.8-merge-9' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: Add the missed xfs_perag_put() for xfs_ifree_cluster()
Since commit 84af7a6194 ("checkpatch: kconfig: prefer 'help' over
'---help---'"), the number of '---help---' has been gradually
decreasing, but there are still more than 2400 instances.
This commit finishes the conversion. While I touched the lines,
I also fixed the indentation.
There are a variety of indentation styles found.
a) 4 spaces + '---help---'
b) 7 spaces + '---help---'
c) 8 spaces + '---help---'
d) 1 space + 1 tab + '---help---'
e) 1 tab + '---help---' (correct indentation)
f) 1 tab + 1 space + '---help---'
g) 1 tab + 2 spaces + '---help---'
In order to convert all of them to 1 tab + 'help', I ran the
following commend:
$ find . -name 'Kconfig*' | xargs sed -i 's/^[[:space:]]*---help---/\thelp/'
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEqG5UsNXhtOCrfGQP+7dXa6fLC2sFAl7U/i8ACgkQ+7dXa6fL
C2u2eg/+Oy6ybq0hPovYVkFI9WIG7ZCz7w9Q6BEnfYMqqn3dnfJxKQ3l4pnQEOWw
f4QfvpvevsYfMtOJkYcG6s66rQgbFdqc5TEyBBy0QNp3acRolN7IXkcopvv9xOpQ
JxedpbFG1PTFLWjvBpyjlrUPouwLzq2FXAf1Ox0ZIMw6165mYOMWoli1VL8dh0A0
Ai7JUB0WrvTNbrwhV413obIzXT/rPCdcrgbQcgrrLPex8lQ47ZAE9bq6k4q5HiwK
KRzEqkQgnzId6cCNTFBfkTWsx89zZunz7jkfM5yx30MvdAtPSxvvpfIPdZRZkXsP
E2K9Fk1/6OQZTC0Op3Pi/bt+hVG/mD1p0sQUDgo2MO3qlSS+5mMkR8h3mJEgwK12
72P4YfOJkuAy2z3v4lL0GYdUDAZY6i6G8TMxERKu/a9O3VjTWICDOyBUS6F8YEAK
C7HlbZxAEOKTVK0BTDTeEUBwSeDrBbvH6MnRlZCG5g1Fos2aWP0udhjiX8IfZLO7
GN6nWBvK1fYzfsUczdhgnoCzQs3suoDo04HnsTPGJ8De52T4x2RsjV+gPx0nrNAq
eWChl1JvMWsY2B3GLnl9XQz4NNN+EreKEkk+PULDGllrArrPsp5Vnhb9FJO1PVCU
hMDJHohPiXnKbc8f4Bd78OhIvnuoGfJPdM5MtNe2flUKy2a2ops=
=YTGf
-----END PGP SIGNATURE-----
Merge tag 'notifications-20200601' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
Pull notification queue from David Howells:
"This adds a general notification queue concept and adds an event
source for keys/keyrings, such as linking and unlinking keys and
changing their attributes.
Thanks to Debarshi Ray, we do have a pull request to use this to fix a
problem with gnome-online-accounts - as mentioned last time:
https://gitlab.gnome.org/GNOME/gnome-online-accounts/merge_requests/47
Without this, g-o-a has to constantly poll a keyring-based kerberos
cache to find out if kinit has changed anything.
[ There are other notification pending: mount/sb fsinfo notifications
for libmount that Karel Zak and Ian Kent have been working on, and
Christian Brauner would like to use them in lxc, but let's see how
this one works first ]
LSM hooks are included:
- A set of hooks are provided that allow an LSM to rule on whether or
not a watch may be set. Each of these hooks takes a different
"watched object" parameter, so they're not really shareable. The
LSM should use current's credentials. [Wanted by SELinux & Smack]
- A hook is provided to allow an LSM to rule on whether or not a
particular message may be posted to a particular queue. This is
given the credentials from the event generator (which may be the
system) and the watch setter. [Wanted by Smack]
I've provided SELinux and Smack with implementations of some of these
hooks.
WHY
===
Key/keyring notifications are desirable because if you have your
kerberos tickets in a file/directory, your Gnome desktop will monitor
that using something like fanotify and tell you if your credentials
cache changes.
However, we also have the ability to cache your kerberos tickets in
the session, user or persistent keyring so that it isn't left around
on disk across a reboot or logout. Keyrings, however, cannot currently
be monitored asynchronously, so the desktop has to poll for it - not
so good on a laptop. This facility will allow the desktop to avoid the
need to poll.
DESIGN DECISIONS
================
- The notification queue is built on top of a standard pipe. Messages
are effectively spliced in. The pipe is opened with a special flag:
pipe2(fds, O_NOTIFICATION_PIPE);
The special flag has the same value as O_EXCL (which doesn't seem
like it will ever be applicable in this context)[?]. It is given up
front to make it a lot easier to prohibit splice&co from accessing
the pipe.
[?] Should this be done some other way? I'd rather not use up a new
O_* flag if I can avoid it - should I add a pipe3() system call
instead?
The pipe is then configured::
ioctl(fds[1], IOC_WATCH_QUEUE_SET_SIZE, queue_depth);
ioctl(fds[1], IOC_WATCH_QUEUE_SET_FILTER, &filter);
Messages are then read out of the pipe using read().
- It should be possible to allow write() to insert data into the
notification pipes too, but this is currently disabled as the
kernel has to be able to insert messages into the pipe *without*
holding pipe->mutex and the code to make this work needs careful
auditing.
- sendfile(), splice() and vmsplice() are disabled on notification
pipes because of the pipe->mutex issue and also because they
sometimes want to revert what they just did - but one or more
notification messages might've been interleaved in the ring.
- The kernel inserts messages with the wait queue spinlock held. This
means that pipe_read() and pipe_write() have to take the spinlock
to update the queue pointers.
- Records in the buffer are binary, typed and have a length so that
they can be of varying size.
This allows multiple heterogeneous sources to share a common
buffer; there are 16 million types available, of which I've used
just a few, so there is scope for others to be used. Tags may be
specified when a watchpoint is created to help distinguish the
sources.
- Records are filterable as types have up to 256 subtypes that can be
individually filtered. Other filtration is also available.
- Notification pipes don't interfere with each other; each may be
bound to a different set of watches. Any particular notification
will be copied to all the queues that are currently watching for it
- and only those that are watching for it.
- When recording a notification, the kernel will not sleep, but will
rather mark a queue as having lost a message if there's
insufficient space. read() will fabricate a loss notification
message at an appropriate point later.
- The notification pipe is created and then watchpoints are attached
to it, using one of:
keyctl_watch_key(KEY_SPEC_SESSION_KEYRING, fds[1], 0x01);
watch_mount(AT_FDCWD, "/", 0, fd, 0x02);
watch_sb(AT_FDCWD, "/mnt", 0, fd, 0x03);
where in both cases, fd indicates the queue and the number after is
a tag between 0 and 255.
- Watches are removed if either the notification pipe is destroyed or
the watched object is destroyed. In the latter case, a message will
be generated indicating the enforced watch removal.
Things I want to avoid:
- Introducing features that make the core VFS dependent on the
network stack or networking namespaces (ie. usage of netlink).
- Dumping all this stuff into dmesg and having a daemon that sits
there parsing the output and distributing it as this then puts the
responsibility for security into userspace and makes handling
namespaces tricky. Further, dmesg might not exist or might be
inaccessible inside a container.
- Letting users see events they shouldn't be able to see.
TESTING AND MANPAGES
====================
- The keyutils tree has a pipe-watch branch that has keyctl commands
for making use of notifications. Proposed manual pages can also be
found on this branch, though a couple of them really need to go to
the main manpages repository instead.
If the kernel supports the watching of keys, then running "make
test" on that branch will cause the testing infrastructure to spawn
a monitoring process on the side that monitors a notifications pipe
for all the key/keyring changes induced by the tests and they'll
all be checked off to make sure they happened.
https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/keyutils.git/log/?h=pipe-watch
- A test program is provided (samples/watch_queue/watch_test) that
can be used to monitor for keyrings, mount and superblock events.
Information on the notifications is simply logged to stdout"
* tag 'notifications-20200601' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
smack: Implement the watch_key and post_notification hooks
selinux: Implement the watch_key security hook
keys: Make the KEY_NEED_* perms an enum rather than a mask
pipe: Add notification lossage handling
pipe: Allow buffers to be marked read-whole-or-error for notifications
Add sample notification program
watch_queue: Add a key/keyring notification facility
security: Add hooks to rule on setting a watch
pipe: Add general notification queue support
pipe: Add O_NOTIFICATION_PIPE
security: Add a hook for the point of notification insertion
uapi: General notification queue definitions
Pavel noticed that a debug message (disabled by default) in creating the security
descriptor context could be useful for new file creation owner fields
(as we already have for the mode) when using mount parm idsfromsid.
[38120.392272] CIFS: FYI: owner S-1-5-88-1-0, group S-1-5-88-2-0
[38125.792637] CIFS: FYI: owner S-1-5-88-1-1000, group S-1-5-88-2-1000
Also cleans up a typo in a comment
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Pull proc fix from Eric Biederman:
"Much to my surprise syzbot found a very old bug in proc that the
recent changes made easier to reproce. This bug is subtle enough it
looks like it fooled everyone who should know better"
* 'proc-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
proc: Use new_inode not new_inode_pseudo
Recently syzbot reported that unmounting proc when there is an ongoing
inotify watch on the root directory of proc could result in a use
after free when the watch is removed after the unmount of proc
when the watcher exits.
Commit 69879c01a0 ("proc: Remove the now unnecessary internal mount
of proc") made it easier to unmount proc and allowed syzbot to see the
problem, but looking at the code it has been around for a long time.
Looking at the code the fsnotify watch should have been removed by
fsnotify_sb_delete in generic_shutdown_super. Unfortunately the inode
was allocated with new_inode_pseudo instead of new_inode so the inode
was not on the sb->s_inodes list. Which prevented
fsnotify_unmount_inodes from finding the inode and removing the watch
as well as made it so the "VFS: Busy inodes after unmount" warning
could not find the inodes to warn about them.
Make all of the inodes in proc visible to generic_shutdown_super,
and fsnotify_sb_delete by using new_inode instead of new_inode_pseudo.
The only functional difference is that new_inode places the inodes
on the sb->s_inodes list.
I wrote a small test program and I can verify that without changes it
can trigger this issue, and by replacing new_inode_pseudo with
new_inode the issues goes away.
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/000000000000d788c905a7dfa3f4@google.com
Reported-by: syzbot+7d2debdcdb3cb93c1e5e@syzkaller.appspotmail.com
Fixes: 0097875bd4 ("proc: Implement /proc/thread-self to point at the directory of the current thread")
Fixes: 021ada7dff ("procfs: switch /proc/self away from proc_dir_entry")
Fixes: 51f0885e54 ("vfs,proc: guarantee unique inodes in /proc")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
In the ext4 filesystem with errors=panic, if one process is recording
errno in the superblock when invoking jbd2_journal_abort() due to some
error cases, it could be raced by another __ext4_abort() which is
setting the SB_RDONLY flag but missing panic because errno has not been
recorded.
jbd2_journal_commit_transaction()
jbd2_journal_abort()
journal->j_flags |= JBD2_ABORT;
jbd2_journal_update_sb_errno()
| ext4_journal_check_start()
| __ext4_abort()
| sb->s_flags |= SB_RDONLY;
| if (!JBD2_REC_ERR)
| return;
journal->j_flags |= JBD2_REC_ERR;
Finally, it will no longer trigger panic because the filesystem has
already been set read-only. Fix this by introduce j_abort_mutex to make
sure journal abort is completed before panic, and remove JBD2_REC_ERR
flag.
Fixes: 4327ba52af ("ext4, jbd2: ensure entering into panic after recording an error in superblock")
Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20200609073540.3810702-1-yi.zhang@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
idsfromsid was ignored in chown and chgrp causing it to fail
when upcalls were not configured for lookup. idsfromsid allows
mapping users when setting user or group ownership using
"special SID" (reserved for this). Add support for chmod and chgrp
when idsfromsid mount option is enabled.
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Currently idsfromsid mount option allows querying owner information from the
special sids used to represent POSIX uids and gids but needed changes to
populate the security descriptor context with the owner information when
idsfromsid mount option was used.
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Add dynamic tracepoints for new SMB3.1.1. posix extensions query info level (100)
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Adds calls to the newer info level for query info using SMB3.1.1 posix extensions.
The remaining two places that call the older query info (non-SMB3.1.1 POSIX)
require passing in the fid and can be updated in a later patch.
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Improve support for lookup when using SMB3.1.1 posix mounts.
Use new info level 100 (posix query info)
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Add worker function for non-compounded SMB3.1.1 POSIX Extensions query info.
This is needed for revalidate of root (cached) directory for example.
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Adds support for better query info on dentry revalidation (using
the SMB3.1.1 POSIX extensions level 100). Followon patch will
add support for translating the UID/GID from the SID and also
will add support for using the posix query info on lookup.
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Some of tests in xfstests failed with cifsd kernel server since commit
e80ddeb2f7. cifsd kernel server validates credit charge from client
by calculating it base on max((InputCount + OutputCount) and
(MaxInputResponse + MaxOutputResponse)) according to specification.
MS-SMB2 specification describe credit charge calculation of smb2 ioctl :
If Connection.SupportsMultiCredit is TRUE, the server MUST validate
CreditCharge based on the maximum of (InputCount + OutputCount) and
(MaxInputResponse + MaxOutputResponse), as specified in section 3.3.5.2.5.
If the validation fails, it MUST fail the IOCTL request with
STATUS_INVALID_PARAMETER.
This patch add indatalen that can be a non-zero value to calculation of
credit charge in SMB2_ioctl_init().
Fixes: e80ddeb2f7 ("smb3: fix incorrect number of credits when ioctl
MaxOutputResponse > 64K")
Cc: Stable <stable@vger.kernel.org>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Cc: Steve French <smfrench@gmail.com>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Pull updates from Andrew Morton:
"A few fixes and stragglers.
Subsystems affected by this patch series: mm/memory-failure, ocfs2,
lib/lzo, misc"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
amdgpu: a NULL ->mm does not mean a thread is a kthread
lib/lzo: fix ambiguous encoding bug in lzo-rle
ocfs2: fix build failure when TCP/IP is disabled
mm/memory-failure: send SIGBUS(BUS_MCEERR_AR) only to current thread
mm/memory-failure: prioritize prctl(PR_MCE_KILL) over vm.memory_failure_early_kill
After commit 12abc5ee78 ("tcp: add tcp_sock_set_nodelay") and commit
c488aeadcb ("tcp: add tcp_sock_set_user_timeout"), building the kernel
with OCFS2_FS=y but without INET=y causes it to fail with:
ld: fs/ocfs2/cluster/tcp.o: in function `o2net_accept_many':
tcp.c:(.text+0x21b1): undefined reference to `tcp_sock_set_nodelay'
ld: tcp.c:(.text+0x21c1): undefined reference to `tcp_sock_set_user_timeout'
ld: fs/ocfs2/cluster/tcp.o: in function `o2net_start_connect':
tcp.c:(.text+0x2633): undefined reference to `tcp_sock_set_nodelay'
ld: tcp.c:(.text+0x2643): undefined reference to `tcp_sock_set_user_timeout'
This is due to tcp_sock_set_nodelay() and tcp_sock_set_user_timeout()
being declared in linux/tcp.h and defined in net/ipv4/tcp.c, which
depend on TCP/IP being enabled.
To fix this, make OCFS2_FS depend on INET=y which already requires
NET=y.
Fixes: 12abc5ee78 ("tcp: add tcp_sock_set_nodelay")
Fixes: c488aeadcb ("tcp: add tcp_sock_set_user_timeout")
Signed-off-by: Tom Seewald <tseewald@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Link: http://lkml.kernel.org/r/20200606190827.23954-1-tseewald@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl7iocEQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpj96EACRUW8F6Y9qibPIIYGOAdpW5vf6hdW88oan
hkxOr2+y+9Odyn3WAnQtuMvmIAyOnIpVB1PiGtiXY1mmESWwbFZuxo6m1u4PiqZF
rmvThcrx/o7T1hPzPJt2dUZmR6qBY2rbkGaruD14bcn36DW6fkAicZmsl7UluKTm
pKE2wsxKsjGkcvElYsLYZbVm/xGe+UldaSpNFSp8b+yCAaH6eJLfhjeVC4Db7Yzn
v3Liz012Xed3nmHktgXrihK8vQ1P7zOFaISJlaJ9yRK4z3VAF7wTgvZUjeYGP5FS
GnUW/2p7UOsi5QkX9w2ZwPf/d0aSLZ/Va/5PjZRzAjNORMY5sjPtsfzqdKCohOhq
q8qanyU1pOXRKf1cOEzU40hS81ZDRmoQRTCym6vgwHZrmVtcNnL/Af9soGrWIA8m
+U6S2fpfuxeNP017HSzLHWtCGEOGYvhEc1D70mNBSIB8lElNvNVI6hWZOmxWkbKn
w3O2JIfh9bl9Pk2espwZykJmzehYECP/H8wyhTlF3vBWieFF4uRucBgsmFgQmhvg
NWQ7Iea49zOBt3IV3+LIRS2ulpXe7uu4WJYMa6da5o0a11ayNkngrh5QnBSSJ2rR
HRUKZ9RA99A5edqyxEujDW2QABycNiYdo8ua2gYEFBvRNc9ff1l2CqWAk0n66uxE
4vj4jmVJHg==
=evRQ
-----END PGP SIGNATURE-----
Merge tag 'io_uring-5.8-2020-06-11' of git://git.kernel.dk/linux-block
Pull io_uring fixes from Jens Axboe:
"A few late stragglers in here. In particular:
- Validate full range for provided buffers (Bijan)
- Fix bad use of kfree() in buffer registration failure (Denis)
- Don't allow close of ring itself, it's not fully safe. Making it
fully safe would require making the system call more expensive,
which isn't worth it.
- Buffer selection fix
- Regression fix for O_NONBLOCK retry
- Make IORING_OP_ACCEPT honor O_NONBLOCK (Jiufei)
- Restrict opcode handling for SQ/IOPOLL (Pavel)
- io-wq work handling cleanups and improvements (Pavel, Xiaoguang)
- IOPOLL race fix (Xiaoguang)"
* tag 'io_uring-5.8-2020-06-11' of git://git.kernel.dk/linux-block:
io_uring: fix io_kiocb.flags modification race in IOPOLL mode
io_uring: check file O_NONBLOCK state for accept
io_uring: avoid unnecessary io_wq_work copy for fast poll feature
io_uring: avoid whole io_wq_work copy for requests completed inline
io_uring: allow O_NONBLOCK async retry
io_wq: add per-wq work handler instead of per work
io_uring: don't arm a timeout through work.func
io_uring: remove custom ->func handlers
io_uring: don't derive close state from ->func
io_uring: use kvfree() in io_sqe_buffer_register()
io_uring: validate the full range of provided buffers for access
io_uring: re-set iov base/len for buffer select retry
io_uring: move send/recv IOPOLL check into prep
io_uring: deduplicate io_openat{,2}_prep()
io_uring: do build_open_how() only once
io_uring: fix {SQ,IO}POLL with unsupported opcodes
io_uring: disallow close of ring itself
Fix afs_store_data() so that it sets the mtime in the new operation
descriptor otherwise the mtime on the server gets set to 0 when a write is
stored to the server.
Fixes: e49c7b2f6d ("afs: Build an abstraction around an "operation" concept")
Reported-by: Dave Botsch <botsch@cnf.cornell.edu>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge some more updates from Andrew Morton:
- various hotfixes and minor things
- hch's use_mm/unuse_mm clearnups
Subsystems affected by this patch series: mm/hugetlb, scripts, kcov,
lib, nilfs, checkpatch, lib, mm/debug, ocfs2, lib, misc.
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
kernel: set USER_DS in kthread_use_mm
kernel: better document the use_mm/unuse_mm API contract
kernel: move use_mm/unuse_mm to kthread.c
kernel: move use_mm/unuse_mm to kthread.c
stacktrace: cleanup inconsistent variable type
lib: test get_count_order/long in test_bitops.c
mm: add comments on pglist_data zones
ocfs2: fix spelling mistake and grammar
mm/debug_vm_pgtable: fix kernel crash by checking for THP support
lib: fix bitmap_parse() on 64-bit big endian archs
checkpatch: correct check for kernel parameters doc
nilfs2: fix null pointer dereference at nilfs_segctor_do_construct()
lib/lz4/lz4_decompress.c: document deliberate use of `&'
kcov: check kcov_softirq in kcov_remote_stop()
scripts/spelling: add a few more typos
khugepaged: selftests: fix timeout condition in wait_for_scan()
New features and improvements:
- Sunrpc receive buffer sizes only change when establishing a GSS credentials
- Add more sunrpc tracepoints
- Improve on tracepoints to capture internal NFS I/O errors
Other bugfixes and cleanups:
- Move a dprintk() to after a call to nfs_alloc_fattr()
- Fix off-by-one issues in rpc_ntop6
- Fix a few coccicheck warnings
- Use the correct SPDX license identifiers
- Fix rpc_call_done assignment for BIND_CONN_TO_SESSION
- Replace zero-length array with flexible array
- Remove duplicate headers
- Set invalid blocks after NFSv4 writes to update space_used attribute
- Fix direct WRITE throughput regression
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAl7ibyIACgkQ18tUv7Cl
QOsOHBAA1A1stYld0gOhKZtMqxRJi3fnJ5mgroLGtyVQe8uAjpD8Ib1oRleC4MJq
ifpYPozIhMZQCvDiGTAKJ8629OYiXGrN8D5nV6Y2tEGpu5wYv98MyZlU9Y8rVzCP
5vsIMUp5XH8y2wYO8k7fDPPxWNH9Ax89wz5OI16mZxgY/LDm4ojZq+pGbYnWZa4w
oK6Efa66z7yQkPV8oIWuvLe1zZYWGAPibBEwJbrvUWyfygB3owI36sc6nuiEQM+4
hD3h5UtVn8BnudUqvLLa21rnQROMFpgYf4Q/2A1UaNfyRAPoPXMztECBSEYXO0L4
saiMc5o/yTTBCC0ZjV1F+xuGQzMgSQ83KOdbr+a+upvBeFpBynJxccdvMTDEam+q
rl7Ypdc42CsTZ1aVWG/AoIk6GENzR0tXqNR6BcDjYG/yRWvnt/RIZlp6G67IbtRH
b9we+3MbI/lTBoCFGahkkBYO3elTNwilxH3pWcRi8ehNn0GPjlLqHePR17Tmq1tL
QycDlm7QB1m5xNsOOLaBoB4SyguPV0SBprZJ4yYU1B3KC3bGurZVK3+TSLXQrO9V
12RLDt4AOGr0TlctBIhNbkGp8xHY6Dg7HgbdjdrVq8Y9YCfg0C37789BnZA5nVxF
4L101lsTI0puymh+MwmhiyOvCldn30f+MjuWJSm17Id+eRIxYj4=
=a84h
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-5.8-1' of git://git.linux-nfs.org/projects/anna/linux-nfs
Pull NFS client updates from Anna Schumaker:
"New features and improvements:
- Sunrpc receive buffer sizes only change when establishing a GSS credentials
- Add more sunrpc tracepoints
- Improve on tracepoints to capture internal NFS I/O errors
Other bugfixes and cleanups:
- Move a dprintk() to after a call to nfs_alloc_fattr()
- Fix off-by-one issues in rpc_ntop6
- Fix a few coccicheck warnings
- Use the correct SPDX license identifiers
- Fix rpc_call_done assignment for BIND_CONN_TO_SESSION
- Replace zero-length array with flexible array
- Remove duplicate headers
- Set invalid blocks after NFSv4 writes to update space_used attribute
- Fix direct WRITE throughput regression"
* tag 'nfs-for-5.8-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (27 commits)
NFS: Fix direct WRITE throughput regression
SUNRPC: rpc_xprt lifetime events should record xprt->state
xprtrdma: Make xprt_rdma_slot_table_entries static
nfs: set invalid blocks after NFSv4 writes
NFS: remove redundant initialization of variable result
sunrpc: add missing newline when printing parameter 'auth_hashtable_size' by sysfs
NFS: Add a tracepoint in nfs_set_pgio_error()
NFS: Trace short NFS READs
NFS: nfs_xdr_status should record the procedure name
SUNRPC: Set SOFTCONN when destroying GSS contexts
SUNRPC: rpc_call_null_helper() should set RPC_TASK_SOFT
SUNRPC: rpc_call_null_helper() already sets RPC_TASK_NULLCREDS
SUNRPC: trace RPC client lifetime events
SUNRPC: Trace transport lifetime events
SUNRPC: Split the xdr_buf event class
SUNRPC: Add tracepoint to rpc_call_rpcerror()
SUNRPC: Update the RPC_SHOW_SOCKET() macro
SUNRPC: Update the rpc_show_task_flags() macro
SUNRPC: Trace GSS context lifetimes
SUNRPC: receive buffer size estimation values almost never change
...
- Teach XFS to ask the VFS to drop an inode if the administrator changes
the FS_XFLAG_DAX inode flag such that the S_DAX state would change.
This can result in files changing access modes without requiring an
unmount cycle.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAl7WezgACgkQ+H93GTRK
tOvGCg//QCKK9yEBE0sjR0hjeJwmvoxAtdCGyrq+h1wZ1OI+iZCy3K4D9+/NSl/Z
SajLB2bK2mJBBZmERJlRITnTwSFbHX5roMHK4NDTqRAusHdm92JxMWds5ZMXGuTD
HiJ3Luu4VWHBuraRB5JKxHWyDc57xSE0kC1KwPySQCNAuDv+YS42uZMAM6284T/J
hkE4odUztykN+tOZj/nY+FiRjhzLF93cghMmRlDlgmibyHGLMhrfEmaMj99eCA5O
PVepVA6Vk0IHWviAWS8vqX2LADQaP1U2RP4racBAdmk+Z6kqUV+KSotJU1O+6ey/
Zfnr4VytKxmxngXhR4wnQcu3sIZqDdZRF4mcngE/G1yJXKim5iwVX+D04BsvDOpo
OpND9+dOuh4ecbayfOGxp9lmBsPQBzI/qPpUcpbtsRd5xN5IyTL6MyNsSPJg2ZrG
5BaSUO+Hw6hmBcB5MLF36XBuj25QEITl/hxy66Ym2BiQT5im9a3JxRlj8OvoWEVI
1323WehvPSA1bl7m1mNQyE7/h7TjeGA6LiTplSxPqencKzXn93wcTDV09GOZ11No
UEb+yC97hzRepEq4hSTr7tu18RN04ryA28/Vtcm/YeeM2bQ5WpI6jTd3b7g55BE2
EuUGAUFLZRKsvSnLhg/GbuaW442osAKuNttIL1NEyPs0Iw76Ero=
=mKZf
-----END PGP SIGNATURE-----
Merge tag 'vfs-5.8-merge-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull DAX updates part three from Darrick Wong:
"Now that the xfs changes have landed, this third piece changes the
FS_XFLAG_DAX ioctl code in xfs to request that the inode be reloaded
after the last program closes the file, if doing so would make a S_DAX
change happen. The goal here is to make dax access mode switching
quicker when possible.
Summary:
- Teach XFS to ask the VFS to drop an inode if the administrator
changes the FS_XFLAG_DAX inode flag such that the S_DAX state would
change. This can result in files changing access modes without
requiring an unmount cycle"
* tag 'vfs-5.8-merge-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
fs/xfs: Update xfs_ioctl_setattr_dax_invalidate()
fs/xfs: Combine xfs_diflags_to_linux() and xfs_diflags_to_iflags()
fs/xfs: Create function xfs_inode_should_enable_dax()
fs/xfs: Make DAX mount option a tri-state
fs/xfs: Change XFS_MOUNT_DAX to XFS_MOUNT_DAX_ALWAYS
fs/xfs: Remove unnecessary initialization of i_rwsem
I measured a 50% throughput regression for large direct writes.
The observed on-the-wire behavior is that the client sends every
NFS WRITE twice: once as an UNSTABLE WRITE plus a COMMIT, and once
as a FILE_SYNC WRITE.
This is because the nfs_write_match_verf() check in
nfs_direct_commit_complete() fails for every WRITE.
Buffered writes use nfs_write_completion(), which sets req->wb_verf
correctly. Direct writes use nfs_direct_write_completion(), which
does not set req->wb_verf at all. This leaves req->wb_verf set to
all zeroes for every direct WRITE, and thus
nfs_direct_commit_completion() always sets NFS_ODIRECT_RESCHED_WRITES.
This fix appears to restore nearly all of the lost performance.
Fixes: 1f28476dcb ("NFS: Fix O_DIRECT commit verifier handling")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Use the following command to test nfsv4(size of file1M is 1MB):
mount -t nfs -o vers=4.0,actimeo=60 127.0.0.1/dir1 /mnt
cp file1M /mnt
du -h /mnt/file1M -->0 within 60s, then 1M
When write is done(cp file1M /mnt), will call this:
nfs_writeback_done
nfs4_write_done
nfs4_write_done_cb
nfs_writeback_update_inode
nfs_post_op_update_inode_force_wcc_locked(change, ctime, mtime
nfs_post_op_update_inode_force_wcc_locked
nfs_set_cache_invalid
nfs_refresh_inode_locked
nfs_update_inode
nfsd write response contains change, ctime, mtime, the flag will be
clear after nfs_update_inode. Howerver, write response does not contain
space_used, previous open response contains space_used whose value is 0,
so inode->i_blocks is still 0.
nfs_getattr -->called by "du -h"
do_update |= force_sync || nfs_attribute_cache_expired -->false in 60s
cache_validity = READ_ONCE(NFS_I(inode)->cache_validity)
do_update |= cache_validity & (NFS_INO_INVALID_ATTR -->false
if (do_update) {
__nfs_revalidate_inode
}
Within 60s, does not send getattr request to nfsd, thus "du -h /mnt/file1M"
is 0.
Add a NFS_INO_INVALID_BLOCKS flag, set it when nfsv4 write is done.
Fixes: 16e1437517 ("NFS: More fine grained attribute tracking")
Signed-off-by: Zheng Bin <zhengbin13@huawei.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
The variable result is being initialized with a value that is never read
and it is being updated later with a new value. The initialization is
redundant and can be removed.
Addresses-Coverity: ("Unused value")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
A short read can generate an -EIO error without there being an error
on the wire. This tracepoint acts as an eyecatcher when there is no
obvious I/O error.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
When sunrpc trace points are not enabled, the recorded task ID
information alone is not helpful.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
- Keep nfsd clients from unnecessarily breaking their own delegations:
Note this requires a small kthreadd addition, discussed at:
https://lore.kernel.org/r/1588348912-24781-1-git-send-email-bfields@redhat.com
The result is Tejun Heo's suggestion, and he was OK with this going
through my tree.
- Patch nfsd/clients/ to display filenames, and to fix byte-order when
displaying stateid's.
- fix a module loading/unloading bug, from Neil Brown.
- A big series from Chuck Lever with RPC/RDMA and tracing improvements,
and lay some groundwork for RPC-over-TLS.
Note Stephen Rothwell spotted two conflicts in linux-next. Both should
be straightforward:
include/trace/events/sunrpc.h
https://lore.kernel.org/r/20200529105917.50dfc40f@canb.auug.org.au
net/sunrpc/svcsock.c
https://lore.kernel.org/r/20200529131955.26c421db@canb.auug.org.au
-----BEGIN PGP SIGNATURE-----
iQJJBAABCAAzFiEEYtFWavXG9hZotryuJ5vNeUKO4b4FAl7iRYwVHGJmaWVsZHNA
ZmllbGRzZXMub3JnAAoJECebzXlCjuG+yx8QALIfyz/ziPgjGBnNJGCW8BjWHz7+
rGI+1SP2EUpgJ0fGJc9MpGyYTa5T3pTgsENnIRtegyZDISg2OQ5GfifpkTz4U7vg
QbWRihs/W9EhltVYhKvtLASAuSAJ8ETbDfLXVb2ncY7iO6JNvb22xwsgKZILmzm1
uG4qSszmBZzpMUUy51kKJYJZ3ysP+v14qOnyOXEoeEMuJYNK9FkQ9bSPZ6wTJNOn
hvZBMbU7LzRyVIvp358mFHY+vwq5qBNkJfVrZBkURGn4OxWPbWDXzqOi0Zs1oBjA
L+QODIbTLGkopu/rD0r1b872PDtket7p5zsD8MreeI1vJOlt3xwqdCGlicIeNATI
b0RG7sqh+pNv0mvwLxSNTf3rO0EKW6tUySqCnQZUAXFGRH0nYM2TWze4HUr2zfWT
EgRMwxHY/AZUStZBuCIHPJ6inWnKuxSUELMf2a9JHO1BJc/yClRgmwJGdthVwb9u
GP6F3/maFu+9YOO6iROMsqtxDA+q5vch5IBzevNOOBDEQDKqENmogR/knl9DmAhF
sr+FOa3O0u6S4tgXw/TU97JS/h1L2Hu6QVEwU2iVzWtlUUOFVMZQODJTB6Lts4Ka
gKzYXWvCHN+LyETsN6q7uHFg9mtO7xO5vrrIgo72SuVCscDw/8iHkoOOFLief+GE
O0fR0IYjW8U1Rkn2
=YEf0
-----END PGP SIGNATURE-----
Merge tag 'nfsd-5.8' of git://linux-nfs.org/~bfields/linux
Pull nfsd updates from Bruce Fields:
"Highlights:
- Keep nfsd clients from unnecessarily breaking their own
delegations.
Note this requires a small kthreadd addition. The result is Tejun
Heo's suggestion (see link), and he was OK with this going through
my tree.
- Patch nfsd/clients/ to display filenames, and to fix byte-order
when displaying stateid's.
- fix a module loading/unloading bug, from Neil Brown.
- A big series from Chuck Lever with RPC/RDMA and tracing
improvements, and lay some groundwork for RPC-over-TLS"
Link: https://lore.kernel.org/r/1588348912-24781-1-git-send-email-bfields@redhat.com
* tag 'nfsd-5.8' of git://linux-nfs.org/~bfields/linux: (49 commits)
sunrpc: use kmemdup_nul() in gssp_stringify()
nfsd: safer handling of corrupted c_type
nfsd4: make drc_slab global, not per-net
SUNRPC: Remove unreachable error condition in rpcb_getport_async()
nfsd: Fix svc_xprt refcnt leak when setup callback client failed
sunrpc: clean up properly in gss_mech_unregister()
sunrpc: svcauth_gss_register_pseudoflavor must reject duplicate registrations.
sunrpc: check that domain table is empty at module unload.
NFSD: Fix improperly-formatted Doxygen comments
NFSD: Squash an annoying compiler warning
SUNRPC: Clean up request deferral tracepoints
NFSD: Add tracepoints for monitoring NFSD callbacks
NFSD: Add tracepoints to the NFSD state management code
NFSD: Add tracepoints to NFSD's duplicate reply cache
SUNRPC: svc_show_status() macro should have enum definitions
SUNRPC: Restructure svc_udp_recvfrom()
SUNRPC: Refactor svc_recvfrom()
SUNRPC: Clean up svc_release_skb() functions
SUNRPC: Refactor recvfrom path dealing with incomplete TCP receives
SUNRPC: Replace dprintk() call sites in TCP receive path
...
While testing io_uring in arm, we found sometimes io_sq_thread() keeps
polling io requests even though there are not inflight io requests in
block layer. After some investigations, found a possible race about
io_kiocb.flags, see below race codes:
1) in the end of io_write() or io_read()
req->flags &= ~REQ_F_NEED_CLEANUP;
kfree(iovec);
return ret;
2) in io_complete_rw_iopoll()
if (res != -EAGAIN)
req->flags |= REQ_F_IOPOLL_COMPLETED;
In IOPOLL mode, io requests still maybe completed by interrupt, then
above codes are not safe, concurrent modifications to req->flags, which
is not protected by lock or is not atomic modifications. I also had
disassemble io_complete_rw_iopoll() in arm:
req->flags |= REQ_F_IOPOLL_COMPLETED;
0xffff000008387b18 <+76>: ldr w0, [x19,#104]
0xffff000008387b1c <+80>: orr w0, w0, #0x1000
0xffff000008387b20 <+84>: str w0, [x19,#104]
Seems that the "req->flags |= REQ_F_IOPOLL_COMPLETED;" is load and
modification, two instructions, which obviously is not atomic.
To fix this issue, add a new iopoll_completed in io_kiocb to indicate
whether io request is completed.
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If the dentry name passed to ->d_compare() fits in dentry::d_iname, then
it may be concurrently modified by a rename. This can cause undefined
behavior (possibly out-of-bounds memory accesses or crashes) in
utf8_strncasecmp(), since fs/unicode/ isn't written to handle strings
that may be concurrently modified.
Fix this by first copying the filename to a stack buffer if needed.
This way we get a stable snapshot of the filename.
Fixes: b886ee3e77 ("ext4: Support case-insensitive file name lookups")
Cc: <stable@vger.kernel.org> # v5.2+
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Daniel Rosenberg <drosen@google.com>
Cc: Gabriel Krisman Bertazi <krisman@collabora.co.uk>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/20200601200543.59417-1-ebiggers@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Now the errcode from ext4_commit_super will overwrite EROFS exists in
ext4_setup_super. Actually, no need to call ext4_commit_super since we
will return EROFS. Fix it by goto done directly.
Fixes: c89128a008 ("ext4: handle errors on ext4_commit_super")
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20200601073404.3712492-1-yangerkun@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Fix the bug when calculating the physical block number of the first
block in the split extent.
This bug will cause xfstests shared/298 failure on ext4 with bigalloc
enabled occasionally. Ext4 error messages indicate that previously freed
blocks are being freed again, and the following fsck will fail due to
the inconsistency of block bitmap and bg descriptor.
The following is an example case:
1. First, Initialize a ext4 filesystem with cluster size '16K', block size
'4K', in which case, one cluster contains four blocks.
2. Create one file (e.g., xxx.img) on this ext4 filesystem. Now the extent
tree of this file is like:
...
36864:[0]4:220160
36868:[0]14332:145408
51200:[0]2:231424
...
3. Then execute PUNCH_HOLE fallocate on this file. The hole range is
like:
..
ext4_ext_remove_space: dev 254,16 ino 12 since 49506 end 49506 depth 1
ext4_ext_remove_space: dev 254,16 ino 12 since 49544 end 49546 depth 1
ext4_ext_remove_space: dev 254,16 ino 12 since 49605 end 49607 depth 1
...
4. Then the extent tree of this file after punching is like
...
49507:[0]37:158047
49547:[0]58:158087
...
5. Detailed procedure of punching hole [49544, 49546]
5.1. The block address space:
```
lblk ~49505 49506 49507~49543 49544~49546 49547~
---------+------+-------------+----------------+--------
extent | hole | extent | hole | extent
---------+------+-------------+----------------+--------
pblk ~158045 158046 158047~158083 158084~158086 158087~
```
5.2. The detailed layout of cluster 39521:
```
cluster 39521
<------------------------------->
hole extent
<----------------------><--------
lblk 49544 49545 49546 49547
+-------+-------+-------+-------+
| | | | |
+-------+-------+-------+-------+
pblk 158084 1580845 158086 158087
```
5.3. The ftrace output when punching hole [49544, 49546]:
- ext4_ext_remove_space (start 49544, end 49546)
- ext4_ext_rm_leaf (start 49544, end 49546, last_extent [49507(158047), 40], partial [pclu 39522 lblk 0 state 2])
- ext4_remove_blocks (extent [49507(158047), 40], from 49544 to 49546, partial [pclu 39522 lblk 0 state 2]
- ext4_free_blocks: (block 158084 count 4)
- ext4_mballoc_free (extent 1/6753/1)
5.4. Ext4 error message in dmesg:
EXT4-fs error (device vdb): mb_free_blocks:1457: group 1, block 158084:freeing already freed block (bit 6753); block bitmap corrupt.
EXT4-fs error (device vdb): ext4_mb_generate_buddy:747: group 1, block bitmap and bg descriptor inconsistent: 19550 vs 19551 free clusters
In this case, the whole cluster 39521 is freed mistakenly when freeing
pblock 158084~158086 (i.e., the first three blocks of this cluster),
although pblock 158087 (the last remaining block of this cluster) has
not been freed yet.
The root cause of this isuue is that, the pclu of the partial cluster is
calculated mistakenly in ext4_ext_remove_space(). The correct
partial_cluster.pclu (i.e., the cluster number of the first block in the
next extent, that is, lblock 49597 (pblock 158086)) should be 39521 rather
than 39522.
Fixes: f4226d9ea4 ("ext4: fix partial cluster initialization")
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Eric Whitney <enwlinux@gmail.com>
Cc: stable@kernel.org # v3.19+
Link: https://lore.kernel.org/r/1590121124-37096-1-git-send-email-jefflexu@linux.alibaba.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Trying to change dax mount options when remounting could allow mount
options to be enabled for a small amount of time, and then the mount
option change would be reverted.
In the case of "mount -o remount,dax", this can cause a race where
files would temporarily treated as DAX --- and then not.
Cc: stable@kernel.org
Reported-by: syzbot+bca9799bf129256190da@syzkaller.appspotmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
This adds the same per-file/per-directory DAX support for ext4 as was
done for xfs, now that we finally have consensus over what the
interface should be.
Some architectures like arm64 and s390 require USER_DS to be set for
kernel threads to access user address space, which is the whole purpose of
kthread_use_mm, but other like x86 don't. That has lead to a huge mess
where some callers are fixed up once they are tested on said
architectures, while others linger around and yet other like io_uring try
to do "clever" optimizations for what usually is just a trivial asignment
to a member in the thread_struct for most architectures.
Make kthread_use_mm set USER_DS, and kthread_unuse_mm restore to the
previous value instead.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Felipe Balbi <balbi@kernel.org>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Zhenyu Wang <zhenyuw@linux.intel.com>
Cc: Zhi Wang <zhi.a.wang@intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: http://lkml.kernel.org/r/20200404094101.672954-7-hch@lst.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Switch the function documentation to kerneldoc comments, and add
WARN_ON_ONCE asserts that the calling thread is a kernel thread and does
not have ->mm set (or has ->mm set in the case of unuse_mm).
Also give the functions a kthread_ prefix to better document the use case.
[hch@lst.de: fix a comment typo, cover the newly merged use_mm/unuse_mm caller in vfio]
Link: http://lkml.kernel.org/r/20200416053158.586887-3-hch@lst.de
[sfr@canb.auug.org.au: powerpc/vas: fix up for {un}use_mm() rename]
Link: http://lkml.kernel.org/r/20200422163935.5aa93ba5@canb.auug.org.au
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> [usb]
Acked-by: Haren Myneni <haren@linux.ibm.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Felipe Balbi <balbi@kernel.org>
Cc: Jason Wang <jasowang@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Zhenyu Wang <zhenyuw@linux.intel.com>
Cc: Zhi Wang <zhi.a.wang@intel.com>
Link: http://lkml.kernel.org/r/20200404094101.672954-6-hch@lst.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "improve use_mm / unuse_mm", v2.
This series improves the use_mm / unuse_mm interface by better documenting
the assumptions, and my taking the set_fs manipulations spread over the
callers into the core API.
This patch (of 3):
Use the proper API instead.
Link: http://lkml.kernel.org/r/20200404094101.672954-1-hch@lst.de
These helpers are only for use with kernel threads, and I will tie them
more into the kthread infrastructure going forward. Also move the
prototypes to kthread.h - mmu_context.h was a little weird to start with
as it otherwise contains very low-level MM bits.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Felipe Balbi <balbi@kernel.org>
Cc: Jason Wang <jasowang@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Zhenyu Wang <zhenyuw@linux.intel.com>
Cc: Zhi Wang <zhi.a.wang@intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: http://lkml.kernel.org/r/20200404094101.672954-1-hch@lst.de
Link: http://lkml.kernel.org/r/20200416053158.586887-1-hch@lst.de
Link: http://lkml.kernel.org/r/20200404094101.672954-5-hch@lst.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
After commit c3aab9a0bd ("mm/filemap.c: don't initiate writeback if
mapping has no dirty pages"), the following null pointer dereference has
been reported on nilfs2:
BUG: kernel NULL pointer dereference, address: 00000000000000a8
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: 0000 [#1] SMP PTI
...
RIP: 0010:percpu_counter_add_batch+0xa/0x60
...
Call Trace:
__test_set_page_writeback+0x2d3/0x330
nilfs_segctor_do_construct+0x10d3/0x2110 [nilfs2]
nilfs_segctor_construct+0x168/0x260 [nilfs2]
nilfs_segctor_thread+0x127/0x3b0 [nilfs2]
kthread+0xf8/0x130
...
This crash turned out to be caused by set_page_writeback() call for
segment summary buffers at nilfs_segctor_prepare_write().
set_page_writeback() can call inc_wb_stat(inode_to_wb(inode),
WB_WRITEBACK) where inode_to_wb(inode) is NULL if the inode of
underlying block device does not have an associated wb.
This fixes the issue by calling inode_attach_wb() in advance to ensure
to associate the bdev inode with its wb.
Fixes: c3aab9a0bd ("mm/filemap.c: don't initiate writeback if mapping has no dirty pages")
Reported-by: Walton Hoops <me@waltonhoops.com>
Reported-by: Tomas Hlavaty <tom@logand.com>
Reported-by: ARAI Shun-ichi <hermes@ceres.dti.ne.jp>
Reported-by: Hideki EIRAKU <hdk1983@gmail.com>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Cc: <stable@vger.kernel.org> [5.4+]
Link: http://lkml.kernel.org/r/20200608.011819.1399059588922299158.konishi.ryusuke@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull epoll update from Al Viro:
"epoll conversion to read_iter from Jens; I thought there might be more
epoll stuff this cycle, but uaccess took too much time"
* 'work.epoll' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
eventfd: convert to f_op->read_iter()
If the socket is O_NONBLOCK, we should complete the accept request
with -EAGAIN when data is not ready.
Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Basically IORING_OP_POLL_ADD command and async armed poll handlers
for regular commands don't touch io_wq_work, so only REQ_F_WORK_INITIALIZED
is set, can we do io_wq_work copy and restore.
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If requests can be submitted and completed inline, we don't need to
initialize whole io_wq_work in io_init_req(), which is an expensive
operation, add a new 'REQ_F_WORK_INITIALIZED' to determine whether
io_wq_work is initialized and add a helper io_req_init_async(), users
must call io_req_init_async() for the first time touching any members
of io_wq_work.
I use /dev/nullb0 to evaluate performance improvement in my physical
machine:
modprobe null_blk nr_devices=1 completion_nsec=0
sudo taskset -c 60 fio -name=fiotest -filename=/dev/nullb0 -iodepth=128
-thread -rw=read -ioengine=io_uring -direct=1 -bs=4k -size=100G -numjobs=1
-time_based -runtime=120
before this patch:
Run status group 0 (all jobs):
READ: bw=724MiB/s (759MB/s), 724MiB/s-724MiB/s (759MB/s-759MB/s),
io=84.8GiB (91.1GB), run=120001-120001msec
With this patch:
Run status group 0 (all jobs):
READ: bw=761MiB/s (798MB/s), 761MiB/s-761MiB/s (798MB/s-798MB/s),
io=89.2GiB (95.8GB), run=120001-120001msec
About 5% improvement.
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull vfs fixes from Al Viro:
"A couple of trivial patches that fell through the cracks last cycle"
* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fs: fix indentation in deactivate_super()
vfs: Remove duplicated d_mountpoint check in __is_local_mountpoint
Pull sysctl fixes from Al Viro:
"Fixups to regressions in sysctl series"
* 'work.sysctl' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
sysctl: reject gigantic reads/write to sysctl files
cdrom: fix an incorrect __user annotation on cdrom_sysctl_info
trace: fix an incorrect __user annotation on stack_trace_sysctl
random: fix an incorrect __user annotation on proc_do_entropy
net/sysctl: remove leftover __user annotations on neigh_proc_dointvec*
net/sysctl: use cpumask_parse in flow_limit_cpu_sysctl
Pull misc uaccess updates from Al Viro:
"Assorted uaccess patches for this cycle - the stuff that didn't fit
into thematic series"
* 'uaccess.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
bpf: make bpf_check_uarg_tail_zero() use check_zeroed_user()
x86: kvm_hv_set_msr(): use __put_user() instead of 32bit __clear_user()
user_regset_copyout_zero(): use clear_user()
TEST_ACCESS_OK _never_ had been checked anywhere
x86: switch cp_stat64() to unsafe_put_user()
binfmt_flat: don't use __put_user()
binfmt_elf_fdpic: don't use __... uaccess primitives
binfmt_elf: don't bother with __{put,copy_to}_user()
pselect6() and friends: take handling the combined 6th/7th args into helper
Pull proc fix from Eric Biederman:
"Syzbot found a NULL pointer dereference if kzalloc of s_fs_info fails"
* 'proc-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
proc: s_fs_info may be NULL when proc_kill_sb is called
syzbot found that proc_fill_super() fails before filling up sb->s_fs_info,
deactivate_locked_super() will be called and sb->s_fs_info will be NULL.
The proc_kill_sb() does not expect fs_info to be NULL which is wrong.
Link: https://lore.kernel.org/lkml/0000000000002d7ca605a7b8b1c5@google.com
Reported-by: syzbot+4abac52934a48af5ff19@syzkaller.appspotmail.com
Fixes: fa10fed30f ("proc: allow to mount many instances of proc in one pid namespace")
Signed-off-by: Alexey Gladkov <gladkov.alexey@gmail.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Instead of triggering a WARN_ON deep down in the page allocator just
give up early on allocations that are way larger than the usual sysctl
values.
Fixes: 32927393dc ("sysctl: pass kernel pointers to ->proc_handler")
Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Missing the final 's' in "max_channels" mount option when displayed in
/proc/mounts (or by mount command)
CC: Stable <stable@vger.kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Shyam Prasad N <nspmangalore@gmail.com>
We can assume that O_NONBLOCK is always honored, even if we don't
have a ->read/write_iter() for the file type. Also unify the read/write
checking for allowing async punt, having the write side factoring in the
REQ_F_NOWAIT flag as well.
Cc: stable@vger.kernel.org
Fixes: 490e89676a ("io_uring: only force async punt if poll based retry can't handle it")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCXt/0GAAKCRDh3BK/laaZ
PIJjAP48TurDqomsQMBLiOsSUy0YIhd5QC/G5MYLKSBojXoR+gD+KfqXhVIDz0En
OI+K4674cNhf4CXNzUedU3qSOaJLfAU=
=PqbB
-----END PGP SIGNATURE-----
Merge tag 'fuse-update-5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
Pull fuse updates from Miklos Szeredi:
- Fix a rare deadlock in virtiofs
- Fix st_blocks in writeback cache mode
- Fix wrong checks in splice move causing spurious warnings
- Fix a race between a GETATTR request and a FUSE_NOTIFY_INVAL_INODE
notification
- Use rb-tree instead of linear search for pages currently under
writeout by userspace
- Fix copy_file_range() inconsistencies
* tag 'fuse-update-5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
fuse: copy_file_range should truncate cache
fuse: fix copy_file_range cache issues
fuse: optimize writepages search
fuse: update attr_version counter on fuse_notify_inval_inode()
fuse: don't check refcount after stealing page
fuse: fix weird page warning
fuse: use dump_page
virtiofs: do not use fuse_fill_super_common() for device installation
fuse: always allow query of st_dev
fuse: always flush dirty data on close(2)
fuse: invalidate inode attr in writeback cache mode
fuse: Update stale comment in queue_interrupt()
fuse: BUG_ON correction in fuse_dev_splice_write()
virtiofs: Add mount option and atime behavior to the doc
virtiofs: schedule blocking async replies in separate worker
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCXt9klAAKCRDh3BK/laaZ
PBeeAP9GRI0yajPzBzz2ZK9KkDc6A7wPiaAec+86Q+c02VncVwEAvq5Pi4um5RTZ
7SVv56ggKO3Cqx779zVyZTRYDs3+YA4=
=bpKI
-----END PGP SIGNATURE-----
Merge tag 'ovl-update-5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs
Pull overlayfs updates from Miklos Szeredi:
"Fixes:
- Resolve mount option conflicts consistently
- Sync before remount R/O
- Fix file handle encoding corner cases
- Fix metacopy related issues
- Fix an unintialized return value
- Add missing permission checks for underlying layers
Optimizations:
- Allow multipe whiteouts to share an inode
- Optimize small writes by inheriting SB_NOSEC from upper layer
- Do not call ->syncfs() multiple times for sync(2)
- Do not cache negative lookups on upper layer
- Make private internal mounts longterm"
* tag 'ovl-update-5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs: (27 commits)
ovl: remove unnecessary lock check
ovl: make oip->index bool
ovl: only pass ->ki_flags to ovl_iocb_to_rwf()
ovl: make private mounts longterm
ovl: get rid of redundant members in struct ovl_fs
ovl: add accessor for ofs->upper_mnt
ovl: initialize error in ovl_copy_xattr
ovl: drop negative dentry in upper layer
ovl: check permission to open real file
ovl: call secutiry hook in ovl_real_ioctl()
ovl: verify permissions in ovl_path_open()
ovl: switch to mounter creds in readdir
ovl: pass correct flags for opening real directory
ovl: fix redirect traversal on metacopy dentries
ovl: initialize OVL_UPPERDATA in ovl_lookup()
ovl: use only uppermetacopy state in ovl_lookup()
ovl: simplify setting of origin for index lookup
ovl: fix out of bounds access warning in ovl_check_fb_len()
ovl: return required buffer size for file handles
ovl: sync dirty data when remounting to ro mode
...
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEqG5UsNXhtOCrfGQP+7dXa6fLC2sFAl7fxCMACgkQ+7dXa6fL
C2tUNw//VdGTqw3SstdpCqlIQptfXjxIJqFBp5QsQ95b2yHEFcmaeqLP9SWOMPZ9
588xBMj4K9iN9WZgJdJwTGj5D1XmRwiISYnDh1pzNUPH2IrnlcTXfpsZ+BSWst/P
XMJ1N3ZtzNymYTi4wzxcV/SjMJd4eX75jBiJn9rgp2PzVnaCUl+a21sLvryON2Eu
YOpg08IlpRYLMsuHiISrqqgy7/iUzo34RXWbhgJV69zG07xoHtBwGf/iwetbh56G
lmKp3xix1Fq181HyQLOn4/QkXTIFR1kyKFDdI3HxdRXsBybSZUk9IGaf74o09c7M
PTS1FsF0lnGz+61Y5mb1UzPRlJSx8CAeCBSN6KWlbm/g9WPXHuxJFwT/KVPV5T13
qEYn1U1RL8yMlbjezMDM5TXSM4IhBRxn9hQ3qbhamiDI0AT8kr55lJKbxqTO8Cvv
6ZjYfPtX0D5Z9kce4E4nWFiRWb8xRsANqWdkVsIcxp1yyZvGkHot5LaCedagF+0b
71Zs0A5YMA4VaaL7sn1xMt/tjZ2ei+5a+cLbfEOoCniE6XqpLm5ZR6WwdbzwwUbV
hzUn3ByQ5eqC5tkwehNufDLFt1HOzMbnTG9eCxbpUXAH1Q8CDwuROUMf4FGjR97K
eeYHcEqLhEKgKRCmGYT00K/q1Ce39rcgGZLygdbbwaigEUCjLvU=
=rH8G
-----END PGP SIGNATURE-----
Merge tag 'afs-fixes-20200609' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
Pull AFS fixes from David Howells:
"A set of small patches to fix some things, most of them minor.
- Fix a memory leak in afs_put_sysnames()
- Fix an oops in AFS file locking
- Fix new use of BUG()
- Fix debugging statements containing %px
- Remove afs_zero_fid as it's unused
- Make afs_zap_data() static"
* tag 'afs-fixes-20200609' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
afs: Make afs_zap_data() static
afs: Remove afs_zero_fid as it's not used
afs: Fix debugging statements with %px to be %p
afs: Fix use of BUG()
afs: Fix file locking
afs: Fix memory leak in afs_put_sysnames()
In this round, we've added some knobs to enhance compression feature and harden
testing environment. In addition, we've fixed several bugs reported from Android
devices such as long discarding latency, device hanging during quota_sync, etc.
Enhancement:
- support lzo-rle algorithm
- add two ioctls to release and reserve blocks for compression
- support partial truncation/fiemap on compressed file
- introduce sysfs entries to attach IO flags explicitly
- add iostat trace point along with read io stat
Bug fix:
- fix long discard latency
- flush quota data by f2fs_quota_sync correctly
- fix to recover parent inode number for power-cut recovery
- fix lz4/zstd output buffer budget
- parse checkpoint mount option correctly
- avoid inifinite loop to wait for flushing node/meta pages
- manage discard space correctly
And some refactoring and clean up patches were added.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE00UqedjCtOrGVvQiQBSofoJIUNIFAl7fojgACgkQQBSofoJI
UNKaZQ//Rd6r7Z25SJkAoy+y/m6QDaKg4Ap1wR6+QmirR7HNxtpr3dXSVvmj4Xhu
ZDJ3LHmerFiwR/X4zFPud+PAoBe3gJa2k7GT8q0g4YkgLy0hfX9PXt0t3I9F8vlk
8m34j+hQaL9/3FBK4/PSG541vR/UUnwvu6t2pJMnz7rgnLej5I6yOIaoaihz7m+i
k0ofK5ckuTNcZReAZ2tCIehQku7tDOBLdS5KxvBZBgRh0i5iSXXIa4ddvaMJdT/M
WcjTZ6N8bFu0hCZ5hz9dyGGYo1XchQosLdLGhcEugsyxNp9Yuftyf5/Ie1wJNiEl
ZsoRc15X7wfRPKKMMyDFljzPBPFiHr78p30uJ34bcYCu0j0CYi+gbKQztmEMZ2dy
9M+sDG3jd5R7ACXrwS2ElSEDyLBnTaxbeSdCpErGjn/U19TLllbzhnMA9KR9elDI
pEWgRc7DPmPbRZaStXMxIamf7pbmUSm0akAYbzGFvMHcSx4MXuQFICGK9t/mhSDm
sO2b1Ir39yk65sVNdjFsnqDsi6jTPgrLSe3FY4eMhkn15OSiVGhcz7ddQMD7Fbuq
WLpHFqER650I28i0EXh8bxzjkrj+aJQKhGcVbmwVS33MtKVfBdh4GfQMvS6MbeOM
MsZ10E7Dr9ildKxqHP5SgLlggkl512lpj3+d6j0mUSSSUP2jtUw=
=MiEC
-----END PGP SIGNATURE-----
Merge tag 'f2fs-for-5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
Pull f2fs updates from Jaegeuk Kim:
"In this round, we've added some knobs to enhance compression feature
and harden testing environment. In addition, we've fixed several bugs
reported from Android devices such as long discarding latency, device
hanging during quota_sync, etc.
Enhancements:
- support lzo-rle algorithm
- add two ioctls to release and reserve blocks for compression
- support partial truncation/fiemap on compressed file
- introduce sysfs entries to attach IO flags explicitly
- add iostat trace point along with read io stat
Bug fixes:
- fix long discard latency
- flush quota data by f2fs_quota_sync correctly
- fix to recover parent inode number for power-cut recovery
- fix lz4/zstd output buffer budget
- parse checkpoint mount option correctly
- avoid inifinite loop to wait for flushing node/meta pages
- manage discard space correctly
And some refactoring and clean up patches were added"
* tag 'f2fs-for-5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (51 commits)
f2fs: attach IO flags to the missing cases
f2fs: add node_io_flag for bio flags likewise data_io_flag
f2fs: remove unused parameter of f2fs_put_rpages_mapping()
f2fs: handle readonly filesystem in f2fs_ioc_shutdown()
f2fs: avoid utf8_strncasecmp() with unstable name
f2fs: don't return vmalloc() memory from f2fs_kmalloc()
f2fs: fix retry logic in f2fs_write_cache_pages()
f2fs: fix wrong discard space
f2fs: compress: don't compress any datas after cp stop
f2fs: remove unneeded return value of __insert_discard_tree()
f2fs: fix wrong value of tracepoint parameter
f2fs: protect new segment allocation in expand_inode_data
f2fs: code cleanup by removing ifdef macro surrounding
f2fs: avoid inifinite loop to wait for flushing node pages at cp_error
f2fs: flush dirty meta pages when flushing them
f2fs: fix checkpoint=disable:%u%%
f2fs: compress: fix zstd data corruption
f2fs: add compressed/gc data read IO stat
f2fs: fix potential use-after-free issue
f2fs: compress: don't handle non-compressed data in workqueue
...
This reverts commit b75b7ca7c2.
The patch restores a helper that was not necessary after direct IO port
to iomap infrastructure, which gets reverted.
Signed-off-by: David Sterba <dsterba@suse.com>
This reverts commit 5f008163a5.
The patch is a simplification after direct IO port to iomap
infrastructure, which gets reverted.
Signed-off-by: David Sterba <dsterba@suse.com>
This reverts commit d8f3e73587.
The patch is a cleanup of direct IO port to iomap infrastructure,
which gets reverted.
Signed-off-by: David Sterba <dsterba@suse.com>
Fix a couple of %px to be %p in debugging statements.
Fixes: e49c7b2f6d ("afs: Build an abstraction around an "operation" concept")
Fixes: 8a070a9648 ("afs: Detect cell aliases 1 - Cells with root volumes")
Reported-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
This Kunit update for Linux 5.8-rc1 consists of:
- Several config fragment fixes from Anders Roxell to improve
test coverage.
- Improvements to kunit run script to use defconfig as default and
restructure the code for config/build/exec/parse from Vitor Massaru Iha
and David Gow.
- Miscellaneous documentation warn fix.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEPZKym/RZuOCGeA/kCwJExA0NQxwFAl7etrcACgkQCwJExA0N
QxzGYg/+KHpPhB31IAjNFKCRqwDooftst3dohhzguxJLpDHdEmVJ4moQhLr4gL+/
qpi3T9hr4Rx++n/A5NoxDvyJvGr+FAL40U+Of7F2UyHpqQmfKPj37I+yvyeR1JEL
z4+yXEpfQLZaQkmZ7f3GWHyqN3+xwvyTEy7NYUad7xMxLF/99No+I6RMD6yp3srS
wUUeuBIesSFT0LXYrgI+wgsNGUESlj/McjiP5eMj6UtlMgKpzmfzH56Fia8uw1pw
6QtpntxDHjtxVfp8YKM4qExI54YI2t6sgHTIoOUsMWD5Q2kHd8kNf1L+lb1sKYUF
j7lzol5nuqqchAVQYjHzNHa8XKndvexGyWMsPz1gAnkpgVrvBTSFcavdDpDuDZ0T
HoJZnk9XPsguBQjDxapzPYfAQ81Un/rEmZQ8/X2TaNjdSIH1hHljhaP2OZ6eND/Q
iobq9x8nC9D95TIqjDbRw3Sp2na/pZLN8Gp27hmKlc+L1XzV8NuZe/WGOUe3lsrq
fG1ZSLo/iRau8gHuF6fRSrGIzQSCEMGKl3jlQ28OT9HGMAgTlncEwVzQId48/AsS
UOY+bIAnRZuK+B5F/vw6L3o1e3c17z5bruVlb0M0alP5b7P9/3WLNHsHA3r8haZF
F6PwIWu41wdRjJf2HI7zD5LaQe/7oU3jfwvuA7n2z8Py+zGx7m4=
=S+HY
-----END PGP SIGNATURE-----
Merge tag 'linux-kselftest-kunit-5.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
Pull Kunit updates from Shuah Khan:
"This consists of:
- Several config fragment fixes from Anders Roxell to improve test
coverage.
- Improvements to kunit run script to use defconfig as default and
restructure the code for config/build/exec/parse from Vitor Massaru
Iha and David Gow.
- Miscellaneous documentation warn fix"
* tag 'linux-kselftest-kunit-5.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
security: apparmor: default KUNIT_* fragments to KUNIT_ALL_TESTS
fs: ext4: default KUNIT_* fragments to KUNIT_ALL_TESTS
drivers: base: default KUNIT_* fragments to KUNIT_ALL_TESTS
lib: Kconfig.debug: default KUNIT_* fragments to KUNIT_ALL_TESTS
kunit: default KUNIT_* fragments to KUNIT_ALL_TESTS
kunit: Kconfig: enable a KUNIT_ALL_TESTS fragment
kunit: Fix TabError, remove defconfig code and handle when there is no kunitconfig
kunit: use KUnit defconfig by default
kunit: use --build_dir=.kunit as default
Documentation: test.h - fix warnings
kunit: kunit_tool: Separate out config/build/exec/parse
Add new APIs to assert that mmap_sem is held.
Using this instead of rwsem_is_locked and lockdep_assert_held[_write]
makes the assertions more tolerant of future changes to the lock type.
Signed-off-by: Michel Lespinasse <walken@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Laurent Dufour <ldufour@linux.ibm.com>
Cc: Liam Howlett <Liam.Howlett@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ying Han <yinghan@google.com>
Link: http://lkml.kernel.org/r/20200520052908.204642-10-walken@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Convert the last few remaining mmap_sem rwsem calls to use the new mmap
locking API. These were missed by coccinelle for some reason (I think
coccinelle does not support some of the preprocessor constructs in these
files ?)
[akpm@linux-foundation.org: convert linux-next leftovers]
[akpm@linux-foundation.org: more linux-next leftovers]
[akpm@linux-foundation.org: more linux-next leftovers]
Signed-off-by: Michel Lespinasse <walken@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Liam Howlett <Liam.Howlett@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ying Han <yinghan@google.com>
Link: http://lkml.kernel.org/r/20200520052908.204642-6-walken@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "mm: consolidate definitions of page table accessors", v2.
The low level page table accessors (pXY_index(), pXY_offset()) are
duplicated across all architectures and sometimes more than once. For
instance, we have 31 definition of pgd_offset() for 25 supported
architectures.
Most of these definitions are actually identical and typically it boils
down to, e.g.
static inline unsigned long pmd_index(unsigned long address)
{
return (address >> PMD_SHIFT) & (PTRS_PER_PMD - 1);
}
static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
{
return (pmd_t *)pud_page_vaddr(*pud) + pmd_index(address);
}
These definitions can be shared among 90% of the arches provided
XYZ_SHIFT, PTRS_PER_XYZ and xyz_page_vaddr() are defined.
For architectures that really need a custom version there is always
possibility to override the generic version with the usual ifdefs magic.
These patches introduce include/linux/pgtable.h that replaces
include/asm-generic/pgtable.h and add the definitions of the page table
accessors to the new header.
This patch (of 12):
The linux/mm.h header includes <asm/pgtable.h> to allow inlining of the
functions involving page table manipulations, e.g. pte_alloc() and
pmd_alloc(). So, there is no point to explicitly include <asm/pgtable.h>
in the files that include <linux/mm.h>.
The include statements in such cases are remove with a simple loop:
for f in $(git grep -l "include <linux/mm.h>") ; do
sed -i -e '/include <asm\/pgtable.h>/ d' $f
done
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Cain <bcain@codeaurora.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Ungerer <gerg@linux-m68k.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Guo Ren <guoren@kernel.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Ley Foon Tan <ley.foon.tan@intel.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nick Hu <nickhu@andestech.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vincent Chen <deanbo422@gmail.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Link: http://lkml.kernel.org/r/20200514170327.31389-1-rppt@kernel.org
Link: http://lkml.kernel.org/r/20200514170327.31389-2-rppt@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix afs_compare_addrs() to use WARN_ON(1) instead of BUG() and return 1
(ie. srx_a > srx_b).
There's no point trying to put actual error handling in as this should not
occur unless a new transport address type is allowed by AFS. And even if
it does, in this particular case, it'll just never match unknown types of
addresses. This BUG() was more of a 'you need to add a case here'
indicator.
Reported-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Fix AFS file locking to use the correct vnode pointer and remove a member
of the afs_operation struct that is never set, but it is read and followed,
causing an oops.
This can be triggered by:
flock -s /afs/example.com/foo sleep 1
when it calls the kernel to get a file lock.
Fixes: e49c7b2f6d ("afs: Build an abstraction around an "operation" concept")
Reported-by: Dave Botsch <botsch@cnf.cornell.edu>
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Dave Botsch <botsch@cnf.cornell.edu>
Fix afs_put_sysnames() to actually free the specified afs_sysnames
object after its reference count has been decreased to zero and
its contents have been released.
Fixes: 6f8880d8e6 ("afs: Implement @sys substitution handling")
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: David Howells <dhowells@redhat.com>
This code calls brelse(bh) and then dereferences "bh" on the next line
resulting in a possible use after free. The brelse() should just be
moved down a line.
Fixes: b676fdbcf4c8 ("exfat: standardize checksum calculation")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
There is check error in range condition that can never be entered
even with invalid input.
Replace incorrent checking code with already existing valid checker.
Signed-off-by: hyeongseok.kim <hyeongseok@gmail.com>
Acked-by: Sungjong Seo <sj1557.seo@samsung.com>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
At truncate, there is a problem of incorrect updating in the file entry
pointer instead of stream entry. This will cause the problem of
overwriting the time field of the file entry to new_size. Fix it to
update stream entry.
Fixes: 98d917047e ("exfat: add file operations")
Cc: stable@vger.kernel.org # v5.7
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
kbuild test robot reported :
fs/exfat/nls.c:531:22: warning: Variable 'p_uniname->name_len'
is reassigned a value before the old one has been used.
The reassignment of p_uniname->name_len is not needed and remove it.
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
To clarify that it is a 16-bit checksum, the parts related to the 16-bit
checksum are renamed and change type to u16.
Furthermore, replace checksum calculation in exfat_load_upcase_table()
with exfat_calc_checksum32().
Signed-off-by: Tetsuhiro Kohada <kohada.t2@gmail.com>
Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Add Boot-Regions verification specified in exFAT specification.
Note that the checksum type is strongly related to the raw structure,
so the'u32 'type is used to clarify the number of bits.
Signed-off-by: Tetsuhiro Kohada <kohada.t2@gmail.com>
Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Separate the boot sector analysis to read_boot_sector().
And add a check for the fs_name field.
Furthermore, add a strict consistency check, because overlapping areas
can cause serious corruption.
Signed-off-by: Tetsuhiro Kohada <kohada.t2@gmail.com>
Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Aggregate PBR related definitions and redefine as "boot_sector" to comply
with the exFAT specification.
And, rename variable names including 'pbr'.
Signed-off-by: Tetsuhiro Kohada <kohada.t2@gmail.com>
Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Optimize directory access based on exfat_entry_set_cache.
- Hold bh instead of copied d-entry.
- Modify bh->data directly instead of the copied d-entry.
- Write back the retained bh instead of rescanning the d-entry-set.
And
- Remove unused cache related definitions.
Signed-off-by: Tetsuhiro Kohada <kohada.tetsuhiro@dc.mitsubishielectric.co.jp>
Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Replace time_ms with time_cs in the file directory entry structure
and related functions.
The unit of create_time_ms/modify_time_ms in File Directory Entry are not
'milli-second', but 'centi-second'.
The exfat specification uses the term '10ms', but instead use 'cs' as in
msdos_fs.h.
Signed-off-by: Tetsuhiro Kohada <kohada.t2@gmail.com>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
There is no need to init 'sync' in exfat_set_vol_flags().
This also fixes the following coccicheck warning:
fs/exfat/super.c:104:6-10: WARNING: Assignment of 0/1 to bool variable
Signed-off-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
After applying previous two patches, these functions are not used anymore.
Signed-off-by: Pali Rohár <pali@kernel.org>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Function partial_name_hash() takes long type value into which can be stored
one Unicode code point. Therefore conversion from UTF-32 to UTF-16 is not
needed.
Signed-off-by: Pali Rohár <pali@kernel.org>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Remove the direct use of KERN_<LEVEL> in functions by creating
separate exfat_<level> macros.
Miscellanea:
o Remove several unnecessary terminating newlines in formats
o Realign arguments and fit to 80 columns where appropriate
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
If two Unicode code points represented in UTF-16 are different then also
their UTF-32 representation must be different. Therefore conversion from
UTF-32 to UTF-16 is not needed.
Signed-off-by: Pali Rohár <pali@kernel.org>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
This code is more organized and robust.
Signed-off-by: Kenneth D'souza <kdsouza@redhat.com>
Signed-off-by: Roberto Bergantinos Corpas <rbergant@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Widen the type used for counting the number of bytes unshared.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
xfs_ifree_cluster() calls xfs_perag_get() at the beginning, but forgets to
call xfs_perag_put() in one failed path.
Add the missed function call to fix it.
Fixes: ce92464c18 ("xfs: make xfs_trans_get_buf return an error code")
Signed-off-by: Chuhong Yuan <hslester96@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
This patch adds another way to attach bio flags to node writes.
Description: Give a way to attach REQ_META|FUA to node writes
given temperature-based bits. Now the bits indicate:
* REQ_META | REQ_FUA |
* 5 | 4 | 3 | 2 | 1 | 0 |
* Cold | Warm | Hot | Cold | Warm | Hot |
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If mountpoint is readonly, we should allow shutdowning filesystem
successfully, this fixes issue found by generic/599 testcase of
xfstest.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If the dentry name passed to ->d_compare() fits in dentry::d_iname, then
it may be concurrently modified by a rename. This can cause undefined
behavior (possibly out-of-bounds memory accesses or crashes) in
utf8_strncasecmp(), since fs/unicode/ isn't written to handle strings
that may be concurrently modified.
Fix this by first copying the filename to a stack buffer if needed.
This way we get a stable snapshot of the filename.
Fixes: 2c2eb7a300 ("f2fs: Support case-insensitive file name lookups")
Cc: <stable@vger.kernel.org> # v5.4+
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Daniel Rosenberg <drosen@google.com>
Cc: Gabriel Krisman Bertazi <krisman@collabora.co.uk>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
kmalloc() returns kmalloc'ed memory, and kvmalloc() returns either
kmalloc'ed or vmalloc'ed memory. But the f2fs wrappers, f2fs_kmalloc()
and f2fs_kvmalloc(), both return both kinds of memory.
It's redundant to have two functions that do the same thing, and also
breaking the standard naming convention is causing bugs since people
assume it's safe to kfree() memory allocated by f2fs_kmalloc(). See
e.g. the various allocations in fs/f2fs/compress.c.
Fix this by making f2fs_kmalloc() just use kmalloc(). And to avoid
re-introducing the allocation failures that the vmalloc fallback was
intended to fix, convert the largest allocations to use f2fs_kvmalloc().
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
- OSD/MDS latency and caps cache metrics infrastructure for the
filesytem (Xiubo Li). Currently available through debugfs and
will be periodically sent to the MDS in the future.
- support for replica reads (balanced and localized reads) for
rbd and the filesystem (myself). The default remains to always
read from primary, users can opt-in with the new crush_location
and read_from_replica options. Note that reading from replica
is safe for general use only since Octopus.
- support for RADOS allocation hint flags (myself). Currently
used by rbd to propagate the compressible/incompressible hint
given with the new compression_hint map option and ready for
passing on more advanced hints, e.g. based on fadvise() from
the filesystem.
- support for efficient cross-quota-realm renames (Luis Henriques)
- assorted cap handling improvements and cleanups, particularly
untangling some of the locking (Jeff Layton)
-----BEGIN PGP SIGNATURE-----
iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAl7eZP0THGlkcnlvbW92
QGdtYWlsLmNvbQAKCRBKf944AhHziwJDB/98bH+dsJidUkRctVerX933DvgmRGva
sIxR0otqCK2zlucKSy8R8awbhVQ2lz4DQm9vrlwFQHBjZqXnrMzDG4rd/PukmKap
l8DjHRgEsH698zjwDlyyz7/1ZqOOUcCKr5fly3Erqr92yWGoy2ve76LtTKgB5jnv
wdwMk5v/NBWoxZ3Q1cvexbCtc60l0FCSH4FnH7NtT8eR9zCmL9vlpZWdjKi+U5em
6tTONuSq+0F4a9eXEv6QHEjRjkRo1WlttGdK3bX7mXD4O22TslgKg9hYsVoQVTiW
Cc9n6Pggv2tbUnPgn/x342W26QyMgcoHCzrYPR7w0JrU61TzBewxqfpg
=4fqQ
-----END PGP SIGNATURE-----
Merge tag 'ceph-for-5.8-rc1' of git://github.com/ceph/ceph-client
Pull ceph updates from Ilya Dryomov:
"The highlights are:
- OSD/MDS latency and caps cache metrics infrastructure for the
filesytem (Xiubo Li). Currently available through debugfs and will
be periodically sent to the MDS in the future.
- support for replica reads (balanced and localized reads) for rbd
and the filesystem (myself). The default remains to always read
from primary, users can opt-in with the new crush_location and
read_from_replica options. Note that reading from replica is safe
for general use only since Octopus.
- support for RADOS allocation hint flags (myself). Currently used by
rbd to propagate the compressible/incompressible hint given with
the new compression_hint map option and ready for passing on more
advanced hints, e.g. based on fadvise() from the filesystem.
- support for efficient cross-quota-realm renames (Luis Henriques)
- assorted cap handling improvements and cleanups, particularly
untangling some of the locking (Jeff Layton)"
* tag 'ceph-for-5.8-rc1' of git://github.com/ceph/ceph-client: (29 commits)
rbd: compression_hint option
libceph: support for alloc hint flags
libceph: read_from_replica option
libceph: support for balanced and localized reads
libceph: crush_location infrastructure
libceph: decode CRUSH device/bucket types and names
libceph: add non-asserting rbtree insertion helper
ceph: skip checking caps when session reconnecting and releasing reqs
ceph: make sure mdsc->mutex is nested in s->s_mutex to fix dead lock
ceph: don't return -ESTALE if there's still an open file
libceph, rbd: replace zero-length array with flexible-array
ceph: allow rename operation under different quota realms
ceph: normalize 'delta' parameter usage in check_quota_exceeded
ceph: ceph_kick_flushing_caps needs the s_mutex
ceph: request expedited service on session's last cap flush
ceph: convert mdsc->cap_dirty to a per-session list
ceph: reset i_requested_max_size if file write is not wanted
ceph: throw a warning if we destroy session with mutex still locked
ceph: fix potential race in ceph_check_caps
ceph: document what protects i_dirty_item and i_flushing_item
...
io_uring is the only user of io-wq, and now it uses only io-wq callback
for all its requests, namely io_wq_submit_work(). Instead of storing
work->runner callback in each instance of io_wq_work, keep it in io-wq
itself.
pros:
- reduces io_wq_work size
- more robust -- ->func won't be invalidated with mem{cpy,set}(req)
- helps other work
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Remove io_link_work_cb() -- the last custom work.func.
Not the prettiest thing, but works. Instead of queueing a linked timeout
in io_link_work_cb() mark a request with REQ_F_QUEUE_TIMEOUT and do
enqueueing based on the flag in io_wq_submit_work().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In preparation of getting rid of work.func, this removes almost all
custom instances of it, leaving only io_wq_submit_work() and
io_link_work_cb(). And the last one will be dealt later.
Nothing fancy, just routinely remove *_finish() function and inline
what's left. E.g. remove io_fsync_finish() + inline __io_fsync() into
io_fsync().
As no users of io_req_cancelled() are left, delete it as well. The patch
adds extra switch lookup on cold-ish path, but that's overweighted by
nice diffstat and other benefits of the following patches.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Relying on having a specific work.func is dangerous, even if an opcode
handler set it itself. E.g. io_wq_assign_next() can modify it.
io_close() sets a custom work.func to indicate that
__close_fd_get_file() was already called. Fortunately, there is no bugs
with io_wq_assign_next() and close yet.
Still, do it safe and always be prepared to be called through
io_wq_submit_work(). Zero req->close.put_file in prep, and call
__close_fd_get_file() IFF it's NULL.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
- An iopen glock locking scheme rework that speeds up deletes of
inodes accessed from multiple nodes.
- Various bug fixes and debugging improvements.
- Convert gfs2-glocks.txt to ReST.
-----BEGIN PGP SIGNATURE-----
iQJIBAABCAAyFiEEJZs3krPW0xkhLMTc1b+f6wMTZToFAl7eYjMUHGFncnVlbmJh
QHJlZGhhdC5jb20ACgkQ1b+f6wMTZTr/9g//cJ6jgiD/+qzh0VzougVksVZIduAl
RMB+kldOjBS2ORbyYM87Jm1tdyakgZuFO91XlSwChWRdC3Y2mqdaJIEE/kATqfY9
7Frlw++SyFKLvIf04kDYGk2hXX+umXXYFfrIiKb0tzDSGkPRaARUb3RM4TRvlSDP
/U0JlYA/4aXMUge+VpYsbpSGeqHNfEzmcmCyPXGmZYyh1MZ/RocMZFYEsP9NH82J
l07fxowUd10LJPEmBajzjD2NmEvjdvF4gBCOfJVNIfOzCj0CwXL3vmtu1SUNOKr+
em266EWZF89eOcvfdtE6xF0w81oCAK43wYRjIODSgI9JCLXGiOYmlWZVwZoqu5iy
2GQDhj/taq3ObuVqjR5n6GYuqMoJ+D0LSD13qccDALq/Bdy4lq9TMLSdDbkrVIM/
8BVn0nI5MzUlTV3mq6uxhU0HqtYDwUEiHWURWw6bYRug5OvQy3nbg/XZptYlnD87
XQccE4ErjlgSHLiYx1YckFz/GG6ytrRAKl9bGMkZ0u2+XmDsH+iJJgLcaXCPUP9h
/hhYagKI55UBDer7we4tppbu+gnJrg3PXlgImf53iMc7ia0KQHd+TfSIFkGPuydI
aEKKhIQzje23JayMbPRnwPbNlw9zU1loPi7hPS3rCpDY+w8oawpFyIieEOcxJyEt
pYkOt4BQi9LvpGc=
=PsCY
-----END PGP SIGNATURE-----
Merge tag 'gfs2-for-5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2
Pull gfs2 updates from Andreas Gruenbacher:
- An iopen glock locking scheme rework that speeds up deletes of inodes
accessed from multiple nodes
- Various bug fixes and debugging improvements
- Convert gfs2-glocks.txt to ReST
* tag 'gfs2-for-5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
gfs2: fix use-after-free on transaction ail lists
gfs2: new slab for transactions
gfs2: initialize transaction tr_ailX_lists earlier
gfs2: Smarter iopen glock waiting
gfs2: Wake up when setting GLF_DEMOTE
gfs2: Check inode generation number in delete_work_func
gfs2: Move inode generation number check into gfs2_inode_lookup
gfs2: Minor gfs2_lookup_by_inum cleanup
gfs2: Try harder to delete inodes locally
gfs2: Give up the iopen glock on contention
gfs2: Turn gl_delete into a delayed work
gfs2: Keep track of deleted inode generations in LVBs
gfs2: Allow ASPACE glocks to also have an lvb
gfs2: instrumentation wrt log_flush stuck
gfs2: introduce new gfs2_glock_assert_withdraw
gfs2: print mapping->nrpages in glock dump for address space glocks
gfs2: Only do glock put in gfs2_create_inode for free inodes
gfs2: Allow lock_nolock mount to specify jid=X
gfs2: Don't ignore inode write errors during inode_go_sync
docs: filesystems: convert gfs2-glocks.txt to ReST
Merge still more updates from Andrew Morton:
"Various trees. Mainly those parts of MM whose linux-next dependents
are now merged. I'm still sitting on ~160 patches which await merges
from -next.
Subsystems affected by this patch series: mm/proc, ipc, dynamic-debug,
panic, lib, sysctl, mm/gup, mm/pagemap"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (52 commits)
doc: cgroup: update note about conditions when oom killer is invoked
module: move the set_fs hack for flush_icache_range to m68k
nommu: use flush_icache_user_range in brk and mmap
binfmt_flat: use flush_icache_user_range
exec: use flush_icache_user_range in read_code
exec: only build read_code when needed
m68k: implement flush_icache_user_range
arm: rename flush_cache_user_range to flush_icache_user_range
xtensa: implement flush_icache_user_range
sh: implement flush_icache_user_range
asm-generic: add a flush_icache_user_range stub
mm: rename flush_icache_user_range to flush_icache_user_page
arm,sparc,unicore32: remove flush_icache_user_range
riscv: use asm-generic/cacheflush.h
powerpc: use asm-generic/cacheflush.h
openrisc: use asm-generic/cacheflush.h
m68knommu: use asm-generic/cacheflush.h
microblaze: use asm-generic/cacheflush.h
ia64: use asm-generic/cacheflush.h
hexagon: use asm-generic/cacheflush.h
...
load_flat_file works on user addresses.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Greg Ungerer <gerg@linux-m68k.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Link: http://lkml.kernel.org/r/20200515143646.3857579-28-hch@lst.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
read_code operates on user addresses.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Link: http://lkml.kernel.org/r/20200515143646.3857579-27-hch@lst.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Only build read_code when binary formats that use it are built into the
kernel.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Link: http://lkml.kernel.org/r/20200515143646.3857579-26-hch@lst.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
After a recent change introduced by Vlastimil's series [0], kernel is
able now to handle sysctl parameters on kernel command line; also, the
series introduced a simple infrastructure to convert legacy boot
parameters (that duplicate sysctls) into sysctl aliases.
This patch converts the watchdog parameters softlockup_panic and
{hard,soft}lockup_all_cpu_backtrace to use the new alias infrastructure.
It fixes the documentation too, since the alias only accepts values 0 or
1, not the full range of integers.
We also took the opportunity here to improve the documentation of the
previously converted hung_task_panic (see the patch series [0]) and put
the alias table in alphabetical order.
[0] http://lkml.kernel.org/r/20200427180433.7029-1-vbabka@suse.cz
Signed-off-by: Guilherme G. Piccoli <gpiccoli@canonical.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Kees Cook <keescook@chromium.org>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Link: http://lkml.kernel.org/r/20200507214624.21911-1-gpiccoli@canonical.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We can now handle sysctl parameters on kernel command line and have
infrastructure to convert legacy command line options that duplicate
sysctl to become a sysctl alias.
This patch converts the hung_task_panic parameter. Note that the sysctl
handler is more strict and allows only 0 and 1, while the legacy
parameter allowed any non-zero value. But there is little reason anyone
would not be using 1.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: "Eric W . Biederman" <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Guilherme G . Piccoli" <gpiccoli@canonical.com>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Ivan Teterevkov <ivan.teterevkov@nutanix.com>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20200427180433.7029-4-vbabka@suse.cz
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We can now handle sysctl parameters on kernel command line, but
historically some parameters introduced their own command line
equivalent, which we don't want to remove for compatibility reasons.
We can, however, convert them to the generic infrastructure with a table
translating the legacy command line parameters to their sysctl names,
and removing the one-off param handlers.
This patch adds the support and makes the first conversion to
demonstrate it, on the (deprecated) numa_zonelist_order parameter.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: "Eric W . Biederman" <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Guilherme G . Piccoli" <gpiccoli@canonical.com>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Ivan Teterevkov <ivan.teterevkov@nutanix.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20200427180433.7029-3-vbabka@suse.cz
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "support setting sysctl parameters from kernel command line", v3.
This series adds support for something that seems like many people
always wanted but nobody added it yet, so here's the ability to set
sysctl parameters via kernel command line options in the form of
sysctl.vm.something=1
The important part is Patch 1. The second, not so important part is an
attempt to clean up legacy one-off parameters that do the same thing as
a sysctl. I don't want to remove them completely for compatibility
reasons, but with generic sysctl support the idea is to remove the
one-off param handlers and treat the parameters as aliases for the
sysctl variants.
I have identified several parameters that mention sysctl counterparts in
Documentation/admin-guide/kernel-parameters.txt but there might be more.
The conversion also has varying level of success:
- numa_zonelist_order is converted in Patch 2 together with adding the
necessary infrastructure. It's easy as it doesn't really do anything
but warn on deprecated value these days.
- hung_task_panic is converted in Patch 3, but there's a downside that
now it only accepts 0 and 1, while previously it was any integer
value
- nmi_watchdog maps to two sysctls nmi_watchdog and hardlockup_panic,
so there's no straighforward conversion possible
- traceoff_on_warning is a flag without value and it would be required
to handle that somehow in the conversion infractructure, which seems
pointless for a single flag
This patch (of 5):
A recently proposed patch to add vm_swappiness command line parameter in
addition to existing sysctl [1] made me wonder why we don't have a
general support for passing sysctl parameters via command line.
Googling found only somebody else wondering the same [2], but I haven't
found any prior discussion with reasons why not to do this.
Settings the vm_swappiness issue aside (the underlying issue might be
solved in a different way), quick search of kernel-parameters.txt shows
there are already some that exist as both sysctl and kernel parameter -
hung_task_panic, nmi_watchdog, numa_zonelist_order, traceoff_on_warning.
A general mechanism would remove the need to add more of those one-offs
and might be handy in situations where configuration by e.g.
/etc/sysctl.d/ is impractical.
Hence, this patch adds a new parse_args() pass that looks for parameters
prefixed by 'sysctl.' and tries to interpret them as writes to the
corresponding sys/ files using an temporary in-kernel procfs mount.
This mechanism was suggested by Eric W. Biederman [3], as it handles
all dynamically registered sysctl tables, even though we don't handle
modular sysctls. Errors due to e.g. invalid parameter name or value
are reported in the kernel log.
The processing is hooked right before the init process is loaded, as
some handlers might be more complicated than simple setters and might
need some subsystems to be initialized. At the moment the init process
can be started and eventually execute a process writing to /proc/sys/
then it should be also fine to do that from the kernel.
Sysctls registered later on module load time are not set by this
mechanism - it's expected that in such scenarios, setting sysctl values
from userspace is practical enough.
[1] https://lore.kernel.org/r/BL0PR02MB560167492CA4094C91589930E9FC0@BL0PR02MB5601.namprd02.prod.outlook.com/
[2] https://unix.stackexchange.com/questions/558802/how-to-set-sysctl-using-kernel-command-line-parameter
[3] https://lore.kernel.org/r/87bloj2skm.fsf@x220.int.ebiederm.org/
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Ivan Teterevkov <ivan.teterevkov@nutanix.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: "Eric W . Biederman" <ebiederm@xmission.com>
Cc: "Guilherme G . Piccoli" <gpiccoli@canonical.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Link: http://lkml.kernel.org/r/20200427180433.7029-1-vbabka@suse.cz
Link: http://lkml.kernel.org/r/20200427180433.7029-2-vbabka@suse.cz
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
posix_acl_permission() does not care about MAY_NOT_BLOCK, and in fact
the permission logic internally must not check that bit (it's only for
upper layers to decide whether they can block to do IO to look up the
acl information or not).
But the way the code was written, it _looked_ like it cared, since the
function explicitly did not mask that bit off.
But it has exactly two callers: one for when that bit is set, which
first clears the bit before calling posix_acl_permission(), and the
other call site when that bit was clear.
So stop the silly games "saving" the MAY_NOT_BLOCK bit that must not be
used for the actual permission test, and that currently is pointlessly
cleared by the callers when the function itself should just not care.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Rasmus Villemoes points out that the 'in_group_p()' tests can be a
noticeable expense, and often completely unnecessary. A common
situation is that the 'group' bits are the same as the 'other' bits
wrt the permissions we want to test.
So rewrite 'acl_permission_check()' to not bother checking for group
ownership when the permission check doesn't care.
For example, if we're asking for read permissions, and both 'group' and
'other' allow reading, there's really no reason to check if we're part
of the group or not: either way, we'll allow it.
Rasmus says:
"On a bog-standard Ubuntu 20.04 install, a workload consisting of
compiling lots of userspace programs (i.e., calling lots of
short-lived programs that all need to get their shared libs mapped in,
and the compilers poking around looking for system headers - lots of
/usr/lib, /usr/bin, /usr/include/ accesses) puts in_group_p around
0.1% according to perf top.
System-installed files are almost always 0755 (directories and
binaries) or 0644, so in most cases, we can avoid the binary search
and the cost of pulling the cred->groups array and in_group_p() .text
into the cpu cache"
Reported-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Account for the number of provided buffers when validating the address
range.
Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Directory is always locked until "out_unlock" label. So lock check is not
needed.
Signed-off-by: youngjun <her0gyugyu@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
+ Features
- Replace zero-length array with flexible-array
- add a valid state flags check
- add consistency check between state and dfa diff encode flags
- add apparmor subdir to proc attr interface
- fail unpack if profile mode is unknown
- add outofband transition and use it in xattr match
- ensure that dfa state tables have entries
+ Cleanups
- Use true and false for bool variable
- Remove semicolon
- Clean code by removing redundant instructions
- Replace two seq_printf() calls by seq_puts() in aa_label_seq_xprint()
- remove duplicate check of xattrs on profile attachment
- remove useless aafs_create_symlink
+ Bug fixes
- Fix memory leak of profile proxy
- fix introspection of of task mode for unconfined tasks
- fix nnp subset test for unconfined
- check/put label on apparmor_sk_clone_security()
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE7cSDD705q2rFEEf7BS82cBjVw9gFAl7dUf4ACgkQBS82cBjV
w9j8rA//R3qbVeiN3SJtxLhiF3AAdP2cVbZ/mAhQLwYObI6flb1bliiahJHRf8Ey
FaVb4srOH8NlmzNINZehXOvD3UDwX/sbpw8h0Y0JolO+v1m3UXkt/eRoMt6gRz7I
jtaImY1/V+G4O5rV5fGA1HQI8Geg+W9Abt32d16vyKIIpnBS/Pfv8ppM0NcHCZ4G
e8935T/dMNK5K0Y7HNb1nMjyzEr0LtEXvXznBOrGVpCtDQ45m0/NBvAqpfhuKsVm
FE5Na8rgtiB9sU72LaoNXNr8Y5LVgkXPmBr/e1FqZtF01XEarKb7yJDGOLrLpp1o
rGYpY9DQSBT/ZZrwMaLFqCd1XtnN1BAmhlM6TXfnm25ArEnQ49ReHFc7ZHZRSTZz
LWVBD6atZbapvqckk1SU49eCLuGs5wmRj/CmwdoQUbZ+aOfR68zF+0PANbP5xDo4
862MmeMsm8JHndeCelpZQRbhtXt0t9MDzwMBevKhxV9hbpt4g8DcnC5tNUc9AnJi
qJDsMkytYhazIW+/4MsnLTo9wzhqzXq5kBeE++Xl7vDE/V+d5ocvQg73xtwQo9sx
LzMlh3cPmBvOnlpYfnONZP8pJdjDAuESsi/H5+RKQL3cLz7NX31CLWR8dXLBHy80
Dvxqvy84Cf7buigqwSzgAGKjDI5HmeOECAMjpLbEB2NS9xxQYuk=
=U7d2
-----END PGP SIGNATURE-----
Merge tag 'apparmor-pr-2020-06-07' of git://git.kernel.org/pub/scm/linux/kernel/git/jj/linux-apparmor
Pull apparmor updates from John Johansen:
"Features:
- Replace zero-length array with flexible-array
- add a valid state flags check
- add consistency check between state and dfa diff encode flags
- add apparmor subdir to proc attr interface
- fail unpack if profile mode is unknown
- add outofband transition and use it in xattr match
- ensure that dfa state tables have entries
Cleanups:
- Use true and false for bool variable
- Remove semicolon
- Clean code by removing redundant instructions
- Replace two seq_printf() calls by seq_puts() in aa_label_seq_xprint()
- remove duplicate check of xattrs on profile attachment
- remove useless aafs_create_symlink
Bug fixes:
- Fix memory leak of profile proxy
- fix introspection of of task mode for unconfined tasks
- fix nnp subset test for unconfined
- check/put label on apparmor_sk_clone_security()"
* tag 'apparmor-pr-2020-06-07' of git://git.kernel.org/pub/scm/linux/kernel/git/jj/linux-apparmor:
apparmor: Fix memory leak of profile proxy
apparmor: fix introspection of of task mode for unconfined tasks
apparmor: check/put label on apparmor_sk_clone_security()
apparmor: Use true and false for bool variable
security/apparmor/label.c: Clean code by removing redundant instructions
apparmor: Replace zero-length array with flexible-array
apparmor: ensure that dfa state tables have entries
apparmor: remove duplicate check of xattrs on profile attachment.
apparmor: add outofband transition and use it in xattr match
apparmor: fail unpack if profile mode is unknown
apparmor: fix nnp subset test for unconfined
apparmor: remove useless aafs_create_symlink
apparmor: add proc subdir to attrs
apparmor: add consistency check between state and dfa diff encode flags
apparmor: add a valid state flags check
AppArmor: Remove semicolon
apparmor: Replace two seq_printf() calls by seq_puts() in aa_label_seq_xprint()
Here is the set of driver core patches for 5.8-rc1.
Not all that huge this release, just a number of small fixes and
updates:
- software node fixes
- kobject now sends KOBJ_REMOVE when it is removed from sysfs,
not when it is removed from memory (which could come much
later)
- device link additions and fixes based on testing on more
devices
- firmware core cleanups
- other minor changes, full details in the shortlog
All have been in linux-next for a while with no reported issues.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCXtzmXg8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+ymaAQCfZZ9prH3AMLF7DIkG3vMw0njLXt0An2FxrKYU
wetHRG4KL9vTkdz7+TqU
=t5LE
-----END PGP SIGNATURE-----
Merge tag 'driver-core-5.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core updates from Greg KH:
"Here is the set of driver core patches for 5.8-rc1.
Not all that huge this release, just a number of small fixes and
updates:
- software node fixes
- kobject now sends KOBJ_REMOVE when it is removed from sysfs, not
when it is removed from memory (which could come much later)
- device link additions and fixes based on testing on more devices
- firmware core cleanups
- other minor changes, full details in the shortlog
All have been in linux-next for a while with no reported issues"
* tag 'driver-core-5.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (23 commits)
driver core: Update device link status correctly for SYNC_STATE_ONLY links
firmware_loader: change enum fw_opt to u32
software node: implement software_node_unregister()
kobject: send KOBJ_REMOVE uevent when the object is removed from sysfs
driver core: Remove unnecessary is_fwnode_dev variable in device_add()
drivers property: When no children in primary, try secondary
driver core: platform: Fix spelling errors in platform.c
driver core: Remove check in driver_deferred_probe_force_trigger()
of: platform: Batch fwnode parsing when adding all top level devices
driver core: fw_devlink: Add support for batching fwnode parsing
driver core: Look for waiting consumers only for a fwnode's primary device
driver core: Move code to the right part of the file
Revert "Revert "driver core: Set fw_devlink to "permissive" behavior by default""
drivers: base: Fix NULL pointer exception in __platform_driver_probe() if a driver developer is foolish
firmware_loader: move fw_fallback_config to a private kernel symbol namespace
driver core: Add missing '\n' in log messages
driver/base/soc: Use kobj_to_dev() API
Add documentation on meaning of -EPROBE_DEFER
driver core: platform: remove redundant assignment to variable ret
debugfs: Use the correct style for SPDX License Identifier
...
It is better to check volume id and creation time, not just
the root inode number to verify if the volume has changed
when remounting.
Reviewed-by: David Howells <dhowells@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Conversion: John Hubbard's conversion from get_user_pages() to pin_user_pages()
cleanup: Colin Ian King's removal of an unneeded variable initialization.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEIGSFVdO6eop9nER2z0QOqevODb4FAl7ao4AACgkQz0QOqevO
Db4MqRAAjb+nRMDA6aYfCY1Veh9BUnVeVPAqWdlSPbo7TdDfhCBJKgwZ8Y3GmOnR
vYnVn6Yz/uhFwoZWmSd0AXgB55kGKyXNcHlyPO4FcaWGDNd/Dn8WWc7/lsqHnDFX
cg0Ioy7VeYS+Y+iW3ZkEjkeGyVFProl/OsJrf9vfJiuyZrLp4th/ctlbV/sIA2R6
XNk0ld21gEB5YbrTCQebtyhdJLp9+hhI0BB2Lxm/JUZyyK4J2rN8H+SF/5JE8wEj
SJD7K5kukxED2Kh3pU1fvRVr0VvHZjLHQav6TgW6GPokmb6EWZwvIYfUa8go50Jz
5fkpyRc8d3zibxDSdL6/Gr6mZQxceFnYvfPs27Vq3O08J/dhWHX0LlKctD4pEHNR
NUsF8Piko+16JwPg+EeXTIsYrMiW8g5FhoTCl7FILQ06F2P6NDCyOUyHiLOU4KaN
+bp6YmIyJBfIBgXz58Pqq2JwibkN6zYiRrb5jDDWv2Nw9ykpjNykrAXH6Fbz8JKV
dPbKkxzx5HQuHBaxYCRKV4r3UNBYKAECrWcLQglG8D+EzvYrBxFZQoPu3qapJPJG
8fQKCKqU8hM6BGYpI76FIK8o/BrOdYr0ttqDsambDWt/uRXRKXwvKJVz3wOB8Q7n
om+XlIk4UE/7vgpTru/wzVX0pmq8WWskbDOKkGqgeUC0BLuZcJw=
=3A+d
-----END PGP SIGNATURE-----
Merge tag 'for-linus-5.8-ofs1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux
Pull orangefs updates from Mike Marshall:
- John Hubbard's conversion from get_user_pages() to pin_user_pages()
- Colin Ian King's removal of an unneeded variable initialization
* tag 'for-linus-5.8-ofs1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux:
orangefs: convert get_user_pages() --> pin_user_pages()
orangefs: remove redundant assignment to variable ret
This set includes a couple minor cleanups, and dropping the
interruptible from a wait_event that waits for an event from
the userspace cluster management.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJe2op+AAoJEDgbc8f8gGmqJKkP/jE0BCnE7VbpID0sDaBLpntk
+v53jxR/bpw6g+LwlJGBDLCRrcn8MnhC72W1OPZyFJ4AiuN5BHdvTVVSPSCBSHnU
WfJKqbVSRt+PRaATJd8iCLdoMj858vLWFo380XVMTRsAaxYJ4dwRUMBUOx/pL8Dy
pcDXDgWTC06nuRmSY8eYJYJWX3c6SdpHHbg4IA127Y9oTjIAvYyeFC4ohRtRJqrj
bQlpjnfJ9bNwd5N15turGiR5IzY7UXI9wa/qylngr5B5gVxJttFpaE4OFF80HOkY
Xxqqqhi9dIcyAAzftJaRQpyAq6kRmFgFAjH5Rvx+7n0BEMe+AAhQC9Coei1Zysh1
fssYrxRC3fqX1kg2alhpdBdCugFq7IUdFuwH044M+w8jcsoggGq8vGRMJEuKgnnm
kLLBEi5HtAv0S9rAE9Acnqanc6lRvzJIc0vysG/0LQqzWr+F7HtRnS7kkOkO2+e1
014k20FmooJiu5XGrCz+dvmnIACpr6Z+S119uXntYGDDA+YGcJQTMs6pzLh5UTqB
40aIZPPRRj6K/Vvx2VJgEbzwmwWjjdtsYkcJEq+QPPsAkPa8EnbqICCyJDkUuY5K
kltYHQ2IOa466KC8JHBS9jdVnHftk0jIl75eiP7vK+ZUEW+iwRgEXY11oiWeY23J
EKfPgi0oikfmlj2Y3jme
=VTRw
-----END PGP SIGNATURE-----
Merge tag 'dlm-5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm
Pull dlm updates from David Teigland:
"This set includes a couple minor cleanups, and dropping the
interruptible from a wait_event that waits for an event from the
userspace cluster management"
* tag 'dlm-5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm:
dlm: remove BUG() before panic()
dlm: Switch to using wait_event()
fs:dlm:remove unneeded semicolon in rcom.c
dlm: user: Replace zero-length array with flexible-array member
dlm: dlm_internal: Replace zero-length array with flexible-array member
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAl7aelsACgkQiiy9cAdy
T1HDmwv9Fj6OaXXx+btNvbB6xTWvCwMVKHwTPURMx+IjBYjJC65yPGkInPPkfUVo
7L9h55XCLwFohECleZJCkKOrJtnX1P8SsHtZck6QqjvUETJl/L3pAXpMMYACHLpg
x4DE/NFkcW95J38s9Jtjhphq8ZGUhuDhaT+QeEd2Iq8HzAxk5ND47ZXkomMx1EEM
ZsOrmJF+k2YQyDDpfhJeVF5iZDkbpASqA/TlLxxGH34IdAZIUB9qtGKADNLZ6YyT
qpG601CSrEdl3tVY+SlRMHqwVTRhCViPD6Q3fMw8Xha436RIiWJJ4Rvn6bSP/ZQl
PDPuSVRB2zmd70C/3ojXdku9+VfQLO52qkO3bf2IjgVJ3ARrxFxW7cb7bmYRqdyT
WI5N1+8gETrIAK7aB3QKdmkcRFDtJD3wOTfBcgctuB8WrYrDvW2MNKkPbQdY5tnN
xfp4f10Dg4d+/8knSytJrdKkDublU0kGbfLAa0oupjAzV6WB0qNt0TNGxE42L+ug
iFSJOZxi
=gCcP
-----END PGP SIGNATURE-----
Merge tag '5.8-rc-smb3-fixes-part-1' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs updates from Steve French:
"22 changesets, 2 for stable.
Includes big performance improvement for large i/o when using
multichannel, also includes DFS fixes"
* tag '5.8-rc-smb3-fixes-part-1' of git://git.samba.org/sfrench/cifs-2.6: (22 commits)
cifs: update internal module version number
cifs: multichannel: try to rebind when reconnecting a channel
cifs: multichannel: use pointer for binding channel
smb3: remove static checker warning
cifs: multichannel: move channel selection above transport layer
cifs: multichannel: always zero struct cifs_io_parms
cifs: dump Security Type info in DebugData
smb3: fix incorrect number of credits when ioctl MaxOutputResponse > 64K
smb3: default to minimum of two channels when multichannel specified
cifs: multichannel: move channel selection in function
cifs: fix minor typos in comments and log messages
smb3: minor update to compression header definitions
cifs: minor fix to two debug messages
cifs: Standardize logging output
smb3: Add new parm "nodelete"
cifs: move some variables off the stack in smb2_ioctl_query_info
cifs: reduce stack use in smb2_compound_op
cifs: get rid of unused parameter in reconn_setup_dfs_targets()
cifs: handle hostnames that resolve to same ip in failover
cifs: set up next DFS target before generic_ip_connect()
...
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEqG5UsNXhtOCrfGQP+7dXa6fLC2sFAl7ZC5kACgkQ+7dXa6fL
C2uv9A/+NKlTSXyv2ZuvtmXADelndcXJ+nC+3bwI7Jh43aa8uCCsAVYD0VE+dxor
Ingj/LUJ2sjjp6RXCeeqqETXCoCVt0zK2g216+An7k84KJ+ms+MDa8dNN7l6280S
1jw4hnT0+g9Ln6elgqBroV980MJC2NGL0Eaete8zFO8UqYZy5w1ge0HfGck2l45U
2lr6egCWYSUPmtFKXJnLV8luwRvq7DzvTk9WrJu3kwOjaY1AQP1+1VpdhChJLrRc
/4Ddy1On5IXiFrPi5OtHA422bfirUpIv2HbmI047W9uiZ05MiXwSvNS1qJLTa1AA
T/SK88d3FCeSYw3olAne2kEl9uewvGByr98fDKFOcDHZj18abd9/VtUp33RXxYBy
lN2wqlWP++LlZ4sMCbbvLXX8OB1tekQzWQC0vJ5rhRSgveOlhL9TLG2Y05xokFs+
AwK8zTlDIZ6Pa/JIHfp2E0ZhXEazWTSmP+d7NkgzF0iiORukvsmxjOVUZC4+UCqK
rYN6goJ5g8qpejRv5NhfP6/olb1NK33f/F2QSSFfxv9zda4HNlayvcoSnFrdUEnt
IfBhSKPkeDVWs1yse7glDuw19tHp94B9UYwJ46qfHngQPArgy+gp23d0cSy41Pr5
FRQ23eNvBWIP4srt1gSCBexSGA1h/ACji41CPTJbF2jg5uWFAUE=
=YVwD
-----END PGP SIGNATURE-----
Merge tag 'afs-next-20200604' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
Pull AFS updates from David Howells:
"There's some core VFS changes which affect a couple of filesystems:
- Make the inode hash table RCU safe and providing some RCU-safe
accessor functions. The search can then be done without taking the
inode_hash_lock. Care must be taken because the object may be being
deleted and no wait is made.
- Allow iunique() to avoid taking the inode_hash_lock.
- Allow AFS's callback processing to avoid taking the inode_hash_lock
when using the inode table to find an inode to notify.
- Improve Ext4's time updating. Konstantin Khlebnikov said "For now,
I've plugged this issue with try-lock in ext4 lazy time update.
This solution is much better."
Then there's a set of changes to make a number of improvements to the
AFS driver:
- Improve callback (ie. third party change notification) processing
by:
(a) Relying more on the fact we're doing this under RCU and by
using fewer locks. This makes use of the RCU-based inode
searching outlined above.
(b) Moving to keeping volumes in a tree indexed by volume ID
rather than a flat list.
(c) Making the server and volume records logically part of the
cell. This means that a server record now points directly at
the cell and the tree of volumes is there. This removes an N:M
mapping table, simplifying things.
- Improve keeping NAT or firewall channels open for the server
callbacks to reach the client by actively polling the fileserver on
a timed basis, instead of only doing it when we have an operation
to process.
- Improving detection of delayed or lost callbacks by including the
parent directory in the list of file IDs to be queried when doing a
bulk status fetch from lookup. We can then check to see if our copy
of the directory has changed under us without us getting notified.
- Determine aliasing of cells (such as a cell that is pointed to be a
DNS alias). This allows us to avoid having ambiguity due to
apparently different cells using the same volume and file servers.
- Improve the fileserver rotation to do more probing when it detects
that all of the addresses to a server are listed as non-responsive.
It's possible that an address that previously stopped responding
has become responsive again.
Beyond that, lay some foundations for making some calls asynchronous:
- Turn the fileserver cursor struct into a general operation struct
and hang the parameters off of that rather than keeping them in
local variables and hang results off of that rather than the call
struct.
- Implement some general operation handling code and simplify the
callers of operations that affect a volume or a volume component
(such as a file). Most of the operation is now done by core code.
- Operations are supplied with a table of operations to issue
different variants of RPCs and to manage the completion, where all
the required data is held in the operation object, thereby allowing
these to be called from a workqueue.
- Put the standard "if (begin), while(select), call op, end" sequence
into a canned function that just emulates the current behaviour for
now.
There are also some fixes interspersed:
- Don't let the EACCES from ICMP6 mapping reach the user as such,
since it's confusing as to whether it's a filesystem error. Convert
it to EHOSTUNREACH.
- Don't use the epoch value acquired through probing a server. If we
have two servers with the same UUID but in different cells, it's
hard to draw conclusions from them having different epoch values.
- Don't interpret the argument to the CB.ProbeUuid RPC as a
fileserver UUID and look up a fileserver from it.
- Deal with servers in different cells having the same UUIDs. In the
event that a CB.InitCallBackState3 RPC is received, we have to
break the callback promises for every server record matching that
UUID.
- Don't let afs_statfs return values that go below 0.
- Don't use running fileserver probe state to make server selection
and address selection decisions on. Only make decisions on final
state as the running state is cleared at the start of probing"
Acked-by: Al Viro <viro@zeniv.linux.org.uk> (fs/inode.c part)
* tag 'afs-next-20200604' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs: (27 commits)
afs: Adjust the fileserver rotation algorithm to reprobe/retry more quickly
afs: Show more a bit more server state in /proc/net/afs/servers
afs: Don't use probe running state to make decisions outside probe code
afs: Fix afs_statfs() to not let the values go below zero
afs: Fix the by-UUID server tree to allow servers with the same UUID
afs: Reorganise volume and server trees to be rooted on the cell
afs: Add a tracepoint to track the lifetime of the afs_volume struct
afs: Detect cell aliases 3 - YFS Cells with a canonical cell name op
afs: Detect cell aliases 2 - Cells with no root volumes
afs: Detect cell aliases 1 - Cells with root volumes
afs: Implement client support for the YFSVL.GetCellName RPC op
afs: Retain more of the VLDB record for alias detection
afs: Fix handling of CB.ProbeUuid cache manager op
afs: Don't get epoch from a server because it may be ambiguous
afs: Build an abstraction around an "operation" concept
afs: Rename struct afs_fs_cursor to afs_operation
afs: Remove the error argument from afs_protocol_error()
afs: Set error flag rather than return error from file status decode
afs: Make callback processing more efficient.
afs: Show more information in /proc/net/afs/servers
...
* Fix performance problems found in dioread_nolock now that it is the
default, caused by transaction leaks.
* Clean up fiemap handling in ext4
* Clean up and refactor multiple block allocator (mballoc) code
* Fix a problem with mballoc with a smaller file systems running out
of blocks because they couldn't properly use blocks that had been
reserved by inode preallocation.
* Fixed a race in ext4_sync_parent() versus rename()
* Simplify the error handling in the extent manipulation code
* Make sure all metadata I/O errors are felected to ext4_ext_dirty()'s and
ext4_make_inode_dirty()'s callers.
* Avoid passing an error pointer to brelse in ext4_xattr_set()
* Fix race which could result to freeing an inode on the dirty last
in data=journal mode.
* Fix refcount handling if ext4_iget() fails
* Fix a crash in generic/019 caused by a corrupted extent node
-----BEGIN PGP SIGNATURE-----
iQEyBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAl7Ze8kACgkQ8vlZVpUN
gaNChAf4xn0ytFSrweI/S2Sp05G/2L/ocZ2TZZk2ZdGeN1E+ABdSIv/zIF9zuFgZ
/pY/C+fyEZWt4E3FlNO8gJzoEedkzMCMnUhSIfI+wZbcclyTOSNMJtnrnJKAEtVH
HOvGZJmg357jy407RCGhZpJ773nwU2xhBTr5OFxvSf9mt/vzebxIOnw5D7HPlC1V
Fgm6Du8q+tRrPsyjv1Yu4pUEVXMJ7qUcvt326AXVM3kCZO1Aa5GrURX0w3J4mzW1
tc1tKmtbLcVVYTo9CwHXhk/edbxrhAydSP2iACand3tK6IJuI6j9x+bBJnxXitnr
vsxsfTYMG18+2SxrJ9LwmagqmrRq
=HMTs
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 updates from Ted Ts'o:
"A lot of bug fixes and cleanups for ext4, including:
- Fix performance problems found in dioread_nolock now that it is the
default, caused by transaction leaks.
- Clean up fiemap handling in ext4
- Clean up and refactor multiple block allocator (mballoc) code
- Fix a problem with mballoc with a smaller file systems running out
of blocks because they couldn't properly use blocks that had been
reserved by inode preallocation.
- Fixed a race in ext4_sync_parent() versus rename()
- Simplify the error handling in the extent manipulation code
- Make sure all metadata I/O errors are felected to
ext4_ext_dirty()'s and ext4_make_inode_dirty()'s callers.
- Avoid passing an error pointer to brelse in ext4_xattr_set()
- Fix race which could result to freeing an inode on the dirty last
in data=journal mode.
- Fix refcount handling if ext4_iget() fails
- Fix a crash in generic/019 caused by a corrupted extent node"
* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (58 commits)
ext4: avoid unnecessary transaction starts during writeback
ext4: don't block for O_DIRECT if IOCB_NOWAIT is set
ext4: remove the access_ok() check in ext4_ioctl_get_es_cache
fs: remove the access_ok() check in ioctl_fiemap
fs: handle FIEMAP_FLAG_SYNC in fiemap_prep
fs: move fiemap range validation into the file systems instances
iomap: fix the iomap_fiemap prototype
fs: move the fiemap definitions out of fs.h
fs: mark __generic_block_fiemap static
ext4: remove the call to fiemap_check_flags in ext4_fiemap
ext4: split _ext4_fiemap
ext4: fix fiemap size checks for bitmap files
ext4: fix EXT4_MAX_LOGICAL_BLOCK macro
add comment for ext4_dir_entry_2 file_type member
jbd2: avoid leaking transaction credits when unreserving handle
ext4: drop ext4_journal_free_reserved()
ext4: mballoc: use lock for checking free blocks while retrying
ext4: mballoc: refactor ext4_mb_good_group()
ext4: mballoc: introduce pcpu seqcnt for freeing PA to improve ENOSPC handling
ext4: mballoc: refactor ext4_mb_discard_preallocations()
...
A few large, long discussed works this time. The RNBD block driver has
been posted for nearly two years now, and the removal of FMR has been a
recurring discussion theme for a long time. The usual smattering of
features and bug fixes.
- Various small driver bugs fixes in rxe, mlx5, hfi1, and efa
- Continuing driver cleanups in bnxt_re, hns
- Big cleanup of mlx5 QP creation flows
- More consistent use of src port and flow label when LAG is used and a
mlx5 implementation
- Additional set of cleanups for IB CM
- 'RNBD' network block driver and target. This is a network block RDMA
device specific to ionos's cloud environment. It brings strong multipath
and resiliency capabilities.
- Accelerated IPoIB for HFI1
- QP/WQ/SRQ ioctl migration for uverbs, and support for multiple async fds
- Support for exchanging the new IBTA defiend ECE data during RDMA CM
exchanges
- Removal of the very old and insecure FMR interface from all ULPs and
drivers. FRWR should be preferred for at least a decade now.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEfB7FMLh+8QxL+6i3OG33FX4gmxoFAl7X/IwACgkQOG33FX4g
mxp2uw/+MI2S/aXqEBvZfTT8yrkAwqYezS0VeTDnwH/T6UlTMDhHVN/2Ji3tbbX3
FEKT1i2mnAL5RqUAL1lr9g4sG/bVozrpN46Ws5Lu9dTbIPLKTNPWDuLFQDUShKY7
OyMI/bRx6anGnsOy20iiBqnrQbrrZj5TECgnmrkAl62QFdcl7aBWe/yYjy4CT11N
ub+aBXBREN1F1pc0HIjd2tI+8gnZc+mNm1LVVDRH9Capun/pI26qDNh7e6QwGyIo
n8ItraC8znLwv/nsUoTE7/JRcsTEe6vJI26PQmczZfNJs/4O65G7fZg0eSBseZYi
qKf7Uwtb3qW0R7jRUMEgFY4DKXVAA0G2ph40HXBuzOSsqlT6HqYMO2wgG8pJkrTc
qAjoSJGzfAHIsjxzxKI8wKuufCddjCm30VWWU7EKeriI6h1J0uPVqKkQMfYBTkik
696eZSBycAVgwayOng3XaehiTxOL7qGMTjUpDjUR6UscbiPG919vP+QsbIUuBXdb
YoddBQJdyGJiaCXv32ciJjo9bjPRRi/bII7Q5qzCNI2mi4ZVbudF4ffzyQvdHtNJ
nGnpRXoPi7kMvUrKTMPWkFjj0R5/UsPszsA51zbxPydfgBe0Dlc2PrrIG8dlzYAp
wbV0Lec+iJucKlt7EZtrjz1xOiOOaQt/5/cW1bWqL+wk2t6gAuY=
=9zTe
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull rdma updates from Jason Gunthorpe:
"A more active cycle than most of the recent past, with a few large,
long discussed works this time.
The RNBD block driver has been posted for nearly two years now, and
flowing through RDMA due to it also introducing a new ULP.
The removal of FMR has been a recurring discussion theme for a long
time.
And the usual smattering of features and bug fixes.
Summary:
- Various small driver bugs fixes in rxe, mlx5, hfi1, and efa
- Continuing driver cleanups in bnxt_re, hns
- Big cleanup of mlx5 QP creation flows
- More consistent use of src port and flow label when LAG is used and
a mlx5 implementation
- Additional set of cleanups for IB CM
- 'RNBD' network block driver and target. This is a network block
RDMA device specific to ionos's cloud environment. It brings strong
multipath and resiliency capabilities.
- Accelerated IPoIB for HFI1
- QP/WQ/SRQ ioctl migration for uverbs, and support for multiple
async fds
- Support for exchanging the new IBTA defiend ECE data during RDMA CM
exchanges
- Removal of the very old and insecure FMR interface from all ULPs
and drivers. FRWR should be preferred for at least a decade now"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (247 commits)
RDMA/cm: Spurious WARNING triggered in cm_destroy_id()
RDMA/mlx5: Return ECE DC support
RDMA/mlx5: Don't rely on FW to set zeros in ECE response
RDMA/mlx5: Return an error if copy_to_user fails
IB/hfi1: Use free_netdev() in hfi1_netdev_free()
RDMA/hns: Uninitialized variable in modify_qp_init_to_rtr()
RDMA/core: Move and rename trace_cm_id_create()
IB/hfi1: Fix hfi1_netdev_rx_init() error handling
RDMA: Remove 'max_map_per_fmr'
RDMA: Remove 'max_fmr'
RDMA/core: Remove FMR device ops
RDMA/rdmavt: Remove FMR memory registration
RDMA/mthca: Remove FMR support for memory registration
RDMA/mlx4: Remove FMR support for memory registration
RDMA/i40iw: Remove FMR leftovers
RDMA/bnxt_re: Remove FMR leftovers
RDMA/mlx5: Remove FMR leftovers
RDMA/core: Remove FMR pool API
RDMA/rds: Remove FMR support for memory registration
RDMA/srp: Remove support for FMR memory registration
...
now that toolchains long support PT_GNU_STACK marking and there's no
need anymore to force modern programs into having all its user mappings
executable instead of only the stack and the PROT_EXEC ones. Disable
that automatic READ_IMPLIES_EXEC forcing on x86-64 and arm64. Add tables
documenting how READ_IMPLIES_EXEC is handled on x86-64, arm and arm64.
By Kees Cook.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAl7YFDIACgkQEsHwGGHe
VUpnzxAAmXdODNOb1gGQvt+KJthkfkWh2A2R+tWxCRmFtjFTcS/eRxFfvGu2KmFY
2b2AcJzuJeGjs7WIvQU0pkR2p6STyzuSBBLj5J/OJR9FonQ4pPah38df4A0fOgI6
GJyJV9Ie7O2Ph1w2iLOeWBdmR90CnYuabxsfipgOL+sjHlEI0RqLSDgARRQsxTEj
KM+JVAFD472KcUJnQKBVBOD1I1DOVBGu12r3y6chgsOtwshLNW/cO15cDgYrgnJZ
OlR3EIUukCEEc1KQzUCihsypLuGfrmdq1MyPN8CME8gLfmOBsJyGRDhvmdbS+Wxh
kAMYQ9BuNP/jMVtN950qV0qUtnZCeIPlj1sDb9STWz5fInLsXDSCS0eYi32yBFi+
7yviVU95ml6Mda1Qd5axItTHFAjKIn0qfMZszkLOtUszIzNinCgH7t3ThoXeV223
BqrpntRwiGZVpXDdcp0QFYBsWSMchR47yuhL8pB4SWxQzgNzXqAEg2KFQU0XMDKp
pdia9IzUozg/BrjG5cnRfZhq2lBra7fy3Dn6fw5+NR5vqhka0Wr8L6dyM1Rj74EU
HPk5bRXgt0OIiIFPi4139ApY7k+8j2nbf12qUchue1ZVVKzbvK996FDXbrGgW3zD
Wis1wglxB9urSUTmC1bMOeyOd+gebo3i/ACAjgSo+EbDN7qW0Qw=
=2L7y
-----END PGP SIGNATURE-----
Merge tag 'core_core_updates_for_5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull READ_IMPLIES_EXEC changes from Borislav Petkov:
"Split the old READ_IMPLIES_EXEC workaround from executable
PT_GNU_STACK now that toolchains long support PT_GNU_STACK marking and
there's no need anymore to force modern programs into having all its
user mappings executable instead of only the stack and the PROT_EXEC
ones.
Disable that automatic READ_IMPLIES_EXEC forcing on x86-64 and
arm64.
Add tables documenting how READ_IMPLIES_EXEC is handled on x86-64, arm
and arm64.
By Kees Cook"
* tag 'core_core_updates_for_5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
arm64/elf: Disable automatic READ_IMPLIES_EXEC for 64-bit address spaces
arm32/64/elf: Split READ_IMPLIES_EXEC from executable PT_GNU_STACK
arm32/64/elf: Add tables to document READ_IMPLIES_EXEC
x86/elf: Disable automatic READ_IMPLIES_EXEC on 64-bit
x86/elf: Split READ_IMPLIES_EXEC from executable PT_GNU_STACK
x86/elf: Add table to document READ_IMPLIES_EXEC
Before this patch, transactions could be merged into the system
transaction by function gfs2_merge_trans(), but the transaction ail
lists were never merged. Because the ail flushing mechanism can run
separately, bd elements can be attached to the transaction's buffer
list during the transaction (trans_add_meta, etc) but quickly moved
to its ail lists. Later, in function gfs2_trans_end, the transaction
can be freed (by gfs2_trans_end) while it still has bd elements
queued to its ail lists, which can cause it to either lose track of
the bd elements altogether (memory leak) or worse, reference the bd
elements after the parent transaction has been freed.
Although I've not seen any serious consequences, the problem becomes
apparent with the previous patch's addition of:
gfs2_assert_warn(sdp, list_empty(&tr->tr_ail1_list));
to function gfs2_trans_free().
This patch adds logic into gfs2_merge_trans() to move the merged
transaction's ail lists to the sdp transaction. This prevents the
use-after-free. To do this properly, we need to hold the ail lock,
so we pass sdp into the function instead of the transaction itself.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
This patch adds a new slab for gfs2 transactions. That allows us to
reduce kernel memory fragmentation, have better organization of data
for analysis of vmcore dumps. A new centralized function is added to
free the slab objects, and it exposes use-after-free by giving
warnings if a transaction is freed while it still has bd elements
attached to its buffers or ail lists. We make sure to initialize
those transaction ail lists so we can check their integrity when freeing.
At a later time, we should add a slab initialization function to
make it more efficient, but for this initial patch I wanted to
minimize the impact.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Since transactions may be freed shortly after they're created, before
a log_flush occurs, we need to initialize their ail1 and ail2 lists
earlier. Before this patch, the ail1 list was initialized in gfs2_log_flush().
This moves the initialization to the point when the transaction is first
created.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
When trying to upgrade the iopen glock from a shared to an exclusive lock in
gfs2_evict_inode, abort the wait if there is contention on the corresponding
inode glock: in that case, the inode must still be in active use on another
node, and we're not guaranteed to get the iopen glock anytime soon.
To make this work even better, when we notice contention on the iopen glock and
we can't evict the corresponsing inode and release the iopen glock immediately,
poke the inode glock. The other node(s) trying to acquire the lock can then
abort instead of timing out.
Thanks to Heinz Mauelshagen for pointing out a locking bug in a previous
version of this patch.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
In delete_work_func, if the iopen glock still has an inode attached,
limit the inode lookup to that specific generation number: in the likely
case that the inode was deleted on the node on which the inode's link
count dropped to zero, we can skip verifying the on-disk block type and
reading in the inode. The same applies if another node that had the
inode open managed to delete the inode before us.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Move the inode generation number check from gfs2_lookup_by_inum into
gfs2_inode_lookup: gfs2_inode_lookup may be able to decide that an inode with
the given inode generation number cannot exist without having to verify the
block type or reading the inode from disk.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Use a zero no_formal_ino instead of a NULL pointer to indicate that any inode
generation number will qualify: a valid inode never has a zero no_formal_ino.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
When an inode's link count drops to zero and the inode is cached on
other nodes, the current behavior of gfs2 is to immediately give up and
to rely on the other node(s) to delete the inode if there is iopen glock
contention. This leads to resource group glock bouncing and the loss of
caching. With the previous patches in place, we can fix that by not
giving up immediately.
When the inode is still open on other nodes, those nodes won't be able
to evict the inode and give up the iopen glock. In that case, our lock
conversion request will time out. The unlink system call will block for
the duration of the iopen lock conversion request. We're also holding
the inode glock in EX mode for an extended duration, so other nodes
won't be able to make progress on the inode, either.
This is worse than what we had before, but we can prevent other nodes
from getting stuck by aborting our iopen locking request if there is
contention on the inode glock. This will the the subject of a future
patch.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
When there's contention on the iopen glock, it means that the link count
of the corresponding inode has dropped to zero on a remote node which is
now trying to delete the inode. In that case, try to evict the inode so
that the iopen glock will be released, which will allow the remote node
to do its job.
When the inode is still open locally, the inode's reference count won't
drop to zero and so we'll keep holding the inode and its iopen glock.
The remote node will time out its request to grab the iopen glock, and
when the inode is finally closed locally, we'll try to delete it
ourself.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
This requires flushing delayed work items in gfs2_make_fs_ro (which is called
before unmounting a filesystem).
When inodes are deleted and then recreated, pending gl_delete work items would
have no effect because the inode generations will have changed, so we can
cancel any pending gl_delete works before reusing iopen glocks.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
When deleting an inode, keep track of the generation of the deleted inode in
the inode glock Lock Value Block (LVB). When trying to delete an inode
remotely, check the last-known inode generation against the deleted inode
generation to skip duplicate remote deletes. This avoids taking the resource
group glock in order to verify the block type.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
This adds checks for gfs2_log_flush being stuck, similarly to the check
in gfs2_ail1_flush. To faciliate this and make the strings easy to grep
we move the ail1 emptying to its own function, empty_ail1_list.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Before this patch, asserts based on glocks did not print the glock with
the error. This patch introduces a new macro, gfs2_glock_assert_withdraw
which first prints the glock, then takes the assert.
This also changes a few glock asserts to the new macro.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
This patch makes the glock dumps in debugfs print the number of pages
(nrpages) for address space glocks. This will aid in debugging.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Merge yet more updates from Andrew Morton:
- More MM work. 100ish more to go. Mike Rapoport's "mm: remove
__ARCH_HAS_5LEVEL_HACK" series should fix the current ppc issue
- Various other little subsystems
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (127 commits)
lib/ubsan.c: fix gcc-10 warnings
tools/testing/selftests/vm: remove duplicate headers
selftests: vm: pkeys: fix multilib builds for x86
selftests: vm: pkeys: use the correct page size on powerpc
selftests/vm/pkeys: override access right definitions on powerpc
selftests/vm/pkeys: test correct behaviour of pkey-0
selftests/vm/pkeys: introduce a sub-page allocator
selftests/vm/pkeys: detect write violation on a mapped access-denied-key page
selftests/vm/pkeys: associate key on a mapped page and detect write violation
selftests/vm/pkeys: associate key on a mapped page and detect access violation
selftests/vm/pkeys: improve checks to determine pkey support
selftests/vm/pkeys: fix assertion in test_pkey_alloc_exhaust()
selftests/vm/pkeys: fix number of reserved powerpc pkeys
selftests/vm/pkeys: introduce powerpc support
selftests/vm/pkeys: introduce generic pkey abstractions
selftests: vm: pkeys: use the correct huge page size
selftests/vm/pkeys: fix alloc_random_pkey() to make it really random
selftests/vm/pkeys: fix assertion in pkey_disable_set/clear()
selftests/vm/pkeys: fix pkey_disable_clear()
selftests: vm: pkeys: add helpers for pkey bits
...
Currently copy_string_kernel is just a wrapper around copy_strings that
simplifies the calling conventions and uses set_fs to allow passing a
kernel pointer. But due to the fact the we only need to handle a single
kernel argument pointer, the logic can be sigificantly simplified while
getting rid of the set_fs.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Link: http://lkml.kernel.org/r/20200501104105.2621149-3-hch@lst.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
copy_strings_kernel is always used with a single argument,
adjust the calling convention to that.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Link: http://lkml.kernel.org/r/20200501104105.2621149-2-hch@lst.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use a more common logging style.
Add and use pr_fmt, coalesce the format string, align arguments,
use better grammar.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Vasily Averin <vvs@virtuozzo.com>
Link: http://lkml.kernel.org/r/96ff603230ca1bd60034c36519be3930c3a3a226.camel@perches.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Current readahead for FAT entries is very simple but is having some flaws,
so it is not working well for some environments. This patch improves the
readahead more or less.
The key points of modification are,
- make the readahead size tunable by using bdi->ra_pages
- care the bdi->io_pages to avoid the small size I/O request
- update readahead window before fully exhausting
With this patch, on slow USB connected 2TB hdd:
[before]
383.18sec
[after]
51.03sec
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: hyeongseok.kim <hyeongseok.kim@lge.com>
Reviewed-by: hyeongseok.kim <hyeongseok.kim@lge.com>
Link: http://lkml.kernel.org/r/87d08e1dlh.fsf@mail.parknet.co.jp
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If FAT length == 0, the image doesn't have any data. And it can be the
cause of overlapping the root dir and FAT entries.
Also Windows treats it as invalid format.
Reported-by: syzbot+6f1624f937d9d6911e2d@syzkaller.appspotmail.com
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Marco Elver <elver@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Link: http://lkml.kernel.org/r/87r1wz8mrd.fsf@mail.parknet.co.jp
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The ifndef was added a long time ago to support archs that would define
their own mapping function. The last user was the metag arch which was
removed from the tree, and as such there are no users left. Let's kill
it.
Signed-off-by: Anthony Iliopoulos <ailiop@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20200402161543.4119-1-ailiop@suse.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
"catch" is reserved keyword in C++, rename it to something both gcc and
g++ accept.
Rename "ign" for symmetry.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20200331210905.GA31680@avx2
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull execve updates from Eric Biederman:
"Last cycle for the Nth time I ran into bugs and quality of
implementation issues related to exec that could not be easily be
fixed because of the way exec is implemented. So I have been digging
into exec and cleanup up what I can.
I don't think I have exec sorted out enough to fix the issues I
started with but I have made some headway this cycle with 4 sets of
changes.
- promised cleanups after introducing exec_update_mutex
- trivial cleanups for exec
- control flow simplifications
- remove the recomputation of bprm->cred
The net result is code that is a bit easier to understand and work
with and a decrease in the number of lines of code (if you don't count
the added tests)"
* 'exec-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (24 commits)
exec: Compute file based creds only once
exec: Add a per bprm->file version of per_clear
binfmt_elf_fdpic: fix execfd build regression
selftests/exec: Add binfmt_script regression test
exec: Remove recursion from search_binary_handler
exec: Generic execfd support
exec/binfmt_script: Don't modify bprm->buf and then return -ENOEXEC
exec: Move the call of prepare_binprm into search_binary_handler
exec: Allow load_misc_binary to call prepare_binprm unconditionally
exec: Convert security_bprm_set_creds into security_bprm_repopulate_creds
exec: Factor security_bprm_creds_for_exec out of security_bprm_set_creds
exec: Teach prepare_exec_creds how exec treats uids & gids
exec: Set the point of no return sooner
exec: Move handling of the point of no return to the top level
exec: Run sync_mm_rss before taking exec_update_mutex
exec: Fix spelling of search_binary_handler in a comment
exec: Move the comment from above de_thread to above unshare_sighand
exec: Rename flush_old_exec begin_new_exec
exec: Move most of setup_new_exec into flush_old_exec
exec: In setup_new_exec cache current in the local variable me
...
Pull proc updates from Eric Biederman:
"This has four sets of changes:
- modernize proc to support multiple private instances
- ensure we see the exit of each process tid exactly
- remove has_group_leader_pid
- use pids not tasks in posix-cpu-timers lookup
Alexey updated proc so each mount of proc uses a new superblock. This
allows people to actually use mount options with proc with no fear of
messing up another mount of proc. Given the kernel's internal mounts
of proc for things like uml this was a real problem, and resulted in
Android's hidepid mount options being ignored and introducing security
issues.
The rest of the changes are small cleanups and fixes that came out of
my work to allow this change to proc. In essence it is swapping the
pids in de_thread during exec which removes a special case the code
had to handle. Then updating the code to stop handling that special
case"
* 'proc-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
proc: proc_pid_ns takes super_block as an argument
remove the no longer needed pid_alive() check in __task_pid_nr_ns()
posix-cpu-timers: Replace __get_task_for_clock with pid_for_clock
posix-cpu-timers: Replace cpu_timer_pid_type with clock_pid_type
posix-cpu-timers: Extend rcu_read_lock removing task_struct references
signal: Remove has_group_leader_pid
exec: Remove BUG_ON(has_group_leader_pid)
posix-cpu-timer: Unify the now redundant code in lookup_task
posix-cpu-timer: Tidy up group_leader logic in lookup_task
proc: Ensure we see the exit of each process tid exactly once
rculist: Add hlists_swap_heads_rcu
proc: Use PIDTYPE_TGID in next_tgid
Use proc_pid_ns() to get pid_namespace from the proc superblock
proc: use named enums for better readability
proc: use human-readable values for hidepid
docs: proc: add documentation for "hidepid=4" and "subset=pid" options and new mount behavior
proc: add option to mount only a pids subset
proc: instantiate only pids that we can ptrace on 'hidepid=4' mount option
proc: allow to mount many instances of proc in one pid namespace
proc: rename struct proc_fs_info to proc_fs_opts
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAl7Y7B4ACgkQnJ2qBz9k
QNnWtAf+OJz782G6BsJrZtOgm5Vm+CSHmdKN8GHnDACT+mlNrTrLZi1OvfWjXtU/
UxX+l9w3OU/RW5uiMYrgN1Ajt5eIxT7AmszA1v7hbpLwIQzstW23DgEZLwB74+JA
xLMH7xCb2jiVXWb0yQPLTiVHfGN99I4RHSWnc+OaIXe6qO6yIS3uS/k7PWMk9sSx
BRfDKAxXjoz6Is9r6BYg1Ds4ZsmwmouoDIoA5h/PhRH07VArqTkMw3ahy2rZ61Ls
1IkU8zYKZdV2oKTRfQYxlCaEWE+65GZerTyAPuzHya93pAXAlfosIiXg6EnjiovB
jseIlGbzVtZbuAug+OhXivd2U7H+Aw==
=lWbb
-----END PGP SIGNATURE-----
Merge tag 'for_v5.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull ext2 and reiserfs cleanups from Jan Kara:
"Two small cleanups for ext2 and one for reiserfs"
* tag 'for_v5.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
reiserfs: Replace kmalloc with kcalloc in the comment
ext2: code cleanup by removing ifdef macro surrounding
ext2: Fix i_op setting for special inode
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAl7Y2McACgkQnJ2qBz9k
QNlHzwf/e4oz9oRCXPqBwh6C318nl6ksQO5ooW+Dhb535cr/Cn99nuZa3GrvW+aq
eSbypsvZQMguk0/okEc4jcTgLmEw+KubpBXOi/DJZ9dzGQrvjT2nBkQmaTqwp9dO
WMZcJLmszkrtokjKD4lVjyQArcwqQF/v/moEKIImw5A6CY4R4odTaUOCPnTwF7P6
OXsDPwRfAccJ25ZUZ8hjc+fRl/Ncex6szciaJ08T4btlaAtc5UIn5Sy/u8BqNNiw
0VRheD4sJ2c25hLOIQJ5RETIeuYaRcR/BA3vm+k1d2iIiw4ubj9+ppwiaWOryA9U
5fXnBmXKuUUrwFihzmiLSckIpm3IPg==
=kghV
-----END PGP SIGNATURE-----
Merge tag 'fsnotify_for_v5.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull fsnotify updates from Jan Kara:
"Several smaller fixes and cleanups for fsnotify subsystem"
* tag 'fsnotify_for_v5.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
fanotify: fix ignore mask logic for events on child and on dir
fanotify: don't write with size under sizeof(response)
fsnotify: Remove proc_fs.h include
fanotify: remove reference to fill_event_metadata()
fsnotify: add mutex destroy
fanotify: prefix should_merge()
fanotify: Replace zero-length array with flexible-array
inotify: Fix error return code assignment flow.
fsnotify: Add missing annotation for fsnotify_finish_user_wait() and for fsnotify_prepare_user_wait()
Only one patch in this pull request to cleanup handling of uuid using
the import_uuid() helper, from Andy.
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQSRPv8tYSvhwAzJdzjdoc3SxdoYdgUCXtg3xAAKCRDdoc3SxdoY
dp+AAQCIAQpe4qyF5hJtwLPY+qffDDuHDxHjrERpA6c7fpKicgD+K6uDIwZ8Y6L8
XXYPmKer58rV61jX4hvZGCAYwLmzRwA=
=Vc+W
-----END PGP SIGNATURE-----
Merge tag 'zonefs-5.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs
Pull zonefs update from Damien Le Moal:
"Only one patch in this pull request to cleanup handling of uuid using
the import_uuid() helper, from Andy"
* tag 'zonefs-5.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs:
zonefs: Replace uuid_copy() with import_uuid()
first steps in trying to make channels properly reconnect.
* add cifs_ses_find_chan() function to find the enclosing cifs_chan
struct it belongs to
* while we have the session lock and are redoing negprot and
sess.setup in smb2_reconnect() redo the binding of channels.
Signed-off-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Add a cifs_chan pointer in struct cifs_ses that points to the channel
currently being bound if ses->binding is true.
Previously it was always the channel past the established count.
This will make reconnecting (and rebinding) a channel easier later on.
Signed-off-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Remove static checker warning pointed out by Dan Carpenter:
The patch feeaec621c09: "cifs: multichannel: move channel selection
above transport layer" from Apr 24, 2020, leads to the following
static checker warning:
fs/cifs/smb2pdu.c:149 smb2_hdr_assemble()
error: we previously assumed 'tcon->ses' could be null (see line 133)
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
CC: Aurelien Aptel <aptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Move the channel (TCP_Server_Info*) selection from the tranport
layer to higher in the call stack so that:
- credit handling is done with the server that will actually be used
to send.
* ->wait_mtu_credit
* ->set_credits / set_credits
* ->add_credits / add_credits
* add_credits_and_wake_if
- potential reconnection (smb2_reconnect) done when initializing a
request is checked and done with the server that will actually be
used to send.
To do this:
- remove the cifs_pick_channel() call out of compound_send_recv()
- select channel and pass it down by adding a cifs_pick_channel(ses)
call in:
- smb311_posix_mkdir
- SMB2_open
- SMB2_ioctl
- __SMB2_close
- query_info
- SMB2_change_notify
- SMB2_flush
- smb2_async_readv (if none provided in context param)
- SMB2_read (if none provided in context param)
- smb2_async_writev (if none provided in context param)
- SMB2_write (if none provided in context param)
- SMB2_query_directory
- send_set_info
- SMB2_oplock_break
- SMB311_posix_qfs_info
- SMB2_QFS_info
- SMB2_QFS_attr
- smb2_lockv
- SMB2_lease_break
- smb2_compound_op
- smb2_set_ea
- smb2_ioctl_query_info
- smb2_query_dir_first
- smb2_query_info_comound
- smb2_query_symlink
- cifs_writepages
- cifs_write_from_iter
- cifs_send_async_read
- cifs_read
- cifs_readpages
- add TCP_Server_Info *server param argument to:
- cifs_send_recv
- compound_send_recv
- SMB2_open_init
- SMB2_query_info_init
- SMB2_set_info_init
- SMB2_close_init
- SMB2_ioctl_init
- smb2_iotcl_req_init
- SMB2_query_directory_init
- SMB2_notify_init
- SMB2_flush_init
- build_qfs_info_req
- smb2_hdr_assemble
- smb2_reconnect
- fill_small_buf
- smb2_plain_req_init
- __smb2_plain_req_init
The read/write codepath is different than the rest as it is using
pages, io iterators and async calls. To deal with those we add a
server pointer in the cifs_writedata/cifs_readdata/cifs_io_parms
context struct and set it in:
- cifs_writepages (wdata)
- cifs_write_from_iter (wdata)
- cifs_readpages (rdata)
- cifs_send_async_read (rdata)
The [rw]data->server pointer is eventually copied to
cifs_io_parms->server to pass it down to SMB2_read/SMB2_write.
If SMB2_read/SMB2_write is called from a different place that doesn't
set the server field it will pick a channel.
Some places do not pick a channel and just use ses->server or
cifs_ses_server(ses). All cifs_ses_server(ses) calls are in codepaths
involving negprot/sess.setup.
- SMB2_negotiate (binding channel)
- SMB2_sess_alloc_buffer (binding channel)
- SMB2_echo (uses provided one)
- SMB2_logoff (uses master)
- SMB2_tdis (uses master)
(list not exhaustive)
Signed-off-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
SMB2_read/SMB2_write check and use cifs_io_parms->server, which might
be uninitialized memory.
This change makes all callers zero-initialize the struct.
Signed-off-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Currently the end user is unaware with what sec type the
cifs share is mounted if no sec=<type> option is parsed.
With this patch one can easily check from DebugData.
Example:
1) Name: x.x.x.x Uses: 1 Capability: 0x8001f3fc Session Status: 1 Security type: RawNTLMSSP
Signed-off-by: Kenneth D'souza <kdsouza@redhat.com>
Signed-off-by: Roberto Bergantinos Corpas <rbergant@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Acked-by: Aurelien Aptel <aaptel@suse.com>
In case a compressed file is getting overwritten, the current retry
logic doesn't include the current page to be retried now as it sets
the new start index as 0 and new end index as writeback_index - 1.
This causes the corresponding cluster to be uncompressed and written
as normal pages without compression. Fix this by allowing writeback to
be retried for the current page as well (in case of compressed page
getting retried due to index mismatch with cluster index). So that
this cluster can be written compressed in case of overwrite.
Also, align f2fs_write_cache_pages() according to the change -
<64081362e8ff>("mm/page-writeback.c: fix range_cyclic writeback vs
writepages deadlock").
Signed-off-by: Sahitya Tummala <stummala@codeaurora.org>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
We already have the buffer selected, but we should set the iter list
again.
Cc: stable@vger.kernel.org # v5.7
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Fail recv/send in case of IORING_SETUP_IOPOLL earlier during prep,
so it'd be done only once. Removes duplication as well
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_openat_prep() and io_openat2_prep() are identical except for how
struct open_how is built. Deduplicate it with a helper.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
build_open_how() is just adjusting open_flags/mode. Do it once during
prep. It looks better than storing raw values for the future.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
IORING_SETUP_IOPOLL is defined only for read/write, other opcodes should
be disallowed, otherwise it'll get an error as below. Also refuse
open/close with SQPOLL, as the polling thread wouldn't know which file
table to use.
RIP: 0010:io_iopoll_getevents+0x111/0x5a0
Call Trace:
? _raw_spin_unlock_irqrestore+0x24/0x40
? do_send_sig_info+0x64/0x90
io_iopoll_reap_events.part.0+0x5e/0xa0
io_ring_ctx_wait_and_kill+0x132/0x1c0
io_uring_release+0x20/0x30
__fput+0xcd/0x230
____fput+0xe/0x10
task_work_run+0x67/0xa0
do_exit+0x353/0xb10
? handle_mm_fault+0xd4/0x200
? syscall_trace_enter+0x18c/0x2c0
do_group_exit+0x43/0xa0
__x64_sys_exit_group+0x18/0x20
do_syscall_64+0x60/0x1e0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
[axboe: allow provide/remove buffers and files update]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Adjust the fileserver rotation algorithm so that if we've tried all the
addresses on a server (cumulatively over multiple operations) until we've
run out of untried addresses, immediately reprobe all that server's
interfaces and retry the op at least once before we move onto the next
server.
Signed-off-by: David Howells <dhowells@redhat.com>
Display more information about the state of a server record, including the
flags, rtt and break counter plus the probe state for each server in
/proc/net/afs/servers.
Rearrange the server flags a bit to make them easier to read at a glance in
the proc file.
Signed-off-by: David Howells <dhowells@redhat.com>
Don't use the running state for fileserver probes to make decisions about
which server to use as the state is cleared at the start of a probe and
also intermediate values might be misleading.
Instead, add a separate 'latest known' rtt in the afs_server struct and a
flag to indicate if the server is known to be responding and update these
as and when we know what to change them to.
Signed-off-by: David Howells <dhowells@redhat.com>
Fix afs_statfs() so that the value for f_bavail and f_bfree don't go
"negative" if the number of blocks in use by a volume exceeds the max quota
for that volume.
Signed-off-by: David Howells <dhowells@redhat.com>
Whilst it shouldn't happen, it is possible for multiple fileservers to
share a UUID, particularly if an entire cell has been duplicated, UUIDs and
all. In such a case, it's not necessarily possible to map the effect of
the CB.InitCallBackState3 incoming RPC to a specific server unambiguously
by UUID and thus to a specific cell.
Indeed, there's a problem whereby multiple server records may need to
occupy the same spot in the rb_tree rooted in the afs_net struct.
Fix this by allowing servers to form a list, with the head of the list in
the tree. When the front entry in the list is removed, the second in the
list just replaces it. afs_init_callback_state() then just goes down the
line, poking each server in the list.
This means that some servers will be unnecessarily poked, unfortunately.
An alternative would be to route by call parameters.
Reported-by: Jeffrey Altman <jaltman@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Fixes: d2ddc776a4 ("afs: Overhaul volume and server record caching and fileserver rotation")
Reorganise afs_volume objects such that they're in a tree keyed on volume
ID, rooted at on an afs_cell object rather than being in multiple trees,
each of which is rooted on an afs_server object.
afs_server structs become per-cell and acquire a pointer to the cell.
The process of breaking a callback then starts with finding the server by
its network address, following that to the cell and then looking up each
volume ID in the volume tree.
This is simpler than the afs_vol_interest/afs_cb_interest N:M mapping web
and allows those structs and the code for maintaining them to be simplified
or removed.
It does make a couple of things a bit more tricky, though:
(1) Operations now start with a volume, not a server, so there can be more
than one answer as to whether or not the server we'll end up using
supports the FS.InlineBulkStatus RPC.
(2) CB RPC operations that specify the server UUID. There's still a tree
of servers by UUID on the afs_net struct, but the UUIDs in it aren't
guaranteed unique.
Signed-off-by: David Howells <dhowells@redhat.com>
YFS Volume Location servers have an operation by which the cell name may be
queried. Use this to find out what a YFS server thinks the canonical cell
name should be.
Signed-off-by: David Howells <dhowells@redhat.com>
Implement the second phase of cell alias detection. This part handles
alias detection for cells that don't have root.cell volumes and so we have
to find some other volume or fileserver to query.
We take the first volume from each such cell and attempt to look it up in
the new cell. If found, we compare the records, if they are the same, we
judge the cell names to be aliases.
Signed-off-by: David Howells <dhowells@redhat.com>
Put in the first phase of cell alias detection. This part handles alias
detection for cells that have root.cell volumes (which is expected to be
likely).
When a cell becomes newly active, it is probed for its root.cell volume,
and if it has one, this volume is compared against other root.cell volumes
to find out if the list of fileserver UUIDs have any in common - and if
that's the case, do the address lists of those fileservers have any
addresses in common. If they do, the new cell is adjudged to be an alias
of the old cell and the old cell is used instead.
Comparing is aided by the server list in struct afs_server_list being
sorted in UUID order and the addresses in the fileserver address lists
being sorted in address order.
The cell then retains the afs_volume object for the root.cell volume, even
if it's not mounted for future alias checking.
This necessary because:
(1) Whilst fileservers have UUIDs that are meant to be globally unique, in
practice they are not because cells get cloned without changing the
UUIDs - so afs_server records need to be per cell.
(2) Sometimes the DNS is used to make cell aliases - but if we don't know
they're the same, we may end up with multiple superblocks and multiple
afs_server records for the same thing, impairing our ability to
deliver callback notifications of third party changes
(3) The fileserver RPC API doesn't contain the cell name, so it can't tell
us which cell it's notifying and can't see that a change made to to
one cell should notify the same client that's also accessed as the
other cell.
Reported-by: Jeffrey Altman <jaltman@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Implement client support for the YFSVL.GetCellName RPC operation by which
YFS permits the canonical cell name to be queried from a VL server.
Signed-off-by: David Howells <dhowells@redhat.com>
Save more bits from the volume location database record obtained for a
server so that we can use this information in cell alias detection.
Signed-off-by: David Howells <dhowells@redhat.com>
The AFS filesystem driver is handling the CB.ProbeUuid request incorrectly.
The UUID presented in the request is that of the cache manager, not the
fileserver, so afs_deliver_cb_probe_uuid() shouldn't be using that UUID to
look up the server.
Fix this by looking up the server by address instead.
Signed-off-by: David Howells <dhowells@redhat.com>
Don't get the epoch from a server, particularly one that we're looking up
by UUID, as UUIDs may be ambiguous and may map to more than one server - so
we can't draw any conclusions from it.
Reported-by: Jeffrey Altman <jaltman@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Turn the afs_operation struct into the main way that most fileserver
operations are managed. Various things are added to the struct, including
the following:
(1) All the parameters and results of the relevant operations are moved
into it, removing corresponding fields from the afs_call struct.
afs_call gets a pointer to the op.
(2) The target volume is made the main focus of the operation, rather than
the target vnode(s), and a bunch of op->vnode->volume are made
op->volume instead.
(3) Two vnode records are defined (op->file[]) for the vnode(s) involved
in most operations. The vnode record (struct afs_vnode_param)
contains:
- The vnode pointer.
- The fid of the vnode to be included in the parameters or that was
returned in the reply (eg. FS.MakeDir).
- The status and callback information that may be returned in the
reply about the vnode.
- Callback break and data version tracking for detecting
simultaneous third-parth changes.
(4) Pointers to dentries to be updated with new inodes.
(5) An operations table pointer. The table includes pointers to functions
for issuing AFS and YFS-variant RPCs, handling the success and abort
of an operation and handling post-I/O-lock local editing of a
directory.
To make this work, the following function restructuring is made:
(A) The rotation loop that issues calls to fileservers that can be found
in each function that wants to issue an RPC (such as afs_mkdir()) is
extracted out into common code, in a new file called fs_operation.c.
(B) The rotation loops, such as the one in afs_mkdir(), are replaced with
a much smaller piece of code that allocates an operation, sets the
parameters and then calls out to the common code to do the actual
work.
(C) The code for handling the success and failure of an operation are
moved into operation functions (as (5) above) and these are called
from the core code at appropriate times.
(D) The pseudo inode getting stuff used by the dynamic root code is moved
over into dynroot.c.
(E) struct afs_iget_data is absorbed into the operation struct and
afs_iget() expects to be given an op pointer and a vnode record.
(F) Point (E) doesn't work for the root dir of a volume, but we know the
FID in advance (it's always vnode 1, unique 1), so a separate inode
getter, afs_root_iget(), is provided to special-case that.
(G) The inode status init/update functions now also take an op and a vnode
record.
(H) The RPC marshalling functions now, for the most part, just take an
afs_operation struct as their only argument. All the data they need
is held there. The result delivery functions write their answers
there as well.
(I) The call is attached to the operation and then the operation core does
the waiting.
And then the new operation code is, for the moment, made to just initialise
the operation, get the appropriate vnode I/O locks and do the same rotation
loop as before.
This lays the foundation for the following changes in the future:
(*) Overhauling the rotation (again).
(*) Support for asynchronous I/O, where the fileserver rotation must be
done asynchronously also.
Signed-off-by: David Howells <dhowells@redhat.com>
Overlayfs is using clone_private_mount() to create internal mounts for
underlying layers. These are used for operations requiring a path, such as
dentry_open().
Since these private mounts are not in any namespace they are treated as
short term, "detached" mounts and mntput() involves taking the global
mount_lock, which can result in serious cacheline pingpong.
Make these private mounts longterm instead, which trade the penalty on
mntput() for a slightly longer shutdown time due to an added RCU grace
period when putting these mounts.
Introduce a new helper kern_unmount_many() that can take care of multiple
longterm mounts with a single RCU grace period.
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
ofs->upper_mnt is copied to ->layers[0].mnt and ->layers[0].trap could be
used instead of a separate ->upperdir_trap.
Split the lowerdir option early to get the number of layers, then allocate
the ->layers array, and finally fill the upper and lower layers, as before.
Get rid of path_put_init() in ovl_lower_dir(), since the only caller will
take care of that.
[Colin Ian King] Fix null pointer dereference on null stack pointer on
error return found by Coverity.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
In ovl_copy_xattr, if all the xattrs to be copied are overlayfs private
xattrs, the copy loop will terminate without assigning anything to the
error variable, thus returning an uninitialized value.
If ovl_copy_xattr is called from ovl_clear_empty, this uninitialized error
value is put into a pointer by ERR_PTR(), causing potential invalid memory
accesses down the line.
This commit initialize error with 0. This is the correct value because when
there's no xattr to copy, because all xattrs are private, ovl_copy_xattr
should succeed.
This bug is discovered with the help of INIT_STACK_ALL and clang.
Signed-off-by: Yuxuan Shui <yshuiv7@gmail.com>
Link: https://bugs.chromium.org/p/chromium/issues/detail?id=1050405
Fixes: 0956254a2d ("ovl: don't copy up opaqueness")
Cc: stable@vger.kernel.org # v4.8
Signed-off-by: Alexander Potapenko <glider@google.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
We were not checking to see if ioctl requests asked for more than
64K (ie when CIFSMaxBufSize was > 64K) so when setting larger
CIFSMaxBufSize then ioctls would fail with invalid parameter errors.
When requests ask for more than 64K in MaxOutputResponse then we
need to ask for more than 1 credit.
Signed-off-by: Steve French <stfrench@microsoft.com>
CC: Stable <stable@vger.kernel.org>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
When "multichannel" is specified on mount, make sure to default to
at least two channels.
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Merge more updates from Andrew Morton:
"More mm/ work, plenty more to come
Subsystems affected by this patch series: slub, memcg, gup, kasan,
pagealloc, hugetlb, vmscan, tools, mempolicy, memblock, hugetlbfs,
thp, mmap, kconfig"
* akpm: (131 commits)
arm64: mm: use ARCH_HAS_DEBUG_WX instead of arch defined
x86: mm: use ARCH_HAS_DEBUG_WX instead of arch defined
riscv: support DEBUG_WX
mm: add DEBUG_WX support
drivers/base/memory.c: cache memory blocks in xarray to accelerate lookup
mm/thp: rename pmd_mknotpresent() as pmd_mkinvalid()
powerpc/mm: drop platform defined pmd_mknotpresent()
mm: thp: don't need to drain lru cache when splitting and mlocking THP
hugetlbfs: get unmapped area below TASK_UNMAPPED_BASE for hugetlbfs
sparc32: register memory occupied by kernel as memblock.memory
include/linux/memblock.h: fix minor typo and unclear comment
mm, mempolicy: fix up gup usage in lookup_node
tools/vm/page_owner_sort.c: filter out unneeded line
mm: swap: memcg: fix memcg stats for huge pages
mm: swap: fix vmstats for huge pages
mm: vmscan: limit the range of LRU type balancing
mm: vmscan: reclaim writepage is IO cost
mm: vmscan: determine anon/file pressure balance at the reclaim root
mm: balance LRU lists based on relative thrashing
mm: only count actual rotations as LRU reclaim cost
...
ext4_writepages() currently works in a loop like:
start a transaction
scan inode for pages to write
map and submit these pages
stop the transaction
This loop results in starting transaction once more than is needed
because in the last iteration we start a transaction only to scan the
inode and find there are no pages to write. This can be significant
increase in number of transaction starts for single-extent files or
files that have all blocks already mapped. Furthermore we already know
from previous iteration whether there are more pages to write or not. So
propagate the information from mpage_prepare_extent_to_map() and avoid
unnecessary looping in case there are no more pages to write.
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20200525081215.29451-1-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Running with some debug patches to detect illegal blocking triggered the
extend/unaligned condition in ext4. If ext4 needs to extend the file (and
hence go to buffered IO), or if the app is doing unaligned IO, then ext4
asks the iomap code to wait for IO completion. If the caller asked for
no-wait semantics by setting IOCB_NOWAIT, then ext4 should return -EAGAIN
instead.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/76152096-2bbb-7682-8fce-4cb498bcd909@kernel.dk
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
access_ok just checks we are fed a proper user pointer. We also do that
in copy_to_user itself, so no need to do this early.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20200523073016.2944131-10-hch@lst.de
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
access_ok just checks we are fed a proper user pointer. We also do that
in copy_to_user itself, so no need to do this early.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Link: https://lore.kernel.org/r/20200523073016.2944131-9-hch@lst.de
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
By moving FIEMAP_FLAG_SYNC handling to fiemap_prep we ensure it is
handled once instead of duplicated, but can still be done under fs locks,
like xfs/iomap intended with its duplicate handling. Also make sure the
error value of filemap_write_and_wait is propagated to user space.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Link: https://lore.kernel.org/r/20200523073016.2944131-8-hch@lst.de
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Replace fiemap_check_flags with a fiemap_prep helper that also takes the
inode and mapped range, and performs the sanity check and truncation
previously done in fiemap_check_range. This way the validation is inside
the file system itself and thus properly works for the stacked overlayfs
case as well.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Link: https://lore.kernel.org/r/20200523073016.2944131-7-hch@lst.de
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
iomap_fiemap should take u64 start and len arguments, just like the
->fiemap prototype.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Link: https://lore.kernel.org/r/20200523073016.2944131-6-hch@lst.de
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
No need to pull the fiemap definitions into almost every file in the
kernel build.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Link: https://lore.kernel.org/r/20200523073016.2944131-5-hch@lst.de
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
There is no caller left outside of ioctl.c.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Link: https://lore.kernel.org/r/20200523073016.2944131-4-hch@lst.de
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
iomap_fiemap already calls fiemap_check_flags first thing, so this
additional check is redundant.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20200523073016.2944131-3-hch@lst.de
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
The fiemap and EXT4_IOC_GET_ES_CACHE cases share almost no code, so split
them into entirely separate functions.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20200523073016.2944131-2-hch@lst.de
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Add an extra validation of the len parameter, as for ext4 some files
might have smaller file size limits than others. This also means the
redundant size check in ext4_ioctl_get_es_cache can go away, as all
size checking is done in the shared fiemap handler.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20200505154324.3226743-3-hch@lst.de
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
ext4 supports max number of logical blocks in a file to be 0xffffffff.
(This is since ext4_extent's ee_block is __le32).
This means that EXT4_MAX_LOGICAL_BLOCK should be 0xfffffffe (starting
from 0 logical offset). This patch fixes this.
The issue was seen when ext4 moved to iomap_fiemap API and when
overlayfs was mounted on top of ext4. Since overlayfs was missing
filemap_check_ranges(), so it could pass a arbitrary huge length which
lead to overflow of map.m_len logic.
This patch fixes that.
Fixes: d3b6f23f71 ("ext4: move ext4_fiemap to use iomap framework")
Reported-by: syzbot+77fa5bdb65cc39711820@syzkaller.appspotmail.com
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20200505154324.3226743-2-hch@lst.de
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
When reserved transaction handle is unused, we subtract its reserved
credits in __jbd2_journal_unreserve_handle() called from
jbd2_journal_stop(). However this function forgets to remove reserved
credits from transaction->t_outstanding_credits and thus the transaction
space that was reserved remains effectively leaked. The leaked
transaction space can be quite significant in some cases and leads to
unnecessarily small transactions and thus reducing throughput of the
journalling machinery. E.g. fsmark workload creating lots of 4k files
was observed to have about 20% lower throughput due to this when ext4 is
mounted with dioread_nolock mount option.
Subtract reserved credits from t_outstanding_credits as well.
CC: stable@vger.kernel.org
Fixes: 8f7d89f368 ("jbd2: transaction reservation support")
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20200520133119.1383-3-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Remove ext4_journal_free_reserved() function. It is never used.
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/20200520133119.1383-2-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Currently while doing block allocation grp->bb_free may be getting
modified if discard is happening in parallel.
For e.g. consider a case where there are lot of threads who have
preallocated lot of blocks and there is a thread which is trying
to discard all of this group's PA. Now it could happen that
we see all of those group's bb_free is zero and fail the allocation
while there is sufficient space if we free up all the PA.
So this patch adds another flag "EXT4_MB_STRICT_CHECK" which will be set
if we are unable to allocate any blocks in the first try (since we may
not have considered blocks about to be discarded from PA lists).
So during retry attempt to allocate blocks we will use ext4_lock_group()
for checking if the group is good or not.
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/9cb740a117c958c36596f167b12af1beae9a68b7.1589955723.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
ext4_mb_good_group() definition was changed some time back
and now it even initializes the buddy cache (via ext4_mb_init_group()),
if in case the EXT4_MB_GRP_NEED_INIT() is true for a group.
Note that ext4_mb_init_group() could sleep and so should not be called
under a spinlock held.
This is fine as of now because ext4_mb_good_group() is called before
loading the buddy bitmap without ext4_lock_group() held
and again called after loading the bitmap, only this time with
ext4_lock_group() held.
But still this whole thing is confusing.
So this patch refactors out ext4_mb_good_group_nolock() which should be
called when without holding ext4_lock_group().
Also in further patches we hold the spinlock (ext4_lock_group()) while
doing any calculations which involves grp->bb_free or grp->bb_fragments.
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/d9f7d031a5fbe1c943fae6bf1ff5cdf0604ae722.1589955723.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
There could be a race in function ext4_mb_discard_group_preallocations()
where the 1st thread may iterate through group's bb_prealloc_list and
remove all the PAs and add to function's local list head.
Now if the 2nd thread comes in to discard the group preallocations,
it will see that the group->bb_prealloc_list is empty and will return 0.
Consider for a case where we have less number of groups
(for e.g. just group 0),
this may even return an -ENOSPC error from ext4_mb_new_blocks()
(where we call for ext4_mb_discard_group_preallocations()).
But that is wrong, since 2nd thread should have waited for 1st thread
to release all the PAs and should have retried for allocation.
Since 1st thread was anyway going to discard the PAs.
The algorithm using this percpu seq counter goes below:
1. We sample the percpu discard_pa_seq counter before trying for block
allocation in ext4_mb_new_blocks().
2. We increment this percpu discard_pa_seq counter when we either allocate
or free these blocks i.e. while marking those blocks as used/free in
mb_mark_used()/mb_free_blocks().
3. We also increment this percpu seq counter when we successfully identify
that the bb_prealloc_list is not empty and hence proceed for discarding
of those PAs inside ext4_mb_discard_group_preallocations().
Now to make sure that the regular fast path of block allocation is not
affected, as a small optimization we only sample the percpu seq counter
on that cpu. Only when the block allocation fails and when freed blocks
found were 0, that is when we sample percpu seq counter for all cpus using
below function ext4_get_discard_pa_seq_sum(). This happens after making
sure that all the PAs on grp->bb_prealloc_list got freed or if it's empty.
It can be well argued that why don't just check for grp->bb_free to
see if there are any free blocks to be allocated. So here are the two
concerns which were discussed:-
1. If for some reason the blocks available in the group are not
appropriate for allocation logic (say for e.g.
EXT4_MB_HINT_GOAL_ONLY, although this is not yet implemented), then
the retry logic may result into infinte looping since grp->bb_free is
non-zero.
2. Also before preallocation was clubbed with block allocation with the
same ext4_lock_group() held, there were lot of races where grp->bb_free
could not be reliably relied upon.
Due to above, this patch considers discard_pa_seq logic to determine if
we should retry for block allocation. Say if there are are n threads
trying for block allocation and none of those could allocate or discard
any of the blocks, then all of those n threads will fail the block
allocation and return -ENOSPC error. (Since the seq counter for all of
those will match as no block allocation/discard was done during that
duration).
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/7f254686903b87c419d798742fd9a1be34f0657b.1589955723.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Implement ext4_mb_discard_preallocations_should_retry()
which we will need in later patches to add more logic
like check for sequence number match to see if we should
retry for block allocation or not.
There should be no functionality change in this patch.
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/1cfae0098d2aa9afbeb59331401258182868c8f2.1589955723.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
ext4_mb_discard_preallocations() only checks for grp->bb_prealloc_list
of every group to discard the group's PA to free up the space if
allocation request fails. Consider below race:-
Process A Process B
1. allocate blocks
1. Fails block allocation from
ext4_mb_regular_allocator()
ext4_lock_group()
allocated blocks
more than ac_o_ex.fe_len
ext4_unlock_group()
2. Scans the
grp->bb_prealloc_list (under
ext4_lock_group()) and
find nothing and thus return
-ENOSPC.
2. Add the additional blocks to PA list
ext4_lock_group()
add blocks to grp->bb_prealloc_list
ext4_unlock_group()
Above race could be avoided if we add those additional blocks to
grp->bb_prealloc_list at the same time with block allocation when
ext4_lock_group() was still held.
With this discard-PA will know if there are actually any blocks which
could be freed from the PA
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/a2217dd782585b42328981832e6d396abaaccb80.1589955723.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
No one currently needs EXT4_INODE_CASEFOLD, but add it to keep the
EXT4_INODE_* definitions in sync with the EXT4_*_FL definitions.
Also make it clearer that the casefold flag is only for directories.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20200510215252.87833-1-ebiggers@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
The path performing block allocations in ext4_ext_map_blocks() contains
code trimming the length of a new extent that is repeated later
in the function. This code is both redundant and unnecessary as the
exact length of the new extent has already been calculated. Rewrite the
instantiation of the map struct in this case to use the available
values, avoiding the overhead of unnecessary conversions and improving
clarity. Add another map struct instantiation tailored specifically to
the separate case for an existing written extent. Remove an old comment
that no longer appears applicable to the current code.
Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Link: https://lore.kernel.org/r/20200510155805.18808-1-enwlinux@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
ext_debug() msgs could be helpful, provided those could be enabled
without recompiling kernel and also if we could selectively enable
only required prints for case by case debugging.
So make ext_debug() implementation use pr_debug().
Also change ext_debug() to be defined with CONFIG_EXT4_DEBUG.
So EXT_DEBUG macro now mostly remain for below 3 functions.
ext4_ext_show_path/leaf/move() (whose print msgs use ext_debug()
which again could be dynamically enabled using pr_debug())
This also changes the ext_debug() to take inode as a parameter
to add inode no. in all of it's msgs.
Prints additional info like process name / pid, superblock id etc.
This also removes any explicit function names passed in ext_debug().
Since ext_debug() on it's own prints file, func and line no.
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/d31dc189b0aeda9384fe7665e36da7cd8c61571f.1589086800.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
mb_debug() msg had only 1 control level for all type of msgs.
And if we enable mballoc_debug then all of those msgs would be enabled.
Instead of adding multiple debug levels for mb_debug() msgs, use
pr_debug() with which we could have finer control to print msgs at all
of different levels (i.e. at file, func, line no.).
Also add process name/pid, superblk id, and other info in mb_debug()
msg. This also kills the mballoc_debug module parameter, since it is
not needed any more.
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/f0c660cbde9e2edbe95c67942ca9ad80dd2231eb.1589086800.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
ext4_map_blocks() has ext_debug msg early at the start of function.
We also get ext_debug msg if we could allocate a block from
ext4_ext_map_blocks(). But there is no ext_debug() msg in case of
block allocation failure. So add one along with error code.
Also add more info in ext_debug() msg like how many blocks were allocated
v/s how many were requested in ext4_ext_map_blocks().
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/1610ec2aa932396be00f9d552fe29da473ead176.1589086800.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Make sure to check for e4b->bd_info->bb_bitmap == NULL, in
mb_cmp_bitmaps() and return if NULL, to avoid possible NULL ptr
dereference. Similar to how we do this in other ifdef DOUBLE_CHECK
functions.
Also remove the BUG_ON() logic if kmalloc() or ext4_read_block_bitmap()
fails. We should simply mark grp->bb_bitmap as NULL if above happens.
In fact ext4_read_block_bitmap() may even return an error in case of resize
ioctl. Hence remove this BUG_ON logic (fstests ext4/032 may trigger
this).
Link: https://lore.kernel.org/r/9a54f8a696ff17c057cd571be3d15ac3ec1407f1.1589086800.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
This patch implemets mb_group_bb_bitmap_alloc() and
mb_group_bb_bitmap_free() function to remove #ifdef DOUBLE_CHECK macro
and it's related code from inside
ext4_mb_add_groupinfo()/ext4_mb_release().
There should be no functionality change in this patch.
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/8c2095d74b779f0254a19b24982490dc6f07c4f9.1589086800.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
This patch adds some more debugging mb_debug() msgs to help improve
mballoc code debugging.
Other than adding more mb_debug() msgs at few more places,
there should be no other functionality change in this patch.
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/5fc8e7788b924e211fcfa4a4c1d2f8503511661a.1589086800.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
We can't fail in the truncate path without requiring an fsck.
Add work around for this by using a combination of retry loops
and the __GFP_NOFAIL flag.
From: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Anna Pendleton <pendleton@google.com>
Reviewed-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20200507175028.15061-1-pendleton@google.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
'igrab(d_inode(dentry->d_parent))' without holding dentry->d_lock is
broken because without d_lock, d_parent can be concurrently changed due
to a rename(). Then if the old directory is immediately deleted, old
d_parent->inode can be NULL. That causes a NULL dereference in igrab().
To fix this, use dget_parent() to safely grab a reference to the parent
dentry, which pins the inode. This also eliminates the need to use
d_find_any_alias() other than for the initial inode, as we no longer
throw away the dentry at each step.
This is an extremely hard race to hit, but it is possible. Adding a
udelay() in between the reads of ->d_parent and its ->d_inode makes it
reproducible on a no-journal filesystem using the following program:
#include <fcntl.h>
#include <unistd.h>
int main()
{
if (fork()) {
for (;;) {
mkdir("dir1", 0700);
int fd = open("dir1/file", O_RDWR|O_CREAT|O_SYNC);
write(fd, "X", 1);
close(fd);
}
} else {
mkdir("dir2", 0700);
for (;;) {
rename("dir1/file", "dir2/file");
rmdir("dir1");
}
}
}
Fixes: d59729f4e7 ("ext4: fix races in ext4_sync_parent()")
Cc: stable@vger.kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20200506183140.541194-1-ebiggers@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
If ext4_ext_convert_to_initialized() fails when called within
ext4_ext_handle_unwritten_extents(), immediately error out through the
exit point at function end. Fix the error handling in the event
ext4_ext_convert_to_initialized() returns 0, which it shouldn't do when
converting an existing extent. The current code returns the passed in
value of allocated (which is likely non-zero) while failing to set
m_flags, m_pblk, and m_len.
Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Link: https://lore.kernel.org/r/20200430185320.23001-5-enwlinux@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
If the call to ext4_split_convert_extents() fails in the
EXT4_GET_BLOCKS_PRE_IO case within ext4_ext_handle_unwritten_extents(),
error out through the exit point at function end rather than jumping
through an intermediate point. Fix the error handling in the event
ext4_split_convert_extents() returns 0, which it shouldn't do when
splitting an existing extent. The current code returns the passed in
value of allocated (which is likely non-zero) while failing to set
m_flags, m_pblk, and m_len.
Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Link: https://lore.kernel.org/r/20200430185320.23001-4-enwlinux@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Remove the redundant code assigning values to ext4_map_blocks components
in ext4_ext_handle_unwritten_extents() for the EXT4_GET_BLOCKS_CONVERT
case, using the code at the function exit instead. Clean up and reorder
that code to eliminate more redundancy and improve readability.
Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Link: https://lore.kernel.org/r/20200430185320.23001-3-enwlinux@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
There's no call to ext4_map_blocks() in the current ext4 code with a
flags argument that combines EXT4_GET_BLOCKS_CONVERT and
EXT4_GET_BLOCKS_ZERO. Remove the code that corresponds to this case
from ext4_ext_handle_unwritten_extents().
Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/20200430185320.23001-2-enwlinux@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Don't ignore return values from ext4_ext_dirty, since the errors
indicate valid failures below Ext4. In all of the other instances of
ext4_ext_dirty calls, the error return value is handled in some
way. This patch makes those remaining couple of places to handle
ext4_ext_dirty errors as well. In case of ext4_split_extent_at(), the
ignorance of return value is intentional. The reason is that we are
already in error path and there isn't much we can do if ext4_ext_dirty
returns error. This patch adds a comment for that case explaining why
we ignore the return value.
In the longer run, we probably should
make sure that errors from other mark_dirty routines are handled as
well.
Ran gce-xfstests smoke tests and verified that there were no
regressions.
Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20200427013438.219117-2-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
ext4_mark_inode_dirty() can fail for real reasons. Ignoring its return
value may lead ext4 to ignore real failures that would result in
corruption / crashes. Harden ext4_mark_inode_dirty error paths to fail
as soon as possible and return errors to the caller whenever
appropriate.
One of the possible scnearios when this bug could affected is that
while creating a new inode, its directory entry gets added
successfully but while writing the inode itself mark_inode_dirty
returns error which is ignored. This would result in inconsistency
that the directory entry points to a non-existent inode.
Ran gce-xfstests smoke tests and verified that there were no
regressions.
Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20200427013438.219117-1-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Don't pass error pointers to brelse().
commit 7159a986b4 ("ext4: fix some error pointer dereferences") has fixed
some cases, fix the remaining one case.
Once ext4_xattr_block_find()->ext4_sb_bread() failed, error pointer is
stored in @bs->bh, which will be passed to brelse() in the cleanup
routine of ext4_xattr_set_handle(). This will then cause a NULL panic
crash in __brelse().
BUG: unable to handle kernel NULL pointer dereference at 000000000000005b
RIP: 0010:__brelse+0x1b/0x50
Call Trace:
ext4_xattr_set_handle+0x163/0x5d0
ext4_xattr_set+0x95/0x110
__vfs_setxattr+0x6b/0x80
__vfs_setxattr_noperm+0x68/0x1b0
vfs_setxattr+0xa0/0xb0
setxattr+0x12c/0x1a0
path_setxattr+0x8d/0xc0
__x64_sys_setxattr+0x27/0x30
do_syscall_64+0x60/0x250
entry_SYSCALL_64_after_hwframe+0x49/0xbe
In this case, @bs->bh stores '-EIO' actually.
Fixes: fb265c9cb4 ("ext4: add ext4_sb_bread() to disambiguate ENOMEM cases")
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: stable@kernel.org # 2.6.19
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/1587628004-95123-1-git-send-email-jefflexu@linux.alibaba.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
When we are evicting inode with journalled data, we may race with
transaction commit in the following way:
CPU0 CPU1
jbd2_journal_commit_transaction() evict(inode)
inode_io_list_del()
inode_wait_for_writeback()
process BJ_Forget list
__jbd2_journal_insert_checkpoint()
__jbd2_journal_refile_buffer()
__jbd2_journal_unfile_buffer()
if (test_clear_buffer_jbddirty(bh))
mark_buffer_dirty(bh)
__mark_inode_dirty(inode)
ext4_evict_inode(inode)
frees the inode
This results in use-after-free issues in the writeback code (or
the assertion added in the previous commit triggering).
Fix the problem by removing inode from writeback lists once all the page
cache is evicted and so inode cannot be added to writeback lists again.
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20200421085445.5731-4-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Ext4 needs to remove inode from writeback lists after it is out of
visibility of its journalling machinery (which can still dirty the
inode). Export inode_io_list_del() for it.
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20200421085445.5731-3-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
ext4_orphan_get() invokes ext4_read_inode_bitmap(), which returns a
reference of the specified buffer_head object to "bitmap_bh" with
increased refcnt.
When ext4_orphan_get() returns, local variable "bitmap_bh" becomes
invalid, so the refcount should be decreased to keep refcount balanced.
The reference counting issue happens in one exception handling path of
ext4_orphan_get(). When ext4_iget() fails, the function forgets to
decrease the refcnt increased by ext4_read_inode_bitmap(), causing a
refcnt leak.
Fix this issue by calling brelse() when ext4_iget() fails.
Signed-off-by: Xiyu Yang <xiyuyang19@fudan.edu.cn>
Signed-off-by: Xin Tan <tanxin.ctf@gmail.com>
Cc: stable@kernel.org
Link: https://lore.kernel.org/r/1587618568-13418-1-git-send-email-xiyuyang19@fudan.edu.cn
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
If eh->eh_max is 0, EXT_MAX_EXTENT/INDEX would evaluate to unsigned
(-1) resulting in illegal memory accesses. Although there is no
consistent repro, we see that generic/019 sometimes crashes because of
this bug.
Ran gce-xfstests smoke and verified that there were no regressions.
Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20200421023959.20879-2-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Fix the following coccicheck warning:
fs/ext4/extents_status.c:1057:5-28: WARNING: Comparison to bool
fs/ext4/inode.c:2314:18-24: WARNING: Comparison to bool
Signed-off-by: Jason Yan <yanaijie@huawei.com>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/20200420042918.19459-1-yanaijie@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
The eofblocks code was removed in the 5.7 release by "ext4: remove
EOFBLOCKS_FL and associated code" (4337ecd1fe). The ext4_map_blocks()
flag used to trigger it can now be removed as well.
Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20200415203140.30349-2-enwlinux@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
In a 32-bit program, running on arm64 architecture. When the address
space below mmap base is completely exhausted, shmat() for huge pages will
return ENOMEM, but shmat() for normal pages can still success on no-legacy
mode. This seems not fair.
For normal pages, the calling trace of get_unmapped_area() is:
=> mm->get_unmapped_area()
if on legacy mode,
=> arch_get_unmapped_area()
=> vm_unmapped_area()
if on no-legacy mode,
=> arch_get_unmapped_area_topdown()
=> vm_unmapped_area()
For huge pages, the calling trace of get_unmapped_area() is:
=> file->f_op->get_unmapped_area()
=> hugetlb_get_unmapped_area()
=> vm_unmapped_area()
To solve this issue, we only need to make hugetlb_get_unmapped_area() take
the same way as mm->get_unmapped_area(). Add *bottomup() and *topdown()
for hugetlbfs, and check current mm->get_unmapped_area() to decide which
one to use. If mm->get_unmapped_area is equal to
arch_get_unmapped_area_topdown(), hugetlb_get_unmapped_area() calls
topdown routine, otherwise calls bottomup routine.
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Shijie Hu <hushijie3@huawei.com>
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Will Deacon <will@kernel.org>
Cc: Xiaoming Ni <nixiaoming@huawei.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: yangerkun <yangerkun@huawei.com>
Cc: ChenGang <cg.chen@huawei.com>
Cc: Chen Jie <chenjie6@huawei.com>
Link: http://lkml.kernel.org/r/20200518065338.113664-1-hushijie3@huawei.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
They're the same function, and for the purpose of all callers they are
equivalent to lru_cache_add().
[akpm@linux-foundation.org: fix it for local_lock changes]
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Rik van Riel <riel@surriel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Link: http://lkml.kernel.org/r/20200520232525.798933-5-hannes@cmpxchg.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull networking updates from David Miller:
1) Allow setting bluetooth L2CAP modes via socket option, from Luiz
Augusto von Dentz.
2) Add GSO partial support to igc, from Sasha Neftin.
3) Several cleanups and improvements to r8169 from Heiner Kallweit.
4) Add IF_OPER_TESTING link state and use it when ethtool triggers a
device self-test. From Andrew Lunn.
5) Start moving away from custom driver versions, use the globally
defined kernel version instead, from Leon Romanovsky.
6) Support GRO vis gro_cells in DSA layer, from Alexander Lobakin.
7) Allow hard IRQ deferral during NAPI, from Eric Dumazet.
8) Add sriov and vf support to hinic, from Luo bin.
9) Support Media Redundancy Protocol (MRP) in the bridging code, from
Horatiu Vultur.
10) Support netmap in the nft_nat code, from Pablo Neira Ayuso.
11) Allow UDPv6 encapsulation of ESP in the ipsec code, from Sabrina
Dubroca. Also add ipv6 support for espintcp.
12) Lots of ReST conversions of the networking documentation, from Mauro
Carvalho Chehab.
13) Support configuration of ethtool rxnfc flows in bcmgenet driver,
from Doug Berger.
14) Allow to dump cgroup id and filter by it in inet_diag code, from
Dmitry Yakunin.
15) Add infrastructure to export netlink attribute policies to
userspace, from Johannes Berg.
16) Several optimizations to sch_fq scheduler, from Eric Dumazet.
17) Fallback to the default qdisc if qdisc init fails because otherwise
a packet scheduler init failure will make a device inoperative. From
Jesper Dangaard Brouer.
18) Several RISCV bpf jit optimizations, from Luke Nelson.
19) Correct the return type of the ->ndo_start_xmit() method in several
drivers, it's netdev_tx_t but many drivers were using
'int'. From Yunjian Wang.
20) Add an ethtool interface for PHY master/slave config, from Oleksij
Rempel.
21) Add BPF iterators, from Yonghang Song.
22) Add cable test infrastructure, including ethool interfaces, from
Andrew Lunn. Marvell PHY driver is the first to support this
facility.
23) Remove zero-length arrays all over, from Gustavo A. R. Silva.
24) Calculate and maintain an explicit frame size in XDP, from Jesper
Dangaard Brouer.
25) Add CAP_BPF, from Alexei Starovoitov.
26) Support terse dumps in the packet scheduler, from Vlad Buslov.
27) Support XDP_TX bulking in dpaa2 driver, from Ioana Ciornei.
28) Add devm_register_netdev(), from Bartosz Golaszewski.
29) Minimize qdisc resets, from Cong Wang.
30) Get rid of kernel_getsockopt and kernel_setsockopt in order to
eliminate set_fs/get_fs calls. From Christoph Hellwig.
* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2517 commits)
selftests: net: ip_defrag: ignore EPERM
net_failover: fixed rollback in net_failover_open()
Revert "tipc: Fix potential tipc_aead refcnt leak in tipc_crypto_rcv"
Revert "tipc: Fix potential tipc_node refcnt leak in tipc_rcv"
vmxnet3: allow rx flow hash ops only when rss is enabled
hinic: add set_channels ethtool_ops support
selftests/bpf: Add a default $(CXX) value
tools/bpf: Don't use $(COMPILE.c)
bpf, selftests: Use bpf_probe_read_kernel
s390/bpf: Use bcr 0,%0 as tail call nop filler
s390/bpf: Maintain 8-byte stack alignment
selftests/bpf: Fix verifier test
selftests/bpf: Fix sample_cnt shared between two threads
bpf, selftests: Adapt cls_redirect to call csum_level helper
bpf: Add csum_level helper for fixing up csum levels
bpf: Fix up bpf_skb_adjust_room helper's skb csum setting
sfc: add missing annotation for efx_ef10_try_update_nic_stats_vf()
crypto/chtls: IPv6 support for inline TLS
Crypto/chcr: Fixes a coccinile check error
Crypto/chcr: Fixes compilations warnings
...
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCXtYhfgAKCRCRxhvAZXjc
oghSAP9uVX3vxYtEtNvu9WtEn1uYZcSKZoF1YrcgY7UfSmna0gEAruzyZcai4CJL
WKv+4aRq2oYk+hsqZDycAxIsEgWvNg8=
=ZWj3
-----END PGP SIGNATURE-----
Merge tag 'threads-v5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux
Pull thread updates from Christian Brauner:
"We have been discussing using pidfds to attach to namespaces for quite
a while and the patches have in one form or another already existed
for about a year. But I wanted to wait to see how the general api
would be received and adopted.
This contains the changes to make it possible to use pidfds to attach
to the namespaces of a process, i.e. they can be passed as the first
argument to the setns() syscall.
When only a single namespace type is specified the semantics are
equivalent to passing an nsfd. That means setns(nsfd, CLONE_NEWNET)
equals setns(pidfd, CLONE_NEWNET).
However, when a pidfd is passed, multiple namespace flags can be
specified in the second setns() argument and setns() will attach the
caller to all the specified namespaces all at once or to none of them.
Specifying 0 is not valid together with a pidfd. Here are just two
obvious examples:
setns(pidfd, CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWNET);
setns(pidfd, CLONE_NEWUSER);
Allowing to also attach subsets of namespaces supports various
use-cases where callers setns to a subset of namespaces to retain
privilege, perform an action and then re-attach another subset of
namespaces.
Apart from significantly reducing the number of syscalls needed to
attach to all currently supported namespaces (eight "open+setns"
sequences vs just a single "setns()"), this also allows atomic setns
to a set of namespaces, i.e. either attaching to all namespaces
succeeds or we fail without having changed anything.
This is centered around a new internal struct nsset which holds all
information necessary for a task to switch to a new set of namespaces
atomically. Fwiw, with this change a pidfd becomes the only token
needed to interact with a container. I'm expecting this to be
picked-up by util-linux for nsenter rather soon.
Associated with this change is a shiny new test-suite dedicated to
setns() (for pidfds and nsfds alike)"
* tag 'threads-v5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
selftests/pidfd: add pidfd setns tests
nsproxy: attach to namespaces via pidfds
nsproxy: add struct nsset
This can only happen if there's a bug somewhere, so let's make it a WARN
not a printk. Also, I think it's safest to ignore the corruption rather
than trying to fix it by removing a cache entry.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Negative dentries of upper layer are useless after construction of
overlayfs' own dentry and may keep in the memory long time even after
unmount of overlayfs instance. This patch tries to drop unnecessary
negative dentry of upper layer to effectively reclaim memory.
Signed-off-by: Chengguang Xu <cgxu519@mykernel.net>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Call inode_permission() on real inode before opening regular file on one of
the underlying layers.
In some cases ovl_permission() already checks access to an underlying file,
but it misses the metacopy case, and possibly other ones as well.
Removing the redundant permission check from ovl_permission() should be
considered later.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Verify LSM permissions for underlying file, since vfs_ioctl() doesn't do
it.
[Stephen Rothwell] export security_file_ioctl
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
- Convert to use the new mount apis;
- Some random cleanup patches.
-----BEGIN PGP SIGNATURE-----
iIsEABYIADMWIQThPAmQN9sSA0DVxtI5NzHcH7XmBAUCXtbfOhUcaHNpYW5na2Fv
QHJlZGhhdC5jb20ACgkQOTcx3B+15gTvZgD6Ap8mYxRaW7Qta+HEyFuyRrxWZ/XZ
pq/hYiouGosDdaMBAOUNl8pGlPX54T+Y9VZv0wV0Dp4pan6NApdgtL9fIQUE
=QhQh
-----END PGP SIGNATURE-----
Merge tag 'erofs-for-5.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs
Pull erofs updates from Gao Xiang:
"The most interesting part is the new mount api conversion, which is
actually a old patch already pending for several cycles. And the
others are recent trivial cleanups here.
Summary:
- Convert to use the new mount apis
- Some random cleanup patches"
* tag 'erofs-for-5.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
erofs: suppress false positive last_block warning
erofs: convert to use the new mount fs_context api
erofs: code cleanup by removing ifdef macro surrounding
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl7U50AACgkQxWXV+ddt
WDtK1g//RXeNsTguYQr1N9R5eUPThjLEI0+4J0l4SYfCPU8Ou3C7nqpOEJJQgm8F
ezZE+16cWi9U5uGueOc+w0rfyz4AuIXKgzoz+c0/GG2+yV5jp6DsAMbWqojAb96L
V/N3HxEzR66jqwgVUBE/x5okb2SyY7//B1l/O0amc66XDO7KTMImpIwThere6zWZ
o2SNpYpHAPQeUYJQx8h+FAW3w1CxrCZmnifazU9Jqe9J7QeQLg7rbUlJDV38jySm
ZOA8ohKN9U1gPZy+dTU3kdyyuBIq1etkIaSPJANyTo5TczPKiC0IMg75cXtS4ae/
NSxhccMpSIjVMcIHARzSFGYKNP3sGNRsmaTUg/2Cx/9GoHOhYMiCAVc8qtBBpwJO
UI0siexrCe64RuTBMRRc128GdFv7IjmSImcdi8xaR62bCcUiNdEa3zvjRe/9tOEH
ET7Z85oBnKpSzpC3MdhSUU4dtHY5XLawP8z3oUU1VSzSWM2DVjlHf79/VzbOfp18
miCVpt94lCn/gUX7el6qcnbuvMAjDyeC6HmfD+TwzQgGwyV6TLgKN9lRXeH/Oy6/
VgjGQSavGHMll3zIGURmrBCXKudjJg0J+IP4wN1TimmSEMfwKH+7tnekQd8y5qlF
eXEIqlWNykKeDzEnmV9QJy+/cV83hVWM/mUslcTx39tLN/3B/Us=
=qTt8
-----END PGP SIGNATURE-----
Merge tag 'for-5.8-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"Highlights:
- speedup dead root detection during orphan cleanup, eg. when there
are many deleted subvolumes waiting to be cleaned, the trees are
now looked up in radix tree instead of a O(N^2) search
- snapshot creation with inherited qgroup will mark the qgroup
inconsistent, requires a rescan
- send will emit file capabilities after chown, this produces a
stream that does not need postprocessing to set the capabilities
again
- direct io ported to iomap infrastructure, cleaned up and simplified
code, notably removing last use of struct buffer_head in btrfs code
Core changes:
- factor out backreference iteration, to be used by ordinary
backreferences and relocation code
- improved global block reserve utilization
* better logic to serialize requests
* increased maximum available for unlink
* improved handling on large pages (64K)
- direct io cleanups and fixes
* simplify layering, where cloned bios were unnecessarily created
for some cases
* error handling fixes (submit, endio)
* remove repair worker thread, used to avoid deadlocks during
repair
- refactored block group reading code, preparatory work for new type
of block group storage that should improve mount time on large
filesystems
Cleanups:
- cleaned up (and slightly sped up) set/get helpers for metadata data
structure members
- root bit REF_COWS got renamed to SHAREABLE to reflect the that the
blocks of the tree get shared either among subvolumes or with the
relocation trees
Fixes:
- when subvolume deletion fails due to ENOSPC, the filesystem is not
turned read-only
- device scan deals with devices from other filesystems that changed
ownership due to overwrite (mkfs)
- fix a race between scrub and block group removal/allocation
- fix long standing bug of a runaway balance operation, printing the
same line to the syslog, caused by a stale status bit on a reloc
tree that prevented progress
- fix corrupt log due to concurrent fsync of inodes with shared
extents
- fix space underflow for NODATACOW and buffered writes when it for
some reason needs to fallback to COW mode"
* tag 'for-5.8-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (133 commits)
btrfs: fix space_info bytes_may_use underflow during space cache writeout
btrfs: fix space_info bytes_may_use underflow after nocow buffered write
btrfs: fix wrong file range cleanup after an error filling dealloc range
btrfs: remove redundant local variable in read_block_for_search
btrfs: open code key_search
btrfs: split btrfs_direct_IO to read and write part
btrfs: remove BTRFS_INODE_READDIO_NEED_LOCK
fs: remove dio_end_io()
btrfs: switch to iomap_dio_rw() for dio
iomap: remove lockdep_assert_held()
iomap: add a filesystem hook for direct I/O bio submission
fs: export generic_file_buffered_read()
btrfs: turn space cache writeout failure messages into debug messages
btrfs: include error on messages about failure to write space/inode caches
btrfs: remove useless 'fail_unlock' label from btrfs_csum_file_blocks()
btrfs: do not ignore error from btrfs_next_leaf() when inserting checksums
btrfs: make checksum item extension more efficient
btrfs: fix corrupt log due to concurrent fsync of inodes with shared extents
btrfs: unexport btrfs_compress_set_level()
btrfs: simplify iget helpers
...
- Introduce DONTCACHE flags for dentries and inodes. This hint will
cause the VFS to drop the associated objects immediately after the
last put, so that we can change the file access mode (DAX or page
cache) on the fly.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAl68FowACgkQ+H93GTRK
tOtNzA/9FkXXQYAlTWK/toHfJV8DQT/Kx1fvf8Ng0EphBUQa/rNzlcMzFg7Gw5Cs
Rzis96+xj4q//iseLZN5LLxaoxqT2Qipza0GWCMJpQG/4wTWM0Ar7BnG/Vc87lUV
F0mXnILZOUMFzr8Zj9q4ka6UGRTDSXXtwNXqBuPpIZyVbMQvPtXHhM3lWV5RUQwm
fznBxDAEGoVXiyID2OrZD5tS4BMd16uFWAWLjWphpcy18zfC7zp0+0MQik4v/9oi
54pZdtPT9/dQOu/BI8tfLP45XzZ6f++gXy2p/G96dy7ism1u40ML77ojEkadVVFe
Bf7t+EswNxrx/em/ugWbcJDtrxttSqU47g2AXsbJJB2+aHCih6Cfid41lMyRvlhR
d4cumoteX7IF/PpT3YaKHWQBo5OxHK0a2CBPd6czrCBw5yXrEUagdmw1XQ//bw5e
FRCg4eMcEW0UgINvBCHWdWRx6VaL8ngMMsflVJ/lY7FeVvM10ZYRFzJoryoebSPm
/yWcoHFsTPC8K0nWVmbwPazVE19I0g4y6Wiw39YvZDzZRzM9PcQI4DBxQcab+Va/
FPfXEXkpz0GiC6zjs/QfkPtg60GI1IG5Um4JUzdv6ce1P0p1rGcu5WiNYearahE7
7V/44WGIEAd4NP7R0JPTI0Fqv7v6uuDzMoCp7YDn8gE4FCJTt6M=
=ebl3
-----END PGP SIGNATURE-----
Merge tag 'vfs-5.8-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull DAX updates part two from Darrick Wong:
"This time around, we're hoisting the DONTCACHE flag from XFS into the
VFS so that we can make the incore DAX mode changes become effective
sooner.
We can't change the file data access mode on a live inode because we
don't have a safe way to change the file ops pointers. The incore
state change becomes effective at inode loading time, which can happen
if the inode is evicted. Therefore, we're making it so that
filesystems can ask the VFS to evict the inode as soon as the last
holder drops.
The per-fs changes to make this call this will be in subsequent pull
requests from Ted and myself.
Summary:
- Introduce DONTCACHE flags for dentries and inodes. This hint will
cause the VFS to drop the associated objects immediately after the
last put, so that we can change the file access mode (DAX or page
cache) on the fly"
* tag 'vfs-5.8-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
fs: Introduce DCACHE_DONTCACHE
fs: Lift XFS_IDONTCACHE to the VFS layer
- Clean up io_is_direct.
- Add a new statx flag to indicate when file data access is being done
via DAX (as opposed to the page cache).
- Update the documentation for how system administrators and application
programmers can take advantage of the (still experimental DAX) feature.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAl6wO7UACgkQ+H93GTRK
tOtEaw//eShC2YE0S+GS7ihQ3x71PJa4Is0VZOIpTHl01aqMSegwB3QbDQbVUhn1
TzLhUw4pZsz3R9GbUOrfHOYRt+aSP2t0WNhIulDeBp41CYJQSaFt85KnfM9hoBOi
VYssum3Lu7/6ReKrDD/mumzWYkts+JDCuXRmt7nOQeZJVNXOCBBbvN354V4/IKLY
wB4Wnaq3f3gYniXYW/23aCX+kocaOIUZtK6aFKyeD0KvfP5toDlpw1cBVMoM9CmO
bmEy8vKf4lgFZDLeDMqmWOecMgEH5h0baN5Psu13WuDCiCd6maBl0KpxVpVlwsep
yVz6mMbZjmLOJ2lqyw+lZb+XicD+K3yRVSTGKxV3VbuRjeX9tjVG5Im13VesNvJB
WWJq/CkOU8W0Zs7Q5RbUDGbFFWDJSI/OStAU+UeuWvL9Gndv7hqv6H904qbPPtEu
4m4Y34ARzrEaKpkABKKwQ53cLClNxmmgUN9N3cXK3mk8idlX4zM3j6+HJYUxXTO+
fBjhOlyUy2KaWmzZoJp28QvaU4iegGmMSuRnQ9HAvXmdxUA2K6+wjS6LCZGh04vz
z7SbzTBlo2kvsKdRMwJ306s2QA0/HvmKHHLI+p8OQANce9hjhE3XdJayzhitd0fk
k0D/y8OY+fbCSgI8C4g66lA8Zf2sos/ulD0QTTNBPfU2rRWkKUc=
=rS19
-----END PGP SIGNATURE-----
Merge tag 'vfs-5.8-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull DAX updates part one from Darrick Wong:
"After many years of LKML-wrangling about how to enable programs to
query and influence the file data access mode (DAX) when a filesystem
resides on storage devices such as persistent memory, Ira Weiny has
emerged with a proposed set of standard behaviors that has not been
shot down by anyone! We're more or less standardizing on the current
XFS behavior and adapting ext4 to do the same.
This is the first of a handful pull requests that will make ext4 and
XFS present a consistent interface for user programs that care about
DAX. We add a statx attribute that programs can check to see if DAX is
enabled on a particular file. Then, we update the DAX documentation to
spell out the user-visible behaviors that filesystems will guarantee
(until the next storage industry shakeup). The on-disk inode flag has
been in XFS for a few years now.
Summary:
- Clean up io_is_direct.
- Add a new statx flag to indicate when file data access is being
done via DAX (as opposed to the page cache).
- Update the documentation for how system administrators and
application programmers can take advantage of the (still
experimental DAX) feature"
Link: https://lore.kernel.org/lkml/20200505002016.1085071-1-ira.weiny@intel.com/
* tag 'vfs-5.8-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
Documentation/dax: Update Usage section
fs/stat: Define DAX statx attribute
fs: Remove unneeded IS_DAX() check in io_is_direct()
- Various cleanups to remove dead code, unnecessary conditionals,
asserts, etc.
- Fix a linker warning caused by xfs stuffing '-g' into CFLAGS
redundantly.
- Tighten up our dmesg logging to ensure that everything is prefixed
with 'XFS' for easier grepping.
- Kill a bunch of typedefs.
- Refactor the deferred ops code to reduce indirect function calls.
- Increase type-safety with the deferred ops code.
- Make the DAX mount options a tri-state.
- Fix some error handling problems in the inode flush code and clean up
other inode flush warts.
- Refactor log recovery so that each log item recovery functions now live
with the other log item processing code.
- Fix some SPDX forms.
- Fix quota counter corruption if the fs crashes after running
quotacheck but before any dquots get logged.
- Don't fail metadata verification on zero-entry attr leaf blocks, since
they're just part of the disk format now due to a historic lack of log
atomicity.
- Don't allow SWAPEXT between files with different [ugp]id when quotas
are enabled.
- Refactor inode fork reading and verification to run directly from the
inode-from-disk function. This means that we now actually guarantee
that _iget'ted inodes are totally verified and ready to go.
- Move the incore inode fork format and extent counts to the ifork
structure.
- Scalability improvements by reducing cacheline pingponging in
struct xfs_mount.
- More scalability improvements by removing m_active_trans from the
hot path.
- Fix inode counter update sanity checking to run /only/ on debug
kernels.
- Fix longstanding inconsistency in what error code we return when a
program hits project quota limits (ENOSPC).
- Fix group quota returning the wrong error code when a program hits
group quota limits.
- Fix per-type quota limits and grace periods for group and project
quotas so that they actually work.
- Allow extension of individual grace periods.
- Refactor the non-reclaim inode radix tree walking code to remove a
bunch of stupid little functions and straighten out the
inconsistent naming schemes.
- Fix a bug in speculative preallocation where we measured a new
allocation based on the last extent mapping in the file instead of
looking farther for the last contiguous space allocation.
- Force delalloc writes to unwritten extents. This closes a
stale disk contents exposure vector if the system goes down before
the write completes.
- More lockdep whackamole.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAl7OjhgACgkQ+H93GTRK
tOuGeBAApuP9ohtvrJT9FW7U+OrRsK3lw/3R+MEYpJu8GKLpGbJ6j+SKrTHxxLvu
Rp63YLIlHBOz2rNa4brm/wW8gGJIGXOnGpuiGq0Irl01xEmwqmjOLfLcYkYhno1E
i+rG0PiKYZeo/xhLtTKGl+NAwHHxmbOmxUtYHnbinHtPzDyYLQ0wff+oUkmQ7ydg
bMYFMXohoJ3Pc5UjmUrCuJj1cvYOUwl0P4LGKiq5Zud61AkBCSskEpk+oo5xFcEX
JJc1xkn5MPi+oGpSYqhnSZ6aSjwp53/i44O9volp5vCRXXv1eLVni2u/ScZ85L72
HXxoDyuZOUupirIfMBQFHsazDGPGyFIqtPhGlXoTJjrwX+ymimY6CU/0e+Xu9DEu
krlxajfUssH30zyG2q/2TaxslU35CROH6hVBXFe0Y5cEEsOIf2aOpErUhhw2YyS7
onN9gb2NBBQdYtHqIMwsbhcgq60g5H6JfGriB5dJimXXLmpuTfAREGCY2AqIoB1x
+8QFod0WwsMn6FYhi/UpZjC9qp/WTvojBUEt8Ci3ketUFwO1CLf9qm6Hj71RL3fs
fCEDHx/ZMMft7Bdbf36lICoMAhF/KfNcRn1PsQdpW4LY1Aml/7qjFNZthSVRDW+E
rhzNu+RIzGEQsSemBvccRaaTP3HFqN+qPATu2K0sALaa1LRFxzQ=
=/NYc
-----END PGP SIGNATURE-----
Merge tag 'xfs-5.8-merge-8' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs updates from Darrick Wong:
"Most of the changes this cycle are refactoring of existing code in
preparation for things landing in the future.
We also fixed various problems and deficiencies in the quota
implementation, and (I hope) the last of the stale read vectors by
forcing write allocations to go through the unwritten state until the
write completes.
Summary:
- Various cleanups to remove dead code, unnecessary conditionals,
asserts, etc.
- Fix a linker warning caused by xfs stuffing '-g' into CFLAGS
redundantly.
- Tighten up our dmesg logging to ensure that everything is prefixed
with 'XFS' for easier grepping.
- Kill a bunch of typedefs.
- Refactor the deferred ops code to reduce indirect function calls.
- Increase type-safety with the deferred ops code.
- Make the DAX mount options a tri-state.
- Fix some error handling problems in the inode flush code and clean
up other inode flush warts.
- Refactor log recovery so that each log item recovery functions now
live with the other log item processing code.
- Fix some SPDX forms.
- Fix quota counter corruption if the fs crashes after running
quotacheck but before any dquots get logged.
- Don't fail metadata verification on zero-entry attr leaf blocks,
since they're just part of the disk format now due to a historic
lack of log atomicity.
- Don't allow SWAPEXT between files with different [ugp]id when
quotas are enabled.
- Refactor inode fork reading and verification to run directly from
the inode-from-disk function. This means that we now actually
guarantee that _iget'ted inodes are totally verified and ready to
go.
- Move the incore inode fork format and extent counts to the ifork
structure.
- Scalability improvements by reducing cacheline pingponging in
struct xfs_mount.
- More scalability improvements by removing m_active_trans from the
hot path.
- Fix inode counter update sanity checking to run /only/ on debug
kernels.
- Fix longstanding inconsistency in what error code we return when a
program hits project quota limits (ENOSPC).
- Fix group quota returning the wrong error code when a program hits
group quota limits.
- Fix per-type quota limits and grace periods for group and project
quotas so that they actually work.
- Allow extension of individual grace periods.
- Refactor the non-reclaim inode radix tree walking code to remove a
bunch of stupid little functions and straighten out the
inconsistent naming schemes.
- Fix a bug in speculative preallocation where we measured a new
allocation based on the last extent mapping in the file instead of
looking farther for the last contiguous space allocation.
- Force delalloc writes to unwritten extents. This closes a stale
disk contents exposure vector if the system goes down before the
write completes.
- More lockdep whackamole"
* tag 'xfs-5.8-merge-8' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (129 commits)
xfs: more lockdep whackamole with kmem_alloc*
xfs: force writes to delalloc regions to unwritten
xfs: refactor xfs_iomap_prealloc_size
xfs: measure all contiguous previous extents for prealloc size
xfs: don't fail unwritten extent conversion on writeback due to edquot
xfs: rearrange xfs_inode_walk_ag parameters
xfs: straighten out all the naming around incore inode tree walks
xfs: move xfs_inode_ag_iterator to be closer to the perag walking code
xfs: use bool for done in xfs_inode_ag_walk
xfs: fix inode ag walk predicate function return values
xfs: refactor eofb matching into a single helper
xfs: remove __xfs_icache_free_eofblocks
xfs: remove flags argument from xfs_inode_ag_walk
xfs: remove xfs_inode_ag_iterator_flags
xfs: remove unused xfs_inode_ag_iterator function
xfs: replace open-coded XFS_ICI_NO_TAG
xfs: move eofblocks conversion function to xfs_ioctl.c
xfs: allow individual quota grace period extension
xfs: per-type quota timers and warn limits
xfs: switch xfs_get_defquota to take explicit type
...
A previous commit enabled this functionality, which also enabled O_PATH
to work correctly with io_uring. But we can't safely close the ring
itself, as the file handle isn't reference counted inside
io_uring_enter(). Instead of jumping through hoops to enable ring
closure, add a "soft" ->needs_file option, ->needs_file_no_error. This
enables O_PATH file descriptors to work, but still catches the case of
trying to close the ring itself.
Reported-by: Jann Horn <jannh@google.com>
Fixes: 904fbcb115 ("io_uring: remove 'fd is io_uring' from close path")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl7VP+kQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpuK4D/0XsSG/Yirbba1rrbqw/qpw9xcAs9oyN0tS
8SmmGN27ghrkVSsGBXNcG+PSTu3pkkLjYZ6TQtKamrya9G+lRAsKRsQ+Yq+7Qv4e
N6lCUlLJ99KqTMtwvIoxSpA1tz3ENHucOw2cJrw3kd9G0kil7GvDkIOBasd+kmwn
ak+mnMJZzRhqSM7M5lKQOk8l92gKBHGbPy4xKb0st3dQkYptDvit0KcNSAuevtOp
sRZpdbXaT3FA6xa5iEgggI6vZQGVmK1EaGoQqZ8vgVo75aovkjZyQWWiFVVOlEqr
QjUCCQuixcbMRbZjgpojqva5nmLhFVhLCfoSH2XgttEQZhmTwypdRwM2/IlxV5q2
xCofrDkhYOfIgHkuP6p68ukIPIfQ+4jotvsmXZ/HeD/xbx3TRyJRZadISr6wiuLm
7zRXWaGCYomUIPJOOrpBQ9FsCglkaN63oB6VGuGKTg3g7kE2QrZ2/aGuexP+FAdh
OrA8BlzxZzpqMKhjQVKOl9r6FU928MZn8nIAkMdQ/Ia1mOpb4rrPo4qCdf+tbhPO
pmKtQPQjbszQ3UfTgShvfvDk43BeRim1DxZPFTauSu1FMpqWBCwQgXMynPFrf5TR
HXF61G+jw5swDW6uJgW7bXdm7hHr15vRqQr54MgGS+T0OOa1df9MR0dJB5CGklfI
ycLU6AAT+A==
=A/qA
-----END PGP SIGNATURE-----
Merge tag 'for-5.8/io_uring-2020-06-01' of git://git.kernel.dk/linux-block
Pull io_uring updates from Jens Axboe:
"A relatively quiet round, mostly just fixes and code improvements. In
particular:
- Make statx just use the generic statx handler, instead of open
coding it. We don't need that anymore, as we always call it async
safe (Bijan)
- Enable closing of the ring itself. Also fixes O_PATH closure (me)
- Properly name completion members (me)
- Batch reap of dead file registrations (me)
- Allow IORING_OP_POLL with double waitqueues (me)
- Add tee(2) support (Pavel)
- Remove double off read (Pavel)
- Fix overflow cancellations (Pavel)
- Improve CQ timeouts (Pavel)
- Async defer drain fixes (Pavel)
- Add support for enabling/disabling notifications on a registered
eventfd (Stefano)
- Remove dead state parameter (Xiaoguang)
- Disable SQPOLL submit on dying ctx (Xiaoguang)
- Various code cleanups"
* tag 'for-5.8/io_uring-2020-06-01' of git://git.kernel.dk/linux-block: (29 commits)
io_uring: fix overflowed reqs cancellation
io_uring: off timeouts based only on completions
io_uring: move timeouts flushing to a helper
statx: hide interfaces no longer used by io_uring
io_uring: call statx directly
statx: allow system call to be invoked from io_uring
io_uring: add io_statx structure
io_uring: get rid of manual punting in io_close
io_uring: separate DRAIN flushing into a cold path
io_uring: don't re-read sqe->off in timeout_prep()
io_uring: simplify io_timeout locking
io_uring: fix flush req->refs underflow
io_uring: don't submit sqes when ctx->refs is dying
io_uring: async task poll trigger cleanup
io_uring: add tee(2) support
splice: export do_tee()
io_uring: don't repeat valid flag list
io_uring: rename io_file_put()
io_uring: remove req->needs_fixed_files
io_uring: cleanup io_poll_remove_one() logic
...
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl7VPc4QHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpgQkEACnQlzWOfNQMz1AzgUAv/S8IYDJCLrkbjLZ
JK4pJv8Hjhss/7sS+fd8kyKe9VtaZz2IjmrXcC66RMMwtpx4iHnkRffoNAgEdGOl
/M5TCZGhs+F/mp3Lc0WdR5DFHkM6yy2Tkk9wCFLreB4bW67janAWnd7nbU4INqJj
+WqIgpzNMc/kfUhpBYTeQLORhL4e2TG9ADTi/zeUITlpnEsA65LOgXKEpeIFYnSX
KTl4GIZ9tjazG3Y1Eva7DYHDIErNNAtX67KBqf+WBgMV98eB0O6xIPN1WlmhDTqj
FGMLkb8msH1HHntvxDAuc4/ortnUy8vPI4o6zKP89HJJNjIM5p5eHEuVF5JnBw42
Rtu9Om6JqWx51nhAhJNBj9bUStYbhEl0vVQCwbkfPbDJhzTy3RR8z709q9+ZwOrL
xbp4aJBzqrzscjBEiSQbNCf2PyuOAdU0r1x81UN81ZN41d5qUcumcinjw4Y7vru8
z5zMlo1Iy/AWQYyu7jgHmnpI7ZyA/1Qclo5dV7aa72bLFaJa35e7QxgfQOFBA5dY
UZl6QPJRlnB80uGRzD5jCh2O2sQ3XZqYnpaKsUAka1GgbceCp9IC4A5mfZvpACsh
Xk8VXjlhvY/iPJsKLqrh4Oedg4Dj5M3PLL9C3MDfYeIP2qgXpbnk87UV1TPNSpY0
QcTxsXXXIw==
=H+/Z
-----END PGP SIGNATURE-----
Merge tag 'for-5.8/drivers-2020-06-01' of git://git.kernel.dk/linux-block
Pull block driver updates from Jens Axboe:
"On top of the core changes, here are the block driver changes for this
merge window:
- NVMe changes:
- NVMe over Fibre Channel protocol updates, which also reach
over to drivers/scsi/lpfc (James Smart)
- namespace revalidation support on the target (Anthony
Iliopoulos)
- gcc zero length array fix (Arnd Bergmann)
- nvmet cleanups (Chaitanya Kulkarni)
- misc cleanups and fixes (me, Keith Busch, Sagi Grimberg)
- use a SRQ per completion vector (Max Gurtovoy)
- fix handling of runtime changes to the queue count (Weiping
Zhang)
- t10 protection information support for nvme-rdma and
nvmet-rdma (Israel Rukshin and Max Gurtovoy)
- target side AEN improvements (Chaitanya Kulkarni)
- various fixes and minor improvements all over, icluding the
nvme part of the lpfc driver"
- Floppy code cleanup series (Willy, Denis)
- Floppy contention fix (Jiri)
- Loop CONFIGURE support (Martijn)
- bcache fixes/improvements (Coly, Joe, Colin)
- q->queuedata cleanups (Christoph)
- Get rid of ioctl_by_bdev (Christoph, Stefan)
- md/raid5 allocation fixes (Coly)
- zero length array fixes (Gustavo)
- swim3 task state fix (Xu)"
* tag 'for-5.8/drivers-2020-06-01' of git://git.kernel.dk/linux-block: (166 commits)
bcache: configure the asynchronous registertion to be experimental
bcache: asynchronous devices registration
bcache: fix refcount underflow in bcache_device_free()
bcache: Convert pr_<level> uses to a more typical style
bcache: remove redundant variables i and n
lpfc: Fix return value in __lpfc_nvme_ls_abort
lpfc: fix axchg pointer reference after free and double frees
lpfc: Fix pointer checks and comments in LS receive refactoring
nvme: set dma alignment to qword
nvmet: cleanups the loop in nvmet_async_events_process
nvmet: fix memory leak when removing namespaces and controllers concurrently
nvmet-rdma: add metadata/T10-PI support
nvmet: add metadata support for block devices
nvmet: add metadata/T10-PI support
nvme: add Metadata Capabilities enumerations
nvmet: rename nvmet_check_data_len to nvmet_check_transfer_len
nvmet: rename nvmet_rw_len to nvmet_rw_data_len
nvmet: add metadata characteristics for a namespace
nvme-rdma: add metadata/T10-PI support
nvme-rdma: introduce nvme_rdma_sgl structure
...
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl7VOwMQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpoR7EADAlz3TCkb4wwuHytTBDrm6gVDdsJ9zUfQW
Cl2ASLtufA8PWZUCEI3vhFyOe6P5e+ZZ0O2HjljSevmHyogCaRYXFYVfbWKcQKuk
AcxiTgnYNevh8KbGLfJY1WL4eXsY+C3QUGivg35cCgrx+kr9oDaHMeqA9Tm1plyM
FSprDBoSmHPqRxiV/1gnr8uXLX6K7i/fHzwmKgySMhavum7Ma8W3wdAGebzvQwrO
SbFSuJVgz06e4B1Fzr/wSvVNUE/qW/KqfGuQKIp7VQFIywbgG7TgRMHjE1FSnpnh
gn+BfL+O5gc0sTvcOTGOE0SRWWwLx961WNg8Azq08l3fzsxLA6h8/AnoDf3i+QMA
rHmLpWZIic2xPSvjaFHX3/V9ITyGYeAMpAR77EL+4ivWrKv5JrBhnSLDt1fKILdg
5elxm7RDI+C4nCP4xuTlVCy5gCd6gwjgytKj+NUWhNq1WiGAD0B54SSiV+SbCSH6
Om2f5trcxz8E4pqWcf0k3LjFapVKRNV8v/+TmVkCdRPBl3y9P0h0wFTkkcEquqnJ
y7Yq6efdWviRCnX5w/r/yj0qBuk4xo5hMVsPmlthCWtnBm+xZQ6LwMRcq4HQgZgR
2SYNscZ3OFMekHssH7DvY4DAy1J+n83ims+KzbScbLg2zCZjh/scQuv38R5Eh9WZ
rCS8c+T7Ig==
=HYf4
-----END PGP SIGNATURE-----
Merge tag 'for-5.8/block-2020-06-01' of git://git.kernel.dk/linux-block
Pull block updates from Jens Axboe:
"Core block changes that have been queued up for this release:
- Remove dead blk-throttle and blk-wbt code (Guoqing)
- Include pid in blktrace note traces (Jan)
- Don't spew I/O errors on wouldblock termination (me)
- Zone append addition (Johannes, Keith, Damien)
- IO accounting improvements (Konstantin, Christoph)
- blk-mq hardware map update improvements (Ming)
- Scheduler dispatch improvement (Salman)
- Inline block encryption support (Satya)
- Request map fixes and improvements (Weiping)
- blk-iocost tweaks (Tejun)
- Fix for timeout failing with error injection (Keith)
- Queue re-run fixes (Douglas)
- CPU hotplug improvements (Christoph)
- Queue entry/exit improvements (Christoph)
- Move DMA drain handling to the few drivers that use it (Christoph)
- Partition handling cleanups (Christoph)"
* tag 'for-5.8/block-2020-06-01' of git://git.kernel.dk/linux-block: (127 commits)
block: mark bio_wouldblock_error() bio with BIO_QUIET
blk-wbt: rename __wbt_update_limits to wbt_update_limits
blk-wbt: remove wbt_update_limits
blk-throttle: remove tg_drain_bios
blk-throttle: remove blk_throtl_drain
null_blk: force complete for timeout request
blk-mq: drain I/O when all CPUs in a hctx are offline
blk-mq: add blk_mq_all_tag_iter
blk-mq: open code __blk_mq_alloc_request in blk_mq_alloc_request_hctx
blk-mq: use BLK_MQ_NO_TAG in more places
blk-mq: rename BLK_MQ_TAG_FAIL to BLK_MQ_NO_TAG
blk-mq: move more request initialization to blk_mq_rq_ctx_init
blk-mq: simplify the blk_mq_get_request calling convention
blk-mq: remove the bio argument to ->prepare_request
nvme: force complete cancelled requests
blk-mq: blk-mq: provide forced completion method
block: fix a warning when blkdev.h is included for !CONFIG_BLOCK builds
block: blk-crypto-fallback: remove redundant initialization of variable err
block: reduce part_stat_lock() scope
block: use __this_cpu_add() instead of access by smp_processor_id()
...
Check permission before opening a real file.
ovl_path_open() is used by readdir and copy-up routines.
ovl_permission() theoretically already checked copy up permissions, but it
doesn't hurt to re-do these checks during the actual copy-up.
For directory reading ovl_permission() only checks access to topmost
underlying layer. Readdir on a merged directory accesses layers below the
topmost one as well. Permission wasn't checked for these layers.
Note: modifying ovl_permission() to perform this check would be far more
complex and hence more bug prone. The result is less precise permissions
returned in access(2). If this turns out to be an issue, we can revisit
this bug.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
In preparation for more permission checking, override credentials for
directory operations on the underlying filesystems.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
The three instances of ovl_path_open() in overlayfs/readdir.c do three
different things:
- pass f_flags from overlay file
- pass O_RDONLY | O_DIRECTORY
- pass just O_RDONLY
The value of f_flags can be (other than O_RDONLY):
O_WRONLY - not possible for a directory
O_RDWR - not possible for a directory
O_CREAT - masked out by dentry_open()
O_EXCL - masked out by dentry_open()
O_NOCTTY - masked out by dentry_open()
O_TRUNC - masked out by dentry_open()
O_APPEND - no effect on directory ops
O_NDELAY - no effect on directory ops
O_NONBLOCK - no effect on directory ops
__O_SYNC - no effect on directory ops
O_DSYNC - no effect on directory ops
FASYNC - no effect on directory ops
O_DIRECT - no effect on directory ops
O_LARGEFILE - ?
O_DIRECTORY - only affects lookup
O_NOFOLLOW - only affects lookup
O_NOATIME - overlay sets this unconditionally in ovl_path_open()
O_CLOEXEC - only affects fd allocation
O_PATH - no effect on directory ops
__O_TMPFILE - not possible for a directory
Fon non-merge directories we use the underlying filesystem's iterate; in
this case honor O_LARGEFILE from the original file to make sure that open
doesn't get rejected.
For merge directories it's safe to pass O_LARGEFILE unconditionally since
userspace will only see the artificial offsets created by overlayfs.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Amir pointed me to metacopy test cases in unionmount-testsuite and I
decided to run "./run --ov=10 --meta" and it failed while running test
"rename-mass-5.py".
Problem is w.r.t absolute redirect traversal on intermediate metacopy
dentry. We do not store intermediate metacopy dentries and also skip
current loop/layer and move onto lookup in next layer. But at the end of
loop, we have logic to reset "poe" and layer index if currnently looked up
dentry has absolute redirect. We skip all that and that means lookup in
next layer will fail.
Following is simple test case to reproduce this.
- mkdir -p lower upper work merged lower/a lower/b
- touch lower/a/foo.txt
- mount -t overlay -o lowerdir=lower,upperdir=upper,workdir=work,metacopy=on none merged
# Following will create absolute redirect "/a/foo.txt" on upper/b/bar.txt.
- mv merged/a/foo.txt merged/b/bar.txt
# unmount overlay and use upper as lower layer (lower2) for next mount.
- umount merged
- mv upper lower2
- rm -rf work; mkdir -p upper work
- mount -t overlay -o lowerdir=lower2:lower,upperdir=upper,workdir=work,metacopy=on none merged
# Force a metacopy copy-up
- chown bin:bin merged/b/bar.txt
# unmount overlay and use upper as lower layer (lower3) for next mount.
- umount merged
- mv upper lower3
- rm -rf work; mkdir -p upper work
- mount -t overlay -o lowerdir=lower3:lower2:lower,upperdir=upper,workdir=work,metacopy=on none merged
# ls merged/b/bar.txt
ls: cannot access 'bar.txt': Input/output error
Intermediate lower layer (lower2) has metacopy dentry b/bar.txt with
absolute redirect "/a/foo.txt". We skipped redirect processing at the end
of loop which sets poe to roe and sets the appropriate next lower layer
index. And that means lookup failed in next layer.
Fix this by continuing the loop for any intermediate dentries. We still do
not save these at lower stack. With this fix applied unionmount-testsuite,
"./run --ov-10 --meta" now passes.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Currently ovl_get_inode() initializes OVL_UPPERDATA flag and for that it
has to call ovl_check_metacopy_xattr() and check if metacopy xattr is
present or not.
yangerkun reported sometimes underlying filesystem might return -EIO and in
that case error handling path does not cleanup properly leading to various
warnings.
Run generic/461 with ext4 upper/lower layer sometimes may trigger the bug
as below(linux 4.19):
[ 551.001349] overlayfs: failed to get metacopy (-5)
[ 551.003464] overlayfs: failed to get inode (-5)
[ 551.004243] overlayfs: cleanup of 'd44/fd51' failed (-5)
[ 551.004941] overlayfs: failed to get origin (-5)
[ 551.005199] ------------[ cut here ]------------
[ 551.006697] WARNING: CPU: 3 PID: 24674 at fs/inode.c:1528 iput+0x33b/0x400
...
[ 551.027219] Call Trace:
[ 551.027623] ovl_create_object+0x13f/0x170
[ 551.028268] ovl_create+0x27/0x30
[ 551.028799] path_openat+0x1a35/0x1ea0
[ 551.029377] do_filp_open+0xad/0x160
[ 551.029944] ? vfs_writev+0xe9/0x170
[ 551.030499] ? page_counter_try_charge+0x77/0x120
[ 551.031245] ? __alloc_fd+0x160/0x2a0
[ 551.031832] ? do_sys_open+0x189/0x340
[ 551.032417] ? get_unused_fd_flags+0x34/0x40
[ 551.033081] do_sys_open+0x189/0x340
[ 551.033632] __x64_sys_creat+0x24/0x30
[ 551.034219] do_syscall_64+0xd5/0x430
[ 551.034800] entry_SYSCALL_64_after_hwframe+0x44/0xa9
One solution is to improve error handling and call iget_failed() if error
is encountered. Amir thinks that this path is little intricate and there
is not real need to check and initialize OVL_UPPERDATA in ovl_get_inode().
Instead caller of ovl_get_inode() can initialize this state. And this will
avoid double checking of metacopy xattr lookup in ovl_lookup() and
ovl_get_inode().
OVL_UPPERDATA is inode flag. So I was little concerned that initializing
it outside ovl_get_inode() might have some races. But this is one way
transition. That is once a file has been fully copied up, it can't go back
to metacopy file again. And that seems to help avoid races. So as of now
I can't see any races w.r.t OVL_UPPERDATA being set wrongly. So move
settingof OVL_UPPERDATA inside the callers of ovl_get_inode().
ovl_obtain_alias() already does it. So only two callers now left are
ovl_lookup() and ovl_instantiate().
Reported-by: yangerkun <yangerkun@huawei.com>
Suggested-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Currently we use a variable "metacopy" which signifies that dentry could be
either uppermetacopy or lowermetacopy. Amir suggested that we can move
code around and use d.metacopy in such a way that we don't need
lowermetacopy and just can do away with uppermetacopy.
So this patch replaces "metacopy" with "uppermetacopy".
It also moves some code little higher to keep reading little simpler.
Suggested-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
overlayfs can keep index of copied up files and directories and it seems to
serve two primary puroposes. For regular files, it avoids breaking lower
hardlinks over copy up. For directories it seems to be used for various
error checks.
During ovl_lookup(), we lookup for index using lower dentry in many a
cases. That lower dentry is called "origin" and following is a summary of
current logic.
If there is no upperdentry, always lookup for index using lower dentry.
For regular files it helps avoiding breaking hard links over copyup and for
directories it seems to be just error checks.
If there is an upperdentry, then there are 3 possible cases.
- For directories, lower dentry is found using two ways. One is regular
path based lookup in lower layers and second is using ORIGIN xattr on
upper dentry. First verify that path based lookup lower dentry matches
the one pointed by upper ORIGIN xattr. If yes, use this verified origin
for index lookup.
- For regular files (non-metacopy), there is no path based lookup in lower
layers as lookup stops once we find upper dentry. So there is no origin
verification. If there is ORIGIN xattr present on upper, use that to
lookup index otherwise don't.
- For regular metacopy files, again lower dentry is found using path based
lookup as well as ORIGIN xattr on upper. Path based lookup is continued
in this case to find lower data dentry for metacopy upper. So like
directories we only use verified origin. If ORIGIN xattr is not present
(Either because lower did not support file handles or because this is
hardlink copied up with index=off), then don't use path lookup based
lower dentry as origin. This is same as regular non-metacopy file case.
Suggested-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
syzbot reported out of bounds memory access from open_by_handle_at()
with a crafted file handle that looks like this:
{ .handle_bytes = 2, .handle_type = OVL_FILEID_V1 }
handle_bytes gets rounded down to 0 and we end up calling:
ovl_check_fh_len(fh, 0) => ovl_check_fb_len(fh + 3, -3)
But fh buffer is only 2 bytes long, so accessing struct ovl_fb at
fh + 3 is illegal.
Fixes: cbe7fba8ed ("ovl: make sure that real fid is 32bit aligned in memory")
Reported-and-tested-by: syzbot+61958888b1c60361a791@syzkaller.appspotmail.com
Cc: <stable@vger.kernel.org> # v5.5
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
- Rework the system-wide PM driver flags to make them easier to
understand and use and update their documentation (Rafael Wysocki,
Alan Stern).
- Allow cpuidle governors to be switched at run time regardless of
the kernel configuration and update the related documentation
accordingly (Hanjun Guo).
- Improve the resume device handling in the user space hibernarion
interface code (Domenico Andreoli).
- Document the intel-speed-select sysfs interface (Srinivas
Pandruvada).
- Make the ACPI code handing suspend to idle print more debug
messages to help diagnose issues with it (Rafael Wysocki).
- Fix a helper routine in the cpufreq core and correct a typo in
the struct cpufreq_driver kerneldoc comment (Rafael Wysocki, Wang
Wenhu).
- Update cpufreq drivers:
* Make the intel_pstate driver start in the passive mode by
default on systems without HWP (Rafael Wysocki).
* Add i.MX7ULP support to the imx-cpufreq-dt driver and add
i.MX7ULP to the cpufreq-dt-platdev blacklist (Peng Fan).
* Convert the qoriq cpufreq driver to a platform one, make the
platform code create a suitable device object for it and add
platform dependencies to it (Mian Yousaf Kaukab, Geert
Uytterhoeven).
* Fix wrong compatible binding in the qcom driver (Ansuel Smith).
* Build the omap driver by default for ARCH_OMAP2PLUS (Anders
Roxell).
* Add r8a7742 SoC support to the dt cpufreq driver (Lad Prabhakar).
- Update cpuidle core and drivers:
* Fix three reference count leaks in error code paths in the
cpuidle core (Qiushi Wu).
* Convert Qualcomm SPM to a generic cpuidle driver (Stephan
Gerhold).
* Fix up the execution order when entering a domain idle state in
the PSCI driver (Ulf Hansson).
- Fix a reference counting issue related to clock management and
clean up two oddities in the PM-runtime framework (Rafael Wysocki,
Andy Shevchenko).
- Add ElkhartLake support to the Intel RAPL power capping driver
and remove an unused local MSR definition from it (Jacob Pan,
Sumeet Pawnikar).
- Update devfreq core and drivers:
* Replace strncpy() with strscpy() in the devfreq core and use
lockdep asserts instead of manual checks for a locked mutex in
it (Dmitry Osipenko, Krzysztof Kozlowski).
* Add a generic imx bus scaling driver and make it register an
interconnect device (Leonard Crestez, Gustavo A. R. Silva).
* Make the cpufreq notifier in the tegra30 driver take boosting
into account and delete an unuseful error message from that
driver (Dmitry Osipenko, Markus Elfring).
- Remove unneeded semicolon from the cpupower code (Zou Wei).
-----BEGIN PGP SIGNATURE-----
iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAl7VGjwSHHJqd0Byand5
c29ja2kubmV0AAoJEILEb/54YlRx46gP/jGAXlddFEQswi6qUT3Cff0A9mb8CdcX
dyKrjX4xxo/wtBIAwSN4achxrgse//ayo2dYTzWRDd31W9Azbv+5F+46XsDRz4hL
pH29u/E66NMtFWnHCmt78NEJn0FzSa0YBC43ZzwFwKktCK9skYIpGN2z6iuXUBSX
Q5GHqop3zvDsdKQFBGL62xvUw/AmOTPG7ohIZvqWBN2mbOqEqMcoFHT+aUF/NbLj
+i14dvTH767eDZGRVASmXWQyljjaRWm+SIw4+m8zT1D1Y3d5IFObuMN+9RQl1Tif
BYjkgJ2oDDMhCJLW7TBuJB+g7exiyaSQds3nMr2ZR+eZbJipICjU4eehNEKIUopU
DM17tHQfnwZfS/7YbCx3vYQwLkNq37AJyXS9uqCAIFM+0n4xN4/mIVmgWYISLDTs
1v9olFxtwMRNpjGGQWPJAO7ebB8Zz9qhQv7pIkSQEfwp93/SzvlVf4vvruTeFN9J
qqG60cDumXWAm+s43eQHJNn5nOd5ocWv0FBpo/cxqKbzxFVWwdB42Cm0SY+rK2ID
uHdnc2DJcK2c78UVbz3Cmk4272foJt2zxchqjFXXAZPLrOsFfzmti4B28VxGxjmP
LG3MhH5sdbF4yl/1aSC1Bnrt+PV9Lus6ut/VKhjwIpw8cqiXgpwSbMoDoaBd9UMQ
ubGz2rplGAtB
=APdj
-----END PGP SIGNATURE-----
Merge tag 'pm-5.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management updates from Rafael Wysocki:
"These rework the system-wide PM driver flags, make runtime switching
of cpuidle governors easier, improve the user space hibernation
interface code, add intel-speed-select interface documentation, add
more debug messages to the ACPI code handling suspend to idle, update
the cpufreq core and drivers, fix a minor issue in the cpuidle core
and update two cpuidle drivers, improve the PM-runtime framework,
update the Intel RAPL power capping driver, update devfreq core and
drivers, and clean up the cpupower utility.
Specifics:
- Rework the system-wide PM driver flags to make them easier to
understand and use and update their documentation (Rafael Wysocki,
Alan Stern).
- Allow cpuidle governors to be switched at run time regardless of
the kernel configuration and update the related documentation
accordingly (Hanjun Guo).
- Improve the resume device handling in the user space hibernarion
interface code (Domenico Andreoli).
- Document the intel-speed-select sysfs interface (Srinivas
Pandruvada).
- Make the ACPI code handing suspend to idle print more debug
messages to help diagnose issues with it (Rafael Wysocki).
- Fix a helper routine in the cpufreq core and correct a typo in the
struct cpufreq_driver kerneldoc comment (Rafael Wysocki, Wang
Wenhu).
- Update cpufreq drivers:
- Make the intel_pstate driver start in the passive mode by
default on systems without HWP (Rafael Wysocki).
- Add i.MX7ULP support to the imx-cpufreq-dt driver and add
i.MX7ULP to the cpufreq-dt-platdev blacklist (Peng Fan).
- Convert the qoriq cpufreq driver to a platform one, make the
platform code create a suitable device object for it and add
platform dependencies to it (Mian Yousaf Kaukab, Geert
Uytterhoeven).
- Fix wrong compatible binding in the qcom driver (Ansuel Smith).
- Build the omap driver by default for ARCH_OMAP2PLUS (Anders
Roxell).
- Add r8a7742 SoC support to the dt cpufreq driver (Lad
Prabhakar).
- Update cpuidle core and drivers:
- Fix three reference count leaks in error code paths in the
cpuidle core (Qiushi Wu).
- Convert Qualcomm SPM to a generic cpuidle driver (Stephan
Gerhold).
- Fix up the execution order when entering a domain idle state in
the PSCI driver (Ulf Hansson).
- Fix a reference counting issue related to clock management and
clean up two oddities in the PM-runtime framework (Rafael Wysocki,
Andy Shevchenko).
- Add ElkhartLake support to the Intel RAPL power capping driver and
remove an unused local MSR definition from it (Jacob Pan, Sumeet
Pawnikar).
- Update devfreq core and drivers:
- Replace strncpy() with strscpy() in the devfreq core and use
lockdep asserts instead of manual checks for a locked mutex in
it (Dmitry Osipenko, Krzysztof Kozlowski).
- Add a generic imx bus scaling driver and make it register an
interconnect device (Leonard Crestez, Gustavo A. R. Silva).
- Make the cpufreq notifier in the tegra30 driver take boosting
into account and delete an unuseful error message from that
driver (Dmitry Osipenko, Markus Elfring).
- Remove unneeded semicolon from the cpupower code (Zou Wei)"
* tag 'pm-5.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (51 commits)
cpuidle: Fix three reference count leaks
PM: runtime: Replace pm_runtime_callbacks_present()
PM / devfreq: Use lockdep asserts instead of manual checks for locked mutex
PM / devfreq: imx-bus: Fix inconsistent IS_ERR and PTR_ERR
PM / devfreq: Replace strncpy with strscpy
PM / devfreq: imx: Register interconnect device
PM / devfreq: Add generic imx bus scaling driver
PM / devfreq: tegra30: Delete an error message in tegra_devfreq_probe()
PM / devfreq: tegra30: Make CPUFreq notifier to take into account boosting
PM: hibernate: Restrict writes to the resume device
PM: runtime: clk: Fix clk_pm_runtime_get() error path
cpuidle: Convert Qualcomm SPM driver to a generic CPUidle driver
ACPI: EC: PM: s2idle: Extend GPE dispatching debug message
ACPI: PM: s2idle: Print type of wakeup debug messages
powercap: RAPL: remove unused local MSR define
PM: runtime: Make clear what we do when conditions are wrong in rpm_suspend()
Documentation: admin-guide: pm: Document intel-speed-select
PM: hibernate: Split off snapshot dev option
PM: hibernate: Incorporate concurrency handling
Documentation: ABI: make current_governer_ro as a candidate for removal
...
Before this patch, the error path of function gfs2_create_inode would
always calls gfs2_glock_put for the inode glock. That's good for inodes
that are free. But after they've been added to the vfs inodes, errors
will cause the inode to be evicted, and the evict will do the glock
put for us. If we do a glock put again, we can try to free the glock
while there are still references to it, e.g. revokes pending for
the transaction that created it.
This patch adds a check: if (free_vfs_inode) before the put, thus
solving the problem.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Merge updates from Andrew Morton:
"A few little subsystems and a start of a lot of MM patches.
Subsystems affected by this patch series: squashfs, ocfs2, parisc,
vfs. With mm subsystems: slab-generic, slub, debug, pagecache, gup,
swap, memcg, pagemap, memory-failure, vmalloc, kasan"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (128 commits)
kasan: move kasan_report() into report.c
mm/mm_init.c: report kasan-tag information stored in page->flags
ubsan: entirely disable alignment checks under UBSAN_TRAP
kasan: fix clang compilation warning due to stack protector
x86/mm: remove vmalloc faulting
mm: remove vmalloc_sync_(un)mappings()
x86/mm/32: implement arch_sync_kernel_mappings()
x86/mm/64: implement arch_sync_kernel_mappings()
mm/ioremap: track which page-table levels were modified
mm/vmalloc: track which page-table levels were modified
mm: add functions to track page directory modifications
s390: use __vmalloc_node in stack_alloc
powerpc: use __vmalloc_node in alloc_vm_stack
arm64: use __vmalloc_node in arch_alloc_vmap_stack
mm: remove vmalloc_user_node_flags
mm: switch the test_vmalloc module to use __vmalloc_node
mm: remove __vmalloc_node_flags_caller
mm: remove both instances of __vmalloc_node_flags
mm: remove the prot argument to __vmalloc_node
mm: remove the pgprot argument to __vmalloc
...
This is always PAGE_KERNEL - for long term mappings with other properties
vmap should be used.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: David Airlie <airlied@linux.ie>
Cc: Gao Xiang <xiang@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Michael Kelley <mikelley@microsoft.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Sakari Ailus <sakari.ailus@linux.intel.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: Wei Liu <wei.liu@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Paul Mackerras <paulus@ozlabs.org>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Will Deacon <will@kernel.org>
Link: http://lkml.kernel.org/r/20200414131348.444715-19-hch@lst.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
After an NFS page has been written it is considered "unstable" until a
COMMIT request succeeds. If the COMMIT fails, the page will be
re-written.
These "unstable" pages are currently accounted as "reclaimable", either
in WB_RECLAIMABLE, or in NR_UNSTABLE_NFS which is included in a
'reclaimable' count. This might have made sense when sending the COMMIT
required a separate action by the VFS/MM (e.g. releasepage() used to
send a COMMIT). However now that all writes generated by ->writepages()
will automatically be followed by a COMMIT (since commit 919e3bd9a8
("NFS: Ensure we commit after writeback is complete")) it makes more
sense to treat them as writeback pages.
So this patch removes NR_UNSTABLE_NFS and accounts unstable pages in
NR_WRITEBACK and WB_WRITEBACK.
A particular effect of this change is that when
wb_check_background_flush() calls wb_over_bg_threshold(), the latter
will report 'true' a lot less often as the 'unstable' pages are no
longer considered 'dirty' (as there is nothing that writeback can do
about them anyway).
Currently wb_check_background_flush() will trigger writeback to NFS even
when there are relatively few dirty pages (if there are lots of unstable
pages), this can result in small writes going to the server (10s of
Kilobytes rather than a Megabyte) which hurts throughput. With this
patch, there are fewer writes which are each larger on average.
Where the NR_UNSTABLE_NFS count was included in statistics
virtual-files, the entry is retained, but the value is hard-coded as
zero. static trace points and warning printks which mentioned this
counter no longer report it.
[akpm@linux-foundation.org: re-layout comment]
[akpm@linux-foundation.org: fix printk warning]
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Acked-by: Michal Hocko <mhocko@suse.com> [mm]
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chuck Lever <chuck.lever@oracle.com>
Link: http://lkml.kernel.org/r/87d06j7gqa.fsf@notabene.neil.brown.name
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
PF_LESS_THROTTLE exists for loop-back nfsd (and a similar need in the
loop block driver and callers of prctl(PR_SET_IO_FLUSHER)), where a
daemon needs to write to one bdi (the final bdi) in order to free up
writes queued to another bdi (the client bdi).
The daemon sets PF_LESS_THROTTLE and gets a larger allowance of dirty
pages, so that it can still dirty pages after other processses have been
throttled. The purpose of this is to avoid deadlock that happen when
the PF_LESS_THROTTLE process must write for any dirty pages to be freed,
but it is being thottled and cannot write.
This approach was designed when all threads were blocked equally,
independently on which device they were writing to, or how fast it was.
Since that time the writeback algorithm has changed substantially with
different threads getting different allowances based on non-trivial
heuristics. This means the simple "add 25%" heuristic is no longer
reliable.
The important issue is not that the daemon needs a *larger* dirty page
allowance, but that it needs a *private* dirty page allowance, so that
dirty pages for the "client" bdi that it is helping to clear (the bdi
for an NFS filesystem or loop block device etc) do not affect the
throttling of the daemon writing to the "final" bdi.
This patch changes the heuristic so that the task is not throttled when
the bdi it is writing to has a dirty page count below below (or equal
to) the free-run threshold for that bdi. This ensures it will always be
able to have some pages in flight, and so will not deadlock.
In a steady-state, it is expected that PF_LOCAL_THROTTLE tasks might
still be throttled by global threshold, but that is acceptable as it is
only the deadlock state that is interesting for this flag.
This approach of "only throttle when target bdi is busy" is consistent
with the other use of PF_LESS_THROTTLE in current_may_throttle(), were
it causes attention to be focussed only on the target bdi.
So this patch
- renames PF_LESS_THROTTLE to PF_LOCAL_THROTTLE,
- removes the 25% bonus that that flag gives, and
- If PF_LOCAL_THROTTLE is set, don't delay at all unless the
global and the local free-run thresholds are exceeded.
Note that previously realtime threads were treated the same as
PF_LESS_THROTTLE threads. This patch does *not* change the behvaiour
for real-time threads, so it is now different from the behaviour of nfsd
and loop tasks. I don't know what is wanted for realtime.
[akpm@linux-foundation.org: coding style fixes]
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Chuck Lever <chuck.lever@oracle.com> [nfsd]
Cc: Christoph Hellwig <hch@lst.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Link: http://lkml.kernel.org/r/87ftbf7gs3.fsf@notabene.neil.brown.name
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since the new pair function is introduced, we can call them to clean the
code in orangefs.
Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Mike Marshall <hubcap@omnibond.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Martin Brandenburg <martin@omnibond.com>
Link: http://lkml.kernel.org/r/20200517214718.468-9-guoqing.jiang@cloud.ionos.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Call the new function since attach_page_buffers will be removed.
Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Anton Altaparmakov <anton@tuxera.com>
Link: http://lkml.kernel.org/r/20200517214718.468-8-guoqing.jiang@cloud.ionos.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since the new pair function is introduced, we can call them to clean the
code in iomap.
Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dave Chinner <david@fromorbit.com>
Link: http://lkml.kernel.org/r/20200517214718.468-7-guoqing.jiang@cloud.ionos.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since the new pair function is introduced, we can call them to clean the
code in f2fs.h.
Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Chao Yu <yuchao0@huawei.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Link: http://lkml.kernel.org/r/20200517214718.468-6-guoqing.jiang@cloud.ionos.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since the new pair function is introduced, we can call them to clean the
code in buffer.c.
Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Link: http://lkml.kernel.org/r/20200517214718.468-5-guoqing.jiang@cloud.ionos.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since the new pair function is introduced, we can call them to clean the
code in btrfs.
Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: David Sterba <dsterba@suse.com>
Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Link: http://lkml.kernel.org/r/20200517214718.468-4-guoqing.jiang@cloud.ionos.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use the new readahead operation in iomap. Convert XFS and ZoneFS to use
it.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Cc: Chao Yu <yuchao0@huawei.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Gao Xiang <gaoxiang25@huawei.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Link: http://lkml.kernel.org/r/20200414150233.24495-26-willy@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Implement the new readahead operation in fuse by using __readahead_batch()
to fill the array of pages in fuse_args_pages directly. This lets us
inline fuse_readpages_fill() into fuse_readahead().
[willy@infradead.org: build fix]
Link: http://lkml.kernel.org/r/20200415025938.GB5820@bombadil.infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Acked-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: Chao Yu <yuchao0@huawei.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Gao Xiang <gaoxiang25@huawei.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: http://lkml.kernel.org/r/20200414150233.24495-25-willy@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This function now only uses the mapping argument to look up the inode, and
both callers already have the inode, so just pass the inode instead of the
mapping.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Acked-by: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Gao Xiang <gaoxiang25@huawei.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Link: http://lkml.kernel.org/r/20200414150233.24495-24-willy@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use the new readahead operation in f2fs
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Acked-by: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Gao Xiang <gaoxiang25@huawei.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Link: http://lkml.kernel.org/r/20200414150233.24495-23-willy@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This function now only uses the mapping argument to look up the inode, and
both callers already have the inode, so just pass the inode instead of the
mapping.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Cc: Chao Yu <yuchao0@huawei.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Gao Xiang <gaoxiang25@huawei.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Link: http://lkml.kernel.org/r/20200414150233.24495-22-willy@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use the new readahead operation in ext4
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Cc: Chao Yu <yuchao0@huawei.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Gao Xiang <gaoxiang25@huawei.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Link: http://lkml.kernel.org/r/20200414150233.24495-21-willy@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use the new readahead operation in erofs.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Acked-by: Gao Xiang <gaoxiang25@huawei.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Link: http://lkml.kernel.org/r/20200414150233.24495-20-willy@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use the new readahead operation in erofs
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Acked-by: Gao Xiang <gaoxiang25@huawei.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Link: http://lkml.kernel.org/r/20200414150233.24495-19-willy@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Implement the new readahead method in btrfs using the new
readahead_page_batch() function.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Cc: Chao Yu <yuchao0@huawei.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Gao Xiang <gaoxiang25@huawei.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Link: http://lkml.kernel.org/r/20200414150233.24495-18-willy@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Implement the new readahead aop and convert all callers (block_dev,
exfat, ext2, fat, gfs2, hpfs, isofs, jfs, nilfs2, ocfs2, omfs, qnx6,
reiserfs & udf).
The callers are all trivial except for GFS2 & OCFS2.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> # ocfs2
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> # ocfs2
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Cc: Chao Yu <yuchao0@huawei.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Gao Xiang <gaoxiang25@huawei.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Link: http://lkml.kernel.org/r/20200414150233.24495-17-willy@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
ext4 and f2fs have duplicated the guts of the readahead code so they can
read past i_size. Instead, separate out the guts of the readahead code
so they can call it directly.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Cc: Chao Yu <yuchao0@huawei.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Gao Xiang <gaoxiang25@huawei.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Link: http://lkml.kernel.org/r/20200414150233.24495-14-willy@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When syncing out a block device (a'la __sync_blockdev), any error
encountered will only be recorded in the bd_inode's mapping. When the
blockdev contains a filesystem however, we'd like to also record the
error in the super_block that's stored there.
Make mark_buffer_write_io_error also record the error in the
corresponding super_block when a writeback error occurs and the block
device contains a mounted superblock.
Since superblocks are RCU freed, hold the rcu_read_lock to ensure that
the superblock doesn't go away while we're marking it.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andres Freund <andres@anarazel.de>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dave Chinner <david@fromorbit.com>
Link: http://lkml.kernel.org/r/20200428135155.19223-3-jlayton@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "vfs: have syncfs() return error when there are writeback
errors", v6.
Currently, syncfs does not return errors when one of the inodes fails to
be written back. It will return errors based on the legacy AS_EIO and
AS_ENOSPC flags when syncing out the block device fails, but that's not
particularly helpful for filesystems that aren't backed by a blockdev.
It's also possible for a stray sync to lose those errors.
The basic idea in this set is to track writeback errors at the
superblock level, so that we can quickly and easily check whether
something bad happened without having to fsync each file individually.
syncfs is then changed to reliably report writeback errors after they
occur, much in the same fashion as fsync does now.
This patch (of 2):
Usually we suggest that applications call fsync when they want to ensure
that all data written to the file has made it to the backing store, but
that can be inefficient when there are a lot of open files.
Calling syncfs on the filesystem can be more efficient in some
situations, but the error reporting doesn't currently work the way most
people expect. If a single inode on a filesystem reports a writeback
error, syncfs won't necessarily return an error. syncfs only returns an
error if __sync_blockdev fails, and on some filesystems that's a no-op.
It would be better if syncfs reported an error if there were any
writeback failures. Then applications could call syncfs to see if there
are any errors on any open files, and could then call fsync on all of
the other descriptors to figure out which one failed.
This patch adds a new errseq_t to struct super_block, and has
mapping_set_error also record writeback errors there.
To report those errors, we also need to keep an errseq_t in struct file
to act as a cursor. This patch adds a dedicated field for that purpose,
which slots nicely into 4 bytes of padding at the end of struct file on
x86_64.
An earlier version of this patch used an O_PATH file descriptor to cue
the kernel that the open file should track the superblock error and not
the inode's writeback error.
I think that API is just too weird though. This is simpler and should
make syncfs error reporting "just work" even if someone is multiplexing
fsync and syncfs on the same fds.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Andres Freund <andres@anarazel.de>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: David Howells <dhowells@redhat.com>
Link: http://lkml.kernel.org/r/20200428135155.19223-1-jlayton@kernel.org
Link: http://lkml.kernel.org/r/20200428135155.19223-2-jlayton@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Usually we create and use a ocfs2 shared volume on the top of ha stack.
For pcmk based ha stack, which includes DLM, corosync and pacemaker
services.
The customers complained they could not mount existent ocfs2 volume in
the single node without ha stack, e.g. single node backup/restore
scenario.
Like this case, the customers just want to access the data from the
existent ocfs2 volume quickly, but do not want to restart or setup ha
stack.
Then, I'd like to add a mount option "nocluster", if the users use this
option to mount a ocfs2 shared volume, the whole mount will not depend
on the ha related services. the command will mount the existent ocfs2
volume directly (like local mount), for avoiding setup the ha stack.
Signed-off-by: Gang He <ghe@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Jun Piao <piaojun@huawei.com>
Link: http://lkml.kernel.org/r/20200423053300.22661-1-ghe@suse.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Sparse reports a warning at dlm_empty_lockres()
warning: context imbalance in dlm_purge_lockres() - unexpected unlock
The root cause is the missing annotation at dlm_purge_lockres()
Add the missing __must_hold(&dlm->spinlock)
Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Link: http://lkml.kernel.org/r/20200403160505.2832-4-jbi.octave@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
ll_rw_block() function has been deprecated in favor of BIO which appears
to come with large performance improvements.
This patch decreases boot time by close to 40% when using squashfs for
the root file-system. This is observed at least in the context of
starting an Android VM on Chrome OS using crosvm. The patch was tested
on 4.19 as well as master.
This patch is largely based on Adrien Schildknecht's patch that was
originally sent as https://lkml.org/lkml/2017/9/22/814 though with some
significant changes and simplifications while also taking Phillip
Lougher's feedback into account, around preserving support for
FILE_CACHE in particular.
[akpm@linux-foundation.org: fix build error reported by Randy]
Link: http://lkml.kernel.org/r/319997c2-5fc8-f889-2ea3-d913308a7c1f@infradead.org
Signed-off-by: Philippe Liard <pliard@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Adrien Schildknecht <adrien+dev@schischi.me>
Cc: Phillip Lougher <phillip@squashfs.org.uk>
Cc: Guenter Roeck <groeck@chromium.org>
Cc: Daniel Rosenberg <drosen@google.com>
Link: https://chromium.googlesource.com/chromiumos/platform/crosvm
Link: http://lkml.kernel.org/r/20191106074238.186023-1-pliard@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Before this patch, a simple typo accidentally added \n to the jid=
string for lock_nolock mounts. This made it impossible to mount a
gfs2 file system with a journal other than journal0. Thus:
mount -tgfs2 -o hostdata="jid=1" <device> <mount pt>
Resulted in:
mount: wrong fs type, bad option, bad superblock on <device>
In most cases this is not a problem. However, for debugging and
testing purposes we sometimes want to test the integrity of other
journals. This patch removes the unnecessary \n and thus allows
lock_nolock users to specify an alternate journal.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Before for this patch, function inode_go_sync ignored io errors
during inode_go_sync, overwriting them with metadata write errors:
error = filemap_fdatawait(mapping);
mapping_set_error(mapping, error);
}
error = filemap_fdatawait(metamapping);
...
return error;
So any errors returned by the inode write would be forgotten if the
metadata write succeeded. This patch still does both writes, but
only sets error if it's still zero. That way, any errors will be
reported by to the caller, do_xmote, which will take appropriate
action and report the error.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
We forget to call locks_move_blocks in posix_lock_inode when try to
process same owner and different types.
Signed-off-by: yangerkun <yangerkun@huawei.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
This commit moves channel picking code in separate function.
Signed-off-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
MS-SMB2 specification was updated in March. Make minor additions
and corrections to compression related definitions in smb2pdu.h
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Pull vfs updates from Al Viro:
"Assorted patches from Miklos.
An interesting part here is /proc/mounts stuff..."
The "/proc/mounts stuff" is using a cursor for keeeping the location
data while traversing the mount listing.
Also probably worth noting is the addition of faccessat2(), which takes
an additional set of flags to specify how the lookup is done
(AT_EACCESS, AT_SYMLINK_NOFOLLOW, AT_EMPTY_PATH).
* 'from-miklos' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
vfs: add faccessat2 syscall
vfs: don't parse "silent" option
vfs: don't parse "posixacl" option
vfs: don't parse forbidden flags
statx: add mount_root
statx: add mount ID
statx: don't clear STATX_ATIME on SB_RDONLY
uapi: deprecate STATX_ALL
utimensat: AT_EMPTY_PATH support
vfs: split out access_override_creds()
proc/mounts: add cursor
aio: fix async fsync creds
vfs: allow unprivileged whiteout creation
Pull uaccess/coredump updates from Al Viro:
"set_fs() removal in coredump-related area - mostly Christoph's
stuff..."
* 'work.set_fs-exec' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
binfmt_elf_fdpic: remove the set_fs(KERNEL_DS) in elf_fdpic_core_dump
binfmt_elf: remove the set_fs(KERNEL_DS) in elf_core_dump
binfmt_elf: remove the set_fs in fill_siginfo_note
signal: refactor copy_siginfo_to_user32
powerpc/spufs: simplify spufs core dumping
powerpc/spufs: stop using access_ok
powerpc/spufs: fix copy_to_user while atomic
Pull uaccess/__copy_to_user updates from Al Viro:
"Getting rid of __copy_to_user() callers - stuff that doesn't fit into
other series"
* 'uaccess.__copy_to_user' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
dlmfs: convert dlmfs_file_read() to copy_to_user()
esas2r: don't bother with __copy_to_user()
Pull uaccess/__copy_from_user updates from Al Viro:
"Getting rid of __copy_from_user() callers - patches that don't fit
into other series"
* 'uaccess.__copy_from_user' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
pstore: switch to copy_from_user()
firewire: switch ioctl_queue_iso to use of copy_from_user()
Pull uaccess/readdir updates from Al Viro:
"Finishing the conversion of readdir.c to unsafe_... API.
This includes the uaccess_{read,write}_begin series by Christophe
Leroy"
* 'uaccess.readdir' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
readdir.c: get rid of the last __put_user(), drop now-useless access_ok()
readdir.c: get compat_filldir() more or less in sync with filldir()
switch readdir(2) to unsafe_copy_dirent_name()
drm/i915/gem: Replace user_access_begin by user_write_access_begin
uaccess: Selectively open read or write user access
uaccess: Add user_read_access_begin/end and user_write_access_begin/end
Pull uaccess/access_ok updates from Al Viro:
"Removals of trivially pointless access_ok() calls.
Note: the fiemap stuff was removed from the series, since they are
duplicates with part of ext4 series carried in Ted's tree"
* 'uaccess.access_ok' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
vmci_host: get rid of pointless access_ok()
hfi1: get rid of pointless access_ok()
usb: get rid of pointless access_ok() calls
lpfc_debugfs: get rid of pointless access_ok()
efi_test: get rid of pointless access_ok()
drm_read(): get rid of pointless access_ok()
via-pmu: don't bother with access_ok()
drivers/crypto/ccp/sev-dev.c: get rid of pointless access_ok()
omapfb: get rid of pointless access_ok() calls
amifb: get rid of pointless access_ok() calls
drivers/fpga/dfl-afu-dma-region.c: get rid of pointless access_ok()
drivers/fpga/dfl-fme-pr.c: get rid of pointless access_ok()
cm4000_cs.c cmm_ioctl(): get rid of pointless access_ok()
nvram: drop useless access_ok()
n_hdlc_tty_read(): remove pointless access_ok()
tomoyo_write_control(): get rid of pointless access_ok()
btrfs_ioctl_send(): don't bother with access_ok()
fat_dir_ioctl(): hadn't needed that access_ok() for more than a decade...
dlmfs_file_write(): get rid of pointless access_ok()
set from Mauro toward the completion of the RST conversion. I *really*
hope we are getting close to the end of this. Meanwhile, those patches
reach pretty far afield to update document references around the tree;
there should be no actual code changes there. There will be, alas, more of
the usual trivial merge conflicts.
Beyond that we have more translations, improvements to the sphinx
scripting, a number of additions to the sysctl documentation, and lots of
fixes.
-----BEGIN PGP SIGNATURE-----
iQFDBAABCAAtFiEEIw+MvkEiF49krdp9F0NaE2wMflgFAl7VId8PHGNvcmJldEBs
d24ubmV0AAoJEBdDWhNsDH5Yq/gH/iaDgirQZV6UZ2v9sfwQNYolNpf2sKAuOZjd
bPFB7WJoMQbKwQEvYrAUL2+5zPOcLYuIfzyOfo1BV1py+EyKbACcKjI4AedxfJF7
+NchmOBhlEqmEhzx2U08HRc4/8J223WG17fJRVsV3p+opJySexSFeQucfOciX5NR
RUCxweWWyg/FgyqjkyMMTtsePqZPmcT5dWTlVXISlbWzcv5NFhuJXnSrw8Sfzcmm
SJMzqItv3O+CabnKQ8kMLV2PozXTMfjeWH47ZUK0Y8/8PP9+cvqwFzZ0UDQJ1Xaz
oyW/TqmunaXhfMsMFeFGSwtfgwRHvXdxkQdtwNHvo1dV4dzTvDw=
=fDC/
-----END PGP SIGNATURE-----
Merge tag 'docs-5.8' of git://git.lwn.net/linux
Pull documentation updates from Jonathan Corbet:
"A fair amount of stuff this time around, dominated by yet another
massive set from Mauro toward the completion of the RST conversion. I
*really* hope we are getting close to the end of this. Meanwhile,
those patches reach pretty far afield to update document references
around the tree; there should be no actual code changes there. There
will be, alas, more of the usual trivial merge conflicts.
Beyond that we have more translations, improvements to the sphinx
scripting, a number of additions to the sysctl documentation, and lots
of fixes"
* tag 'docs-5.8' of git://git.lwn.net/linux: (130 commits)
Documentation: fixes to the maintainer-entry-profile template
zswap: docs/vm: Fix typo accept_threshold_percent in zswap.rst
tracing: Fix events.rst section numbering
docs: acpi: fix old http link and improve document format
docs: filesystems: add info about efivars content
Documentation: LSM: Correct the basic LSM description
mailmap: change email for Ricardo Ribalda
docs: sysctl/kernel: document unaligned controls
Documentation: admin-guide: update bug-hunting.rst
docs: sysctl/kernel: document ngroups_max
nvdimm: fixes to maintainter-entry-profile
Documentation/features: Correct RISC-V kprobes support entry
Documentation/features: Refresh the arch support status files
Revert "docs: sysctl/kernel: document ngroups_max"
docs: move locking-specific documents to locking/
docs: move digsig docs to the security book
docs: move the kref doc into the core-api book
docs: add IRQ documentation at the core-api book
docs: debugging-via-ohci1394.txt: add it to the core-api book
docs: fix references for ipmi.rst file
...
- Branch Target Identification (BTI)
* Support for ARMv8.5-BTI in both user- and kernel-space. This
allows branch targets to limit the types of branch from which
they can be called and additionally prevents branching to
arbitrary code, although kernel support requires a very recent
toolchain.
* Function annotation via SYM_FUNC_START() so that assembly
functions are wrapped with the relevant "landing pad"
instructions.
* BPF and vDSO updates to use the new instructions.
* Addition of a new HWCAP and exposure of BTI capability to
userspace via ID register emulation, along with ELF loader
support for the BTI feature in .note.gnu.property.
* Non-critical fixes to CFI unwind annotations in the sigreturn
trampoline.
- Shadow Call Stack (SCS)
* Support for Clang's Shadow Call Stack feature, which reserves
platform register x18 to point at a separate stack for each
task that holds only return addresses. This protects function
return control flow from buffer overruns on the main stack.
* Save/restore of x18 across problematic boundaries (user-mode,
hypervisor, EFI, suspend, etc).
* Core support for SCS, should other architectures want to use it
too.
* SCS overflow checking on context-switch as part of the existing
stack limit check if CONFIG_SCHED_STACK_END_CHECK=y.
- CPU feature detection
* Removed numerous "SANITY CHECK" errors when running on a system
with mismatched AArch32 support at EL1. This is primarily a
concern for KVM, which disabled support for 32-bit guests on
such a system.
* Addition of new ID registers and fields as the architecture has
been extended.
- Perf and PMU drivers
* Minor fixes and cleanups to system PMU drivers.
- Hardware errata
* Unify KVM workarounds for VHE and nVHE configurations.
* Sort vendor errata entries in Kconfig.
- Secure Monitor Call Calling Convention (SMCCC)
* Update to the latest specification from Arm (v1.2).
* Allow PSCI code to query the SMCCC version.
- Software Delegated Exception Interface (SDEI)
* Unexport a bunch of unused symbols.
* Minor fixes to handling of firmware data.
- Pointer authentication
* Add support for dumping the kernel PAC mask in vmcoreinfo so
that the stack can be unwound by tools such as kdump.
* Simplification of key initialisation during CPU bringup.
- BPF backend
* Improve immediate generation for logical and add/sub
instructions.
- vDSO
- Minor fixes to the linker flags for consistency with other
architectures and support for LLVM's unwinder.
- Clean up logic to initialise and map the vDSO into userspace.
- ACPI
- Work around for an ambiguity in the IORT specification relating
to the "num_ids" field.
- Support _DMA method for all named components rather than only
PCIe root complexes.
- Minor other IORT-related fixes.
- Miscellaneous
* Initialise debug traps early for KGDB and fix KDB cacheflushing
deadlock.
* Minor tweaks to early boot state (documentation update, set
TEXT_OFFSET to 0x0, increase alignment of PE/COFF sections).
* Refactoring and cleanup
-----BEGIN PGP SIGNATURE-----
iQFEBAABCgAuFiEEPxTL6PPUbjXGY88ct6xw3ITBYzQFAl7U9csQHHdpbGxAa2Vy
bmVsLm9yZwAKCRC3rHDchMFjNLBHCACs/YU4SM7Om5f+7QnxIKao5DBr2CnGGvdC
yTfDghFDTLQVv3MufLlfno3yBe5G8sQpcZfcc+hewfcGoMzVZXu8s7LzH6VSn9T9
jmT3KjDMrg0RjSHzyumJp2McyelTk0a4FiKArSIIKsJSXUyb1uPSgm7SvKVDwEwU
JGDzL9IGilmq59GiXfDzGhTZgmC37QdwRoRxDuqtqWQe5CHoRXYexg87HwBKOQxx
HgU9L7ehri4MRZfpyjaDrr6quJo3TVnAAKXNBh3mZAskVS9ZrfKpEH0kYWYuqybv
znKyHRecl/rrGePV8RTMtrwnSdU26zMXE/omsVVauDfG9hqzqm+Q
=w3qi
-----END PGP SIGNATURE-----
Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 updates from Will Deacon:
"A sizeable pile of arm64 updates for 5.8.
Summary below, but the big two features are support for Branch Target
Identification and Clang's Shadow Call stack. The latter is currently
arm64-only, but the high-level parts are all in core code so it could
easily be adopted by other architectures pending toolchain support
Branch Target Identification (BTI):
- Support for ARMv8.5-BTI in both user- and kernel-space. This allows
branch targets to limit the types of branch from which they can be
called and additionally prevents branching to arbitrary code,
although kernel support requires a very recent toolchain.
- Function annotation via SYM_FUNC_START() so that assembly functions
are wrapped with the relevant "landing pad" instructions.
- BPF and vDSO updates to use the new instructions.
- Addition of a new HWCAP and exposure of BTI capability to userspace
via ID register emulation, along with ELF loader support for the
BTI feature in .note.gnu.property.
- Non-critical fixes to CFI unwind annotations in the sigreturn
trampoline.
Shadow Call Stack (SCS):
- Support for Clang's Shadow Call Stack feature, which reserves
platform register x18 to point at a separate stack for each task
that holds only return addresses. This protects function return
control flow from buffer overruns on the main stack.
- Save/restore of x18 across problematic boundaries (user-mode,
hypervisor, EFI, suspend, etc).
- Core support for SCS, should other architectures want to use it
too.
- SCS overflow checking on context-switch as part of the existing
stack limit check if CONFIG_SCHED_STACK_END_CHECK=y.
CPU feature detection:
- Removed numerous "SANITY CHECK" errors when running on a system
with mismatched AArch32 support at EL1. This is primarily a concern
for KVM, which disabled support for 32-bit guests on such a system.
- Addition of new ID registers and fields as the architecture has
been extended.
Perf and PMU drivers:
- Minor fixes and cleanups to system PMU drivers.
Hardware errata:
- Unify KVM workarounds for VHE and nVHE configurations.
- Sort vendor errata entries in Kconfig.
Secure Monitor Call Calling Convention (SMCCC):
- Update to the latest specification from Arm (v1.2).
- Allow PSCI code to query the SMCCC version.
Software Delegated Exception Interface (SDEI):
- Unexport a bunch of unused symbols.
- Minor fixes to handling of firmware data.
Pointer authentication:
- Add support for dumping the kernel PAC mask in vmcoreinfo so that
the stack can be unwound by tools such as kdump.
- Simplification of key initialisation during CPU bringup.
BPF backend:
- Improve immediate generation for logical and add/sub instructions.
vDSO:
- Minor fixes to the linker flags for consistency with other
architectures and support for LLVM's unwinder.
- Clean up logic to initialise and map the vDSO into userspace.
ACPI:
- Work around for an ambiguity in the IORT specification relating to
the "num_ids" field.
- Support _DMA method for all named components rather than only PCIe
root complexes.
- Minor other IORT-related fixes.
Miscellaneous:
- Initialise debug traps early for KGDB and fix KDB cacheflushing
deadlock.
- Minor tweaks to early boot state (documentation update, set
TEXT_OFFSET to 0x0, increase alignment of PE/COFF sections).
- Refactoring and cleanup"
* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (148 commits)
KVM: arm64: Move __load_guest_stage2 to kvm_mmu.h
KVM: arm64: Check advertised Stage-2 page size capability
arm64/cpufeature: Add get_arm64_ftr_reg_nowarn()
ACPI/IORT: Remove the unused __get_pci_rid()
arm64/cpuinfo: Add ID_MMFR4_EL1 into the cpuinfo_arm64 context
arm64/cpufeature: Add remaining feature bits in ID_AA64PFR1 register
arm64/cpufeature: Add remaining feature bits in ID_AA64PFR0 register
arm64/cpufeature: Add remaining feature bits in ID_AA64ISAR0 register
arm64/cpufeature: Add remaining feature bits in ID_MMFR4 register
arm64/cpufeature: Add remaining feature bits in ID_PFR0 register
arm64/cpufeature: Introduce ID_MMFR5 CPU register
arm64/cpufeature: Introduce ID_DFR1 CPU register
arm64/cpufeature: Introduce ID_PFR2 CPU register
arm64/cpufeature: Make doublelock a signed feature in ID_AA64DFR0
arm64/cpufeature: Drop TraceFilt feature exposure from ID_DFR0 register
arm64/cpufeature: Add explicit ftr_id_isar0[] for ID_ISAR0 register
arm64: mm: Add asid_gen_match() helper
firmware: smccc: Fix missing prototype warning for arm_smccc_version_init
arm64: vdso: Fix CFI directives in sigreturn trampoline
arm64: vdso: Don't prefix sigreturn trampoline with a BTI C instruction
...
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAl7VLcQRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1iFnhAArGBqco3C2RPQugv7UDDbKEaMvxOGrc5B
kwnyOS/k/yeIkfhT9u11oBuLcaj/Zgw8YCjFyRfaNsorRqnytLyZzZ6PvdCCE3YU
X3DVYgulcdAQnM4bS2e3Kt9ciJvFxB27XNm0AfuyLMUxMqCD+iIO4gJ6TuQNBYy3
dfUMfB1R9OUDW13GCrASe+p1Dw76uaqVngdFWJhnC8Rm49E6gFXq7CLQp5Cka81I
KZeJ8I6ug9p3gqhOIXdi+S6g5CM5jf86Wkk7dOHwHFH7CceFb3FIz7z0n1je4Wgd
L5rYX7+PwfNeZ73GIuvEBN+agJH2K0H/KmnlWNWeZHzc+J12MeruSdSMBIkBOEpn
iSbYAOmDpQLzBjTdZjC8bDqTZf472WrTh4VwN9NxHLucjdC+IqGoTAvnyyEOmZ5o
R7sv7Q++316CVwRhYVXbzwZcqtiinCDE1EkP5nKTo9z3z0kMF5+ce/k7wn5sgZIk
zJq3LXtaToiDoDRAPGxcvFPts9MdC0EI1aKTIjaK/n6i2h/SpJfrTKgANWaldYTe
XJIqlSB43saqf5YAQ3/sY+wnpCRBmmCU+sfKja4C8bH7RuggI3mZS19uhFs0Qctq
Yx5bIXVSBAIqjJtgzQ0WAAZ5LrCpNNyAzb35ZYefQlGyJlx1URKXVBmxa6S99biU
KiYX7Dk5uhQ=
=0ZQd
-----END PGP SIGNATURE-----
Merge tag 'x86-cleanups-2020-06-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 cleanups from Ingo Molnar:
"Misc cleanups, with an emphasis on removing obsolete/dead code"
* tag 'x86-cleanups-2020-06-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/spinlock: Remove obsolete ticket spinlock macros and types
x86/mm: Drop deprecated DISCONTIGMEM support for 32-bit
x86/apb_timer: Drop unused declaration and macro
x86/apb_timer: Drop unused TSC calibration
x86/io_apic: Remove unused function mp_init_irq_at_boot()
x86/mm: Stop printing BRK addresses
x86/audit: Fix a -Wmissing-prototypes warning for ia32_classify_syscall()
x86/nmi: Remove edac.h include leftover
mm: Remove MPX leftovers
x86/mm/mmap: Fix -Wmissing-prototypes warnings
x86/early_printk: Remove unused includes
crash_dump: Remove no longer used saved_max_pfn
x86/smpboot: Remove the last ICPU() macro
This makes it easier to enable all KUnit fragments.
Adding 'if !KUNIT_ALL_TESTS' so individual tests can not be turned off.
Therefore if KUNIT_ALL_TESTS is enabled that will hide the prompt in
menuconfig.
Reviewed-by: David Gow <davidgow@google.com>
Signed-off-by: Anders Roxell <anders.roxell@linaro.org>
Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
of local_lock_t - this primitive comes from the -rt project and identifies
CPU-local locking dependencies normally handled opaquely beind preempt_disable()
or local_irq_save/disable() critical sections.
The generated code on mainline kernels doesn't change as a result, but still there
are benefits: improved debugging and better documentation of data structure
accesses.
The new local_lock_t primitives are introduced and then utilized in a couple of
kernel subsystems. No change in functionality is intended.
There's also other smaller changes and cleanups.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAl7VAogRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1h67BAAusYb44jJyZUE74rmaLnJr0c6j7eJ6twT
8LKRwxb21Y35DMuX6M5ewmvnHiLFYmjL728z+y8O+SP8vb4PSJBX/75X+wsawIJB
cjHdxonyynVVC4zcbdrc37FsrOiVoKLbbZcpqRzHksKkCq2PHbFVxBNvEaKHZCWW
1jnq0MRy9wEJtW9EThDWPLD+OPWhBvocUFYJH4fiqCIaDiip/E16fz3i+yMPt545
Jz4Ibnsq+G5Ehm1N2AkaZuK9V9nYv85E7Z/UNiK4mkDOApE6OMS+q3d86BhqgPg5
g/HL3HNXAtIY74tBYAac5tAQglT+283LuTpEPt9BEjNM7QxKg/ecXO7lwtn7Boku
dACMqeuMHbLyru8uhbun/VBx1gca7HIhW1cvXO5OoR7o78fHpEFivjJ0B0OuSYAI
y+/DsA41OlkWSEnboUs+zTQgFatqxQPke92xpGOJtjVVZRYHRqxcPtw9WFmoVqWA
HeczDQLcSUhqbKSfr6X9BO2u3qxys5BzmImTKMqXEQ4d8Kk0QXbJgGYGfS8+ASey
Am/jwUP3Cvzs99NxLH5gECKRSuTx3rY7nRGaIBYa+Ui575bdSF8sVAF13riB2mBp
NJq2Pw0D36WcX7ecaC2Fk2ezkphbeuAr8E7gh/Mt/oVxjrfwRGfPMrnIwKygUydw
1W5x+WZ+WsY=
=TBTY
-----END PGP SIGNATURE-----
Merge tag 'locking-core-2020-06-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking updates from Ingo Molnar:
"The biggest change to core locking facilities in this cycle is the
introduction of local_lock_t - this primitive comes from the -rt
project and identifies CPU-local locking dependencies normally handled
opaquely beind preempt_disable() or local_irq_save/disable() critical
sections.
The generated code on mainline kernels doesn't change as a result, but
still there are benefits: improved debugging and better documentation
of data structure accesses.
The new local_lock_t primitives are introduced and then utilized in a
couple of kernel subsystems. No change in functionality is intended.
There's also other smaller changes and cleanups"
* tag 'locking-core-2020-06-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
zram: Use local lock to protect per-CPU data
zram: Allocate struct zcomp_strm as per-CPU memory
connector/cn_proc: Protect send_msg() with a local lock
squashfs: Make use of local lock in multi_cpu decompressor
mm/swap: Use local_lock for protection
radix-tree: Use local_lock for protection
locking: Introduce local_lock()
locking/lockdep: Replace zero-length array with flexible-array
locking/rtmutex: Remove unused rt_mutex_cmpxchg_relaxed()
Fix kerneldoc warnings and some coding style inconsistencies.
This mirrors the similar cleanups being done in fs/crypto/.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQSacvsUNc7UX4ntmEPzXCl4vpKOKwUCXtSdTBQcZWJpZ2dlcnNA
Z29vZ2xlLmNvbQAKCRDzXCl4vpKOK8m/AP9+n5FpIxE2X6aYTVLweKIQ2bqfO/5K
5WyPlW5zdMEDyQD+OT8bjqVTDxTI0/c+MBOidwvJF6kUyZyVze3M0pE7OQg=
=b+RP
-----END PGP SIGNATURE-----
Merge tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt
Pull fsverity updates from Eric Biggers:
"Fix kerneldoc warnings and some coding style inconsistencies.
This mirrors the similar cleanups being done in fs/crypto/"
* tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt:
fs-verity: remove unnecessary extern keywords
fs-verity: fix all kerneldoc warnings
- Add the IV_INO_LBLK_32 encryption policy flag which modifies the
encryption to be optimized for eMMC inline encryption hardware.
- Make the test_dummy_encryption mount option for ext4 and f2fs support
v2 encryption policies.
- Fix kerneldoc warnings and some coding style inconsistencies.
There will be merge conflicts with the ext4 and f2fs trees due to the
test_dummy_encryption change, but the resolutions are straightforward.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQSacvsUNc7UX4ntmEPzXCl4vpKOKwUCXtScMBQcZWJpZ2dlcnNA
Z29vZ2xlLmNvbQAKCRDzXCl4vpKOKxC6AP0eOEkMrc9e10YftdN6xsyRjvqiPyFg
oMjuU+SvQ+/sVgEAo0mBFITnl75ZGb8PyqXCNMDAy6uHaxcEjVGufx5q2QE=
=dbxy
-----END PGP SIGNATURE-----
Merge tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt
Pull fscrypt updates from Eric Biggers:
- Add the IV_INO_LBLK_32 encryption policy flag which modifies the
encryption to be optimized for eMMC inline encryption hardware.
- Make the test_dummy_encryption mount option for ext4 and f2fs support
v2 encryption policies.
- Fix kerneldoc warnings and some coding style inconsistencies.
* tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt:
fscrypt: add support for IV_INO_LBLK_32 policies
fscrypt: make test_dummy_encryption use v2 by default
fscrypt: support test_dummy_encryption=v2
fscrypt: add fscrypt_add_test_dummy_key()
linux/parser.h: add include guards
fscrypt: remove unnecessary extern keywords
fscrypt: name all function parameters
fscrypt: fix all kerneldoc warnings
- refactor pstore locking for safer module unloading (Kees Cook)
- remove orphaned records from pstorefs when backend unloaded (Kees Cook)
- refactor dump_oops parameter into max_reason (Pavel Tatashin)
- introduce pstore/zone for common code for contiguous storage (WeiXiong Liao)
- introduce pstore/blk for block device backend (WeiXiong Liao)
- introduce mtd backend (WeiXiong Liao)
-----BEGIN PGP SIGNATURE-----
iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAl7UbYYWHGtlZXNjb29r
QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJpkgD/9/09OkJIWydwk2lr2T89HW5fSF
5uBT0a309/QDUpnV9yhcRsrESEicnvbtaGxD0kuYIInkiW/2cj1l689EkyRjUmy9
q3z4GzLqOlC7qvd7LUPFNGHmllBb09H/CxmXDxRP3aynB9oHzdpNQdPcpLBDA00r
0byp/AE48dFbKIhtT0QxpGUYZFOlyc7XVAaOkED4bmu148gx8q7MU1AxFgbx0Feb
9iPV0r6XYMgXJZ3sn/3PJsxF0V/giDSJ8ui2xsYRjCE408zVIYLdDs2e8dz+2yW6
+3Lyankgo+ofZc4XYExTYgn3WjhPFi+pjVRUaj+BcyTk9SLNIj2WmZdmcLMuzanh
BaUurmED7ffTtlsH4PhQgn8/OY4FX2PO2MwUHwlU+87Y8YDiW0lpzTq5H822OO8p
QQ8awql/6lLCJuyzuWIciVUsS65MCPxsZ4+LSiMZzyYpWu1sxrEY8ic3agzCgsA0
0i+4nZFlLG+Aap/oiKpegenkIyAunn2tDXAyFJFH6qLOiZJ78iRuws3XZqjCElhJ
XqvyDJIfjkJhWUb++ckeqX7ThOR4CPSnwba/7GHv7NrQWuk3Cn+GQ80oxydXUY6b
2/4eYjq0wtvf9NeuJ4/LYNXotLR/bq9zS0zqwTWG50v+RPmuC3bNJB+RmF7fCiCG
jo1Sd1LMeTQ7bnULpA==
=7s1u
-----END PGP SIGNATURE-----
Merge tag 'pstore-v5.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull pstore updates from Kees Cook:
"Fixes and new features for pstore.
This is a pretty big set of changes (relative to past pstore pulls),
but it has been in -next for a while. The biggest change here is the
ability to support a block device as a pstore backend, which has been
desired for a while. A lot of additional fixes and refactorings are
also included, mostly in support of the new features.
- refactor pstore locking for safer module unloading (Kees Cook)
- remove orphaned records from pstorefs when backend unloaded (Kees
Cook)
- refactor dump_oops parameter into max_reason (Pavel Tatashin)
- introduce pstore/zone for common code for contiguous storage
(WeiXiong Liao)
- introduce pstore/blk for block device backend (WeiXiong Liao)
- introduce mtd backend (WeiXiong Liao)"
* tag 'pstore-v5.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (35 commits)
mtd: Support kmsg dumper based on pstore/blk
pstore/blk: Introduce "best_effort" mode
pstore/blk: Support non-block storage devices
pstore/blk: Provide way to query pstore configuration
pstore/zone: Provide way to skip "broken" zone for MTD devices
Documentation: Add details for pstore/blk
pstore/zone,blk: Add ftrace frontend support
pstore/zone,blk: Add console frontend support
pstore/zone,blk: Add support for pmsg frontend
pstore/blk: Introduce backend for block devices
pstore/zone: Introduce common layer to manage storage zones
ramoops: Add "max-reason" optional field to ramoops DT node
pstore/ram: Introduce max_reason and convert dump_oops
pstore/platform: Pass max_reason to kmesg dump
printk: Introduce kmsg_dump_reason_str()
printk: honor the max_reason field in kmsg_dumper
printk: Collapse shutdown types into a single dump reason
pstore/ftrace: Provide ftrace log merging routine
pstore/ram: Refactor ftrace buffer merging
pstore/ram: Refactor DT size parsing
...
Pull crypto updates from Herbert Xu:
"API:
- Introduce crypto_shash_tfm_digest() and use it wherever possible.
- Fix use-after-free and race in crypto_spawn_alg.
- Add support for parallel and batch requests to crypto_engine.
Algorithms:
- Update jitter RNG for SP800-90B compliance.
- Always use jitter RNG as seed in drbg.
Drivers:
- Add Arm CryptoCell driver cctrng.
- Add support for SEV-ES to the PSP driver in ccp"
* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (114 commits)
crypto: hisilicon - fix driver compatibility issue with different versions of devices
crypto: engine - do not requeue in case of fatal error
crypto: cavium/nitrox - Fix a typo in a comment
crypto: hisilicon/qm - change debugfs file name from qm_regs to regs
crypto: hisilicon/qm - add DebugFS for xQC and xQE dump
crypto: hisilicon/zip - add debugfs for Hisilicon ZIP
crypto: hisilicon/hpre - add debugfs for Hisilicon HPRE
crypto: hisilicon/sec2 - add debugfs for Hisilicon SEC
crypto: hisilicon/qm - add debugfs to the QM state machine
crypto: hisilicon/qm - add debugfs for QM
crypto: stm32/crc32 - protect from concurrent accesses
crypto: stm32/crc32 - don't sleep in runtime pm
crypto: stm32/crc32 - fix multi-instance
crypto: stm32/crc32 - fix run-time self test issue.
crypto: stm32/crc32 - fix ext4 chksum BUG_ON()
crypto: hisilicon/zip - Use temporary sqe when doing work
crypto: hisilicon - add device error report through abnormal irq
crypto: hisilicon - remove codes of directly report device errors through MSI
crypto: hisilicon - QM memory management optimization
crypto: hisilicon - unify initial value assignment into QM
...
sh5 never became a product and has probably never really worked.
Remove it by recursively deleting all associated Kconfig options
and all corresponding files.
Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Rich Felker <dalias@libc.org>
It make no sense to check the caps when reconnecting to mds. And
for the async dirop caps, they will be put by its _cb() function,
so when releasing the requests, it will make no sense too.
URL: https://tracker.ceph.com/issues/45635
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
send_mds_reconnect takes the s_mutex while the mdsc->mutex is already
held. That inverts the locking order documented in mds_client.h. Drop
the mdsc->mutex, acquire the s_mutex and then reacquire the mdsc->mutex
to prevent a deadlock.
URL: https://tracker.ceph.com/issues/45609
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Similarly to commit 03f219041f ("ceph: check i_nlink while converting
a file handle to dentry"), this fixes another corner case with
name_to_handle_at/open_by_handle_at. The issue has been detected by
xfstest generic/467, when doing:
- name_to_handle_at("/cephfs/myfile")
- open("/cephfs/myfile")
- unlink("/cephfs/myfile")
- sync; sync;
- drop caches
- open_by_handle_at()
The call to open_by_handle_at should not fail because the file hasn't been
deleted yet (only unlinked) and we do have a valid handle to it. -ESTALE
shall be returned only if i_nlink is 0 *and* i_count is 1.
This patch also makes sure we have LINK caps before checking i_nlink.
Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Acked-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Returning -EXDEV when trying to 'mv' files/directories from different
quota realms results in copy+unlink operations instead of the faster
CEPH_MDS_OP_RENAME. This will occur even when there aren't any quotas
set in the destination directory, or if there's enough space left for
the new file(s).
This patch adds a new helper function to be called on rename operations
which will allow these operations if they can be executed. This patch
mimics userland fuse client commit b8954e5734b3 ("client:
optimize rename operation under different quota root").
Since ceph_quota_is_same_realm() is now called only from this new
helper, make it static.
URL: https://tracker.ceph.com/issues/44791
Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Function check_quota_exceeded() uses delta parameter only for the
QUOTA_CHECK_MAX_BYTES_OP operation. Using this parameter also for
MAX_FILES will makes the code cleaner and will be required to support
cross-quota-tree renames.
Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
The mdsc->cap_dirty_lock is not held while walking the list in
ceph_kick_flushing_caps, which is not safe.
ceph_early_kick_flushing_caps does something similar, but the
s_mutex is held while it's called and I think that guards against
changes to the list.
Ensure we hold the s_mutex when calling ceph_kick_flushing_caps,
and add some clarifying comments.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
When flushing a lot of caps to the MDS's at once (e.g. for syncfs),
we can end up waiting a substantial amount of time for MDS replies, due
to the fact that it may delay some of them so that it can batch them up
together in a single journal transaction. This can lead to stalls when
calling sync or syncfs.
What we'd really like to do is request expedited service on the _last_
cap we're flushing back to the server. If the CHECK_CAPS_FLUSH flag is
set on the request and the current inode was the last one on the
session->s_cap_dirty list, then mark the request with
CEPH_CLIENT_CAPS_SYNC.
Note that this heuristic is not perfect. New inodes can race onto the
list after we've started flushing, but it does seem to fix some common
use cases.
URL: https://tracker.ceph.com/issues/44744
Reported-by: Jan Fajerski <jfajerski@suse.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
This is a per-sb list now, but that makes it difficult to tell when
the cap is the last dirty one associated with the session. Switch
this to be a per-session list, but continue using the
mdsc->cap_dirty_lock to protect the lists.
This list is only ever walked in ceph_flush_dirty_caps, so change that
to walk the sessions array and then flush the caps for inodes on each
session's list.
If the auth cap ever changes while the inode has dirty caps, then
move the inode to the appropriate session for the new auth_cap. Also,
ensure that we never remove an auth cap while the inode is still on the
s_cap_dirty list.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
write can stuck at waiting for larger max_size in following sequence of
events:
- client opens a file and writes to position 'A' (larger than unit of
max size increment)
- client closes the file handle and updates wanted caps (not wanting
file write caps)
- client opens and truncates the file, writes to position 'A' again.
At the 1st event, client set inode's requested_max_size to 'A'. At the
2nd event, mds removes client's writable range, but client does not reset
requested_max_size. At the 3rd event, client does not request max size
because requested_max_size is already larger than 'A'.
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Nothing ensures that session will still be valid by the time we
dereference the pointer. Take and put a reference.
In principle, we should always be able to get a reference here, but
throw a warning if that's ever not the case.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Just take it before calling it. This means we have to do a couple of
minor in-memory operations under the spinlock now, but those shouldn't
be an issue.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
There's no reason to do this here. Just have the caller handle it.
Also, add a lockdep assertion.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
This function takes a mdsc argument or ci argument, but if both are
passed in, it ignores the ci arg. Fortunately, nothing does that, but
there's no good reason to have the same function handle both cases.
Also, get rid of some branches and just use |= to set the wake_* vals.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Get rid of the __releases annotation by breaking it up into two
functions: __prep_cap which is done under the spinlock and __send_cap
that is done outside it. Add new fields to cap_msg_args for the wake
boolean and old_xattr_buf pointer.
Nothing checks the return value from __send_cap, so make it void
return.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Add a new "r_ended" field to struct ceph_mds_request and use that to
maintain the average latency of MDS requests.
URL: https://tracker.ceph.com/issues/43215
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Calculate the latency for OSD read requests. Add a new r_end_stamp
field to struct ceph_osd_request that will hold the time of that
the reply was received. Use that to calculate the RTT for each call,
and divide the sum of those by number of calls to get averate RTT.
Keep a tally of RTT for OSD writes and number of calls to track average
latency of OSD writes.
URL: https://tracker.ceph.com/issues/43215
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Count hits and misses in the caps cache. If the client has all of
the necessary caps when a task needs references, then it's counted
as a hit. Any other situation is a miss.
URL: https://tracker.ceph.com/issues/43215
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
For dentry leases, only count the hit/miss info triggered from the vfs
calls. For the cases like request reply handling and ceph_trim_dentries,
ignore them.
For now, these are only viewable using debugfs. Future patches will
allow the client to send the stats to the MDS.
The output looks like:
item total miss hit
-------------------------------------------------
d_lease 11 7 141
URL: https://tracker.ceph.com/issues/43215
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Joe Perches pointed out that we were missing a newline
at the end of two debug messages
Reported-by: Joe Perches <joe@perches.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Use pr_fmt to standardize all logging for fs/cifs.
Some logging output had no CIFS: specific prefix.
Now all output has one of three prefixes:
o CIFS:
o CIFS: VFS:
o Root-CIFS:
Miscellanea:
o Convert printks to pr_<level>
o Neaten macro definitions
o Remove embedded CIFS: prefixes from formats
o Convert "illegal" to "invalid"
o Coalesce formats
o Add missing '\n' format terminations
o Consolidate multiple cifs_dbg continuations into single calls
o More consistent use of upper case first word output logging
o Multiline statement argument alignment and wrapping
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
In order to handle workloads where it is important to make sure that
a buggy app did not delete content on the drive, the new mount option
"nodelete" allows standard permission checks on the server to work,
but prevents on the client any attempts to unlink a file or delete
a directory on that mount point. This can be helpful when running
a little understood app on a network mount that contains important
content that should not be deleted.
Signed-off-by: Steve French <stfrench@microsoft.com>
CC: Stable <stable@vger.kernel.org>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Move some large data structures off the stack and into dynamically
allocated memory in the function smb2_ioctl_query_info
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Move a lot of structures and arrays off the stack and into a dynamically
allocated structure instead.
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
The target iterator parameter "it" is not used in
reconn_setup_dfs_targets(), so just remove it.
Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
In order to support reconnect to hostnames that resolve to same ip
address, besides relying on the currently set hostname to match DFS
targets, attempt to resolve the targets and then match their addresses
with the reconnected server ip address.
For instance, if we have two hostnames "FOO" and "BAR", and both
resolve to the same ip address, we would be able to handle failover in
DFS paths like
\\FOO\dfs\link1 -> [ \BAZ\share2 (*), \BAR\share1 ]
\\FOO\dfs\link2 -> [ \BAZ\share2 (*), \FOO\share1 ]
so when "BAZ" is no longer accessible, link1 and link2 would get
reconnected despite having different target hostnames.
Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
If we mount a very specific DFS link
\\FS0.FOO.COM\dfs\link -> \FS0\share1, \FS1\share2
where its target list contains NB names ("FS0" & "FS1") rather than
FQDN ones ("FS0.FOO.COM" & "FS1.FOO.COM"), we end up connecting to
\FOO\share1 but server->hostname will have "FOO.COM". The reason is
because both "FS0" and "FS0.FOO.COM" resolve to same IP address and
they share same TCP server connection, but "FS0.FOO.COM" was the first
hostname set -- which is OK.
However, if the echo thread timeouts and we still have a good
connection to "FS0", in cifs_reconnect()
rc = generic_ip_connect(server) -> success
if (rc) {
...
reconn_inval_dfs_target(server, cifs_sb, &tgt_list,
&tgt_it);
...
}
...
it successfully reconnects to "FS0" server but does not set up next
DFS target - which should be the same target server "\FS0\share1" -
and server->hostname remains set to "FS0.FOO.COM" rather than "FS0",
as reconn_inval_dfs_target() would have it set to "FS0" if called
earlier.
Finally, in __smb2_reconnect(), the reconnect of tcons would fail
because tcon->ses->server->hostname (FS0.FOO.COM) does not match DFS
target's hostname (FS0).
Fix that by calling reconn_inval_dfs_target() before
generic_ip_connect() so server->hostname will get updated correctly
prior to reconnecting its tcons in __smb2_reconnect().
With "cifs: handle hostnames that resolve to same ip in failover"
patch
- The above problem would not occur.
- We could save an DNS query to find out that they both resolve to
the same ip address.
Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
The variable rc is being initialized with a value that is never read
and it is being updated later with a new value. The initialization is
redundant and can be removed.
Addresses-Coverity: ("Unused value")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
The "nolease" mount option is only supported for SMB2+ mounts.
Fail with appropriate error message if vers=1.0 option is passed.
Signed-off-by: Kenneth D'souza <kdsouza@redhat.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
In order to use arbitrary block devices as a pstore backend, provide a
new module param named "best_effort", which will allow using any block
device, even if it has not provided a panic_write callback.
Link: https://lore.kernel.org/lkml/20200511233229.27745-12-keescook@chromium.org/
Signed-off-by: Kees Cook <keescook@chromium.org>
Add support for non-block devices (e.g. MTD). A non-block driver calls
pstore_blk_register_device() to register iself.
In addition, pstore/zone is updated to handle non-block devices,
where an erase must be done before a write. Without this, there is no
way to remove records stored to an MTD.
Signed-off-by: WeiXiong Liao <liaoweixiong@allwinnertech.com>
Link: https://lore.kernel.org/lkml/20200511233229.27745-10-keescook@chromium.org/
Co-developed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
In order to configure itself, the MTD backend needs to be able to query
the current pstore configuration. Introduce pstore_blk_get_config() for
this purpose.
Signed-off-by: WeiXiong Liao <liaoweixiong@allwinnertech.com>
Link: https://lore.kernel.org/lkml/20200511233229.27745-9-keescook@chromium.org/
Co-developed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
One requirement to support MTD devices in pstore/zone is having a
way to declare certain regions as broken. Add this support to
pstore/zone.
The MTD driver should return -ENOMSG when encountering a bad region,
which tells pstore/zone to skip and try the next one.
Signed-off-by: WeiXiong Liao <liaoweixiong@allwinnertech.com>
Link: https://lore.kernel.org/lkml/20200511233229.27745-8-keescook@chromium.org/
Co-developed-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Link: //lore.kernel.org/lkml/20200512173801.222666-1-colin.king@canonical.com
Signed-off-by: Kees Cook <keescook@chromium.org>
xdp_umem.c had overlapping changes between the 64-bit math fix
for the calculation of npgs and the removal of the zerocopy
memory type which got rid of the chunk_size_nohdr member.
The mlx5 Kconfig conflict is a case where we just take the
net-next copy of the Kconfig entry dependency as it takes on
the ESWITCH dependency by one level of indirection which is
what the 'net' conflicting change is trying to ensure.
Signed-off-by: David S. Miller <davem@davemloft.net>
As a prelude to implementing asynchronous fileserver operations in the afs
filesystem, rename struct afs_fs_cursor to afs_operation.
This struct is going to form the core of the operation management and is
going to acquire more members in later.
Signed-off-by: David Howells <dhowells@redhat.com>
Set a flag in the call struct to indicate an unmarshalling error rather
than return and handle an error from the decoding of file statuses. This
flag is checked on a successful return from the delivery function.
Signed-off-by: David Howells <dhowells@redhat.com>
afs_vol_interest objects represent the volume IDs currently being accessed
from a fileserver. These hold lists of afs_cb_interest objects that
repesent the superblocks using that volume ID on that server.
When a callback notification from the server telling of a modification by
another client arrives, the volume ID specified in the notification is
looked up in the server's afs_vol_interest list. Through the
afs_cb_interest list, the relevant superblocks can be iterated over and the
specific inode looked up and marked in each one.
Make the following efficiency improvements:
(1) Hold rcu_read_lock() over the entire processing rather than locking it
each time.
(2) Do all the callbacks for each vid together rather than individually.
Each volume then only needs to be looked up once.
(3) afs_vol_interest objects are now stored in an rb_tree rather than a
flat list to reduce the lookup step count.
(4) afs_vol_interest lookup is now done with RCU, but because it's in an
rb_tree which may rotate under us, a seqlock is used so that if it
changes during the walk, we repeat the walk with a lock held.
With this and the preceding patch which adds RCU-based lookups in the inode
cache, target volumes/vnodes can be taken without the need to take any
locks, except on the target itself.
Signed-off-by: David Howells <dhowells@redhat.com>
Show more information in /proc/net/afs/servers to make it easier to see
what's going on with the server probing.
Signed-off-by: David Howells <dhowells@redhat.com>
When an AFS client accesses a file, it receives a limited-duration callback
promise that the server will notify it if another client changes a file.
This callback duration can be a few hours in length.
If a client mounts a volume and then an application prevents it from being
unmounted, say by chdir'ing into it, but then does nothing for some time,
the rxrpc_peer record will expire and rxrpc-level keepalive will cease.
If there is NAT or a firewall between the client and the server, the route
back for the server may close after a comparatively short duration, meaning
that attempts by the server to notify the client may then bounce.
The client, however, may (so far as it knows) still have a valid unexpired
promise and will then rely on its cached data and will not see changes made
on the server by a third party until it incidentally rechecks the status or
the promise needs renewal.
To deal with this, the client needs to regularly probe the server. This
has two effects: firstly, it keeps a route open back for the server, and
secondly, it causes the server to disgorge any notifications that got
queued up because they couldn't be sent.
Fix this by adding a mechanism to emit regular probes.
Two levels of probing are made available: Under normal circumstances the
'slow' queue will be used for a fileserver - this just probes the preferred
address once every 5 mins or so; however, if server fails to respond to any
probes, the server will shift to the 'fast' queue from which all its
interfaces will be probed every 30s. When it finally responds, the record
will switch back to the slow queue.
Further notes:
(1) Probing is now no longer driven from the fileserver rotation
algorithm.
(2) Probes are dispatched to all interfaces on a fileserver when that an
afs_server object is set up to record it.
(3) The afs_server object is removed from the probe queues when we start
to probe it. afs_is_probing_server() returns true if it's not listed
- ie. it's undergoing probing.
(4) The afs_server object is added back on to the probe queue when the
final outstanding probe completes, but the probed_at time is set when
we're about to launch a probe so that it's not dependent on the probe
duration.
(5) The timer and the work item added for this must be handed a count on
net->servers_outstanding, which they hand on or release. This makes
sure that network namespace cleanup waits for them.
Fixes: d2ddc776a4 ("afs: Overhaul volume and server record caching and fileserver rotation")
Reported-by: Dave Botsch <botsch@cnf.cornell.edu>
Signed-off-by: David Howells <dhowells@redhat.com>
Split the usage count on the afs_server struct to have an active count that
registers who's actually using it separately from the reference count on
the object.
This allows a future patch to dispatch polling probes without advancing the
"unuse" time into the future each time we emit a probe, which would
otherwise prevent unused server records from expiring.
Included in this:
(1) The latter part of afs_destroy_server() in which the RCU destruction
of afs_server objects is invoked and the outstanding server count is
decremented is split out into __afs_put_server().
(2) afs_put_server() now calls __afs_put_server() rather then setting the
management timer.
(3) The calls begun by afs_fs_give_up_all_callbacks() and
afs_fs_get_capabilities() can now take a ref on the server record, so
afs_destroy_server() can just drop its ref and needn't wait for the
completion of these calls. They'll put the ref when they're done.
(4) Because of (3), afs_fs_probe_done() no longer needs to wake up
afs_destroy_server() with server->probe_outstanding.
(5) afs_gc_servers can be simplified. It only needs to check if
server->active is 0 rather than playing games with the refcount.
(6) afs_manage_servers() can propose a server for gc if usage == 0 rather
than if ref == 1. The gc is effected by (5).
Signed-off-by: David Howells <dhowells@redhat.com>
The U-version VLDB volume record retrieved by the VL.GetEntryByNameU rpc op
carries a change counter (the serverUnique field) for each fileserver
listed in the record as backing that volume. This is incremented whenever
the registration details for a fileserver change (such as its address
list). Note that the same value will be seen in all UVLDB records that
refer to that fileserver.
This should be checked before calling the VL server to re-query the address
list for a fileserver. If it's the same, there's no point doing the query.
Reported-by: Jeffrey Altman <jaltman@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
When a lookup is done in an AFS directory, the filesystem will speculate
and fetch up to 49 other statuses for files in the same directory and fetch
those as well, turning them into inodes or updating inodes that already
exist.
However, occasionally, a callback break might go missing due to NAT timing
out, but the afs filesystem doesn't then realise that the directory is not
up to date.
Alleviate this by using one of the status slots to check the directory in
which the lookup is being done.
Reported-by: Dave Botsch <botsch@cnf.cornell.edu>
Suggested-by: Jeffrey Altman <jaltman@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Make the inode hash table RCU searchable so that searches that want to
access or modify an inode without taking a ref on that inode can do so
without taking the inode hash table lock.
The main thing this requires is some RCU annotation on the list
manipulation operations. Inodes are already freed by RCU in most cases.
Users of this interface must take care as the inode may be still under
construction or may be being torn down around them.
There are at least three instances where this can be of use:
(1) Testing whether the inode number iunique() is going to return is
currently unique (the iunique_lock is still held).
(2) Ext4 date stamp updating.
(3) AFS callback breaking.
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
cc: linux-ext4@vger.kernel.org
cc: linux-afs@lists.infradead.org
Add details on using pstore/blk, the new backend of pstore to record
dumps to block devices, in Documentation/admin-guide/pstore-blk.rst
Signed-off-by: WeiXiong Liao <liaoweixiong@allwinnertech.com>
Link: https://lore.kernel.org/lkml/20200511233229.27745-7-keescook@chromium.org/
Signed-off-by: Kees Cook <keescook@chromium.org>
Support backend for console. To enable console backend, just make
console_size be greater than 0 and a multiple of 4096.
Signed-off-by: WeiXiong Liao <liaoweixiong@allwinnertech.com>
Link: https://lore.kernel.org/lkml/20200511233229.27745-5-keescook@chromium.org/
Signed-off-by: Kees Cook <keescook@chromium.org>
pstore/blk is similar to pstore/ram, but uses a block device as the
storage rather than persistent ram.
The pstore/blk backend solves two common use-cases that used to preclude
using pstore/ram:
- not all devices have a battery that could be used to persist
regular RAM across power failures.
- most embedded intelligent equipment have no persistent ram, which
increases costs, instead preferring cheaper solutions, like block
devices.
pstore/blk provides separate configurations for the end user and for the
block drivers. User configuration determines how pstore/blk operates, such
as record sizes, max kmsg dump reasons, etc. These can be set by Kconfig
and/or module parameters, but module parameter have priority over Kconfig.
Driver configuration covers all the details about the target block device,
such as total size of the device and how to perform read/write operations.
These are provided by block drivers, calling pstore_register_blkdev(),
including an optional panic_write callback used to bypass regular IO
APIs in an effort to avoid potentially destabilized kernel code during
a panic.
Signed-off-by: WeiXiong Liao <liaoweixiong@allwinnertech.com>
Link: https://lore.kernel.org/lkml/20200511233229.27745-3-keescook@chromium.org/
Co-developed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Implement a common set of APIs needed to support pstore storage zones,
based on how ramoops is designed. This will be used by pstore/blk with
the intention of migrating pstore/ram in the future.
Signed-off-by: WeiXiong Liao <liaoweixiong@allwinnertech.com>
Link: https://lore.kernel.org/lkml/20200511233229.27745-2-keescook@chromium.org/
Co-developed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Now that pstore_register() can correctly pass max_reason to the kmesg
dump facility, introduce a new "max_reason" module parameter and
"max-reason" Device Tree field.
The "dump_oops" module parameter and "dump-oops" Device
Tree field are now considered deprecated, but are now automatically
converted to their corresponding max_reason values when present, though
the new max_reason setting has precedence.
For struct ramoops_platform_data, the "dump_oops" member is entirely
replaced by a new "max_reason" member, with the only existing user
updated in place.
Additionally remove the "reason" filter logic from ramoops_pstore_write(),
as that is not specifically needed anymore, though technically
this is a change in behavior for any ramoops users also setting the
printk.always_kmsg_dump boot param, which will cause ramoops to behave as
if max_reason was set to KMSG_DUMP_MAX.
Co-developed-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Link: https://lore.kernel.org/lkml/20200515184434.8470-6-keescook@chromium.org/
Signed-off-by: Kees Cook <keescook@chromium.org>
Add a new member to struct pstore_info for passing information about
kmesg dump maximum reason. This allows a finer control of what kmesg
dumps are sent to pstore storage backends.
Those backends that do not explicitly set this field (keeping it equal to
0), get the default behavior: store only Oopses and Panics, or everything
if the printk.always_kmsg_dump boot param is set.
Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Link: https://lore.kernel.org/lkml/20200515184434.8470-5-keescook@chromium.org/
Co-developed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
The pstore subsystem already had a private version of this function.
With the coming addition of the pstore/zone driver, this needs to be
shared. As it really should live with printk, move it there instead.
Link: https://lore.kernel.org/lkml/20200515184434.8470-4-keescook@chromium.org/
Acked-by: Petr Mladek <pmladek@suse.com>
Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
This changes the ftrace record merging code to be agnostic of
pstore/ram, as the first step to making it available as a generic
routine for other backends to use, such as pstore/zone.
Link: https://lore.kernel.org/lkml/20200510202436.63222-6-keescook@chromium.org/
Signed-off-by: Kees Cook <keescook@chromium.org>