Now that all set/get helpers use the eb from the token, we don't need to
pass it to many btrfs_token_*/btrfs_set_token_* helpers, saving some
stack space.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The token stores a copy of the extent buffer pointer but does not make
any use of it besides sanity checks. We can use it and drop the eb
parameter from several functions, this patch only switches the use
inside the set/get helpers.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
disk-io.h is included more than once in block-group.c, remove it.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn>
Signed-off-by: David Sterba <dsterba@suse.com>
The name of this function contains the word "cache", which is left from
the times where btrfs_block_group was called btrfs_block_group_cache.
Now this "cache" doesn't match anything, and we have better namings for
functions like read/insert/remove_block_group_item().
Rename it to update_block_group_item().
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently the block group item insert is pretty straight forward, fill
the block group item structure and insert it into extent tree.
However the incoming skinny block group feature is going to change this,
so this patch will refactor insertion into a new function,
insert_block_group_item(), to make the incoming feature easier to add.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When deleting a block group item, it's pretty straight forward, just
delete the item pointed by the key. However it will not be that
straight-forward for incoming skinny block group item.
So refactor the block group item deletion into a new function,
remove_block_group_item(), also to make the already lengthy
btrfs_remove_block_group() a little shorter.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Structure btrfs_block_group has the following members which are
currently read from on-disk block group item and key:
- length - from item key
- used
- flags - from block group item
However for incoming skinny block group tree, we are going to read those
members from different sources.
This patch will refactor such read by:
- Don't initialize btrfs_block_group::length at allocation
Caller should initialize them manually.
Also to avoid possible (well, only two callers) missing
initialization, add extra ASSERT() in btrfs_add_block_group_cache().
- Refactor length/used/flags initialization into one function
The new function, fill_one_block_group() will handle the
initialization of such members.
- Use btrfs_block_group::length to replace key::offset
Since skinny block group item would have a different meaning for its
key offset.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Regular block group items in extent tree are scattered inside the huge
tree, thus forward readahead makes no sense.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Whenever a chown is executed, all capabilities of the file being touched
are lost. When doing incremental send with a file with capabilities,
there is a situation where the capability can be lost on the receiving
side. The sequence of actions bellow shows the problem:
$ mount /dev/sda fs1
$ mount /dev/sdb fs2
$ touch fs1/foo.bar
$ setcap cap_sys_nice+ep fs1/foo.bar
$ btrfs subvolume snapshot -r fs1 fs1/snap_init
$ btrfs send fs1/snap_init | btrfs receive fs2
$ chgrp adm fs1/foo.bar
$ setcap cap_sys_nice+ep fs1/foo.bar
$ btrfs subvolume snapshot -r fs1 fs1/snap_complete
$ btrfs subvolume snapshot -r fs1 fs1/snap_incremental
$ btrfs send fs1/snap_complete | btrfs receive fs2
$ btrfs send -p fs1/snap_init fs1/snap_incremental | btrfs receive fs2
At this point, only a chown was emitted by "btrfs send" since only the
group was changed. This makes the cap_sys_nice capability to be dropped
from fs2/snap_incremental/foo.bar
To fix that, only emit capabilities after chown is emitted. The current
code first checks for xattrs that are new/changed, emits them, and later
emit the chown. Now, __process_new_xattr skips capabilities, letting
only finish_inode_if_needed to emit them, if they exist, for the inode
being processed.
This behavior was being worked around in "btrfs receive" side by caching
the capability and only applying it after chown. Now, xattrs are only
emmited _after_ chown, making that workaround not needed anymore.
Link: https://github.com/kdave/btrfs-progs/issues/202
CC: stable@vger.kernel.org # 4.4+
Suggested-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When scrubbing a stripe, whenever we find an extent we lookup for its
checksums in the checksum tree. However we do it even for metadata extents
which don't have checksum items stored in the checksum tree, that is
only for data extents.
So make the lookup for checksums only if we are processing with a data
extent.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The helpers btrfs_freeze_block_group() and btrfs_unfreeze_block_group()
used to be named btrfs_get_block_group_trimming() and
btrfs_put_block_group_trimming() respectively.
At the time they were added to free-space-cache.c, by commit e33e17ee10
("btrfs: add missing discards when unpinning extents with -o discard")
because all the trimming related functions were in free-space-cache.c.
Now that the helpers were renamed and are used in scrub context as well,
move them to block-group.c, a much more logical location for them.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Back in 2014, commit 04216820fe ("Btrfs: fix race between fs trimming
and block group remove/allocation"), I added the 'trimming' member to the
block group structure. Its purpose was to prevent races between trimming
and block group deletion/allocation by pinning the block group in a way
that prevents its logical address and device extents from being reused
while trimming is in progress for a block group, so that if another task
deletes the block group and then another task allocates a new block group
that gets the same logical address and device extents while the trimming
task is still in progress.
After the previous fix for scrub (patch "btrfs: fix a race between scrub
and block group removal/allocation"), scrub now also has the same needs that
trimming has, so the member name 'trimming' no longer makes sense.
Since there is already a 'pinned' member in the block group that refers
to space reservations (pinned bytes), rename the member to 'frozen',
add a comment on top of it to describe its general purpose and rename
the helpers to increment and decrement the counter as well, to match
the new member name.
The next patch in the series will move the helpers into a more suitable
file (from free-space-cache.c to block-group.c).
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When scrub is verifying the extents of a block group for a device, it is
possible that the corresponding block group gets removed and its logical
address and device extents get used for a new block group allocation.
When this happens scrub incorrectly reports that errors were detected
and, if the the new block group has a different profile then the old one,
deleted block group, we can crash due to a null pointer dereference.
Possibly other unexpected and weird consequences can happen as well.
Consider the following sequence of actions that leads to the null pointer
dereference crash when scrub is running in parallel with balance:
1) Balance sets block group X to read-only mode and starts relocating it.
Block group X is a metadata block group, has a raid1 profile (two
device extents, each one in a different device) and a logical address
of 19424870400;
2) Scrub is running and finds device extent E, which belongs to block
group X. It enters scrub_stripe() to find all extents allocated to
block group X, the search is done using the extent tree;
3) Balance finishes relocating block group X and removes block group X;
4) Balance starts relocating another block group and when trying to
commit the current transaction as part of the preparation step
(prepare_to_relocate()), it blocks because scrub is running;
5) The scrub task finds the metadata extent at the logical address
19425001472 and marks the pages of the extent to be read by a bio
(struct scrub_bio). The extent item's flags, which have the bit
BTRFS_EXTENT_FLAG_TREE_BLOCK set, are added to each page (struct
scrub_page). It is these flags in the scrub pages that tells the
bio's end io function (scrub_bio_end_io_worker) which type of extent
it is dealing with. At this point we end up with 4 pages in a bio
which is ready for submission (the metadata extent has a size of
16Kb, so that gives 4 pages on x86);
6) At the next iteration of scrub_stripe(), scrub checks that there is a
pause request from the relocation task trying to commit a transaction,
therefore it submits the pending bio and pauses, waiting for the
transaction commit to complete before resuming;
7) The relocation task commits the transaction. The device extent E, that
was used by our block group X, is now available for allocation, since
the commit root for the device tree was swapped by the transaction
commit;
8) Another task doing a direct IO write allocates a new data block group Y
which ends using device extent E. This new block group Y also ends up
getting the same logical address that block group X had: 19424870400.
This happens because block group X was the block group with the highest
logical address and, when allocating Y, find_next_chunk() returns the
end offset of the current last block group to be used as the logical
address for the new block group, which is
18351128576 + 1073741824 = 19424870400
So our new block group Y has the same logical address and device extent
that block group X had. However Y is a data block group, while X was
a metadata one, and Y has a raid0 profile, while X had a raid1 profile;
9) After allocating block group Y, the direct IO submits a bio to write
to device extent E;
10) The read bio submitted by scrub reads the 4 pages (16Kb) from device
extent E, which now correspond to the data written by the task that
did a direct IO write. Then at the end io function associated with
the bio, scrub_bio_end_io_worker(), we call scrub_block_complete()
which calls scrub_checksum(). This later function checks the flags
of the first page, and sees that the bit BTRFS_EXTENT_FLAG_TREE_BLOCK
is set in the flags, so it assumes it has a metadata extent and
then calls scrub_checksum_tree_block(). That functions returns an
error, since interpreting data as a metadata extent causes the
checksum verification to fail.
So this makes scrub_checksum() call scrub_handle_errored_block(),
which determines 'failed_mirror_index' to be 1, since the device
extent E was allocated as the second mirror of block group X.
It allocates BTRFS_MAX_MIRRORS scrub_block structures as an array at
'sblocks_for_recheck', and all the memory is initialized to zeroes by
kcalloc().
After that it calls scrub_setup_recheck_block(), which is responsible
for filling each of those structures. However, when that function
calls btrfs_map_sblock() against the logical address of the metadata
extent, 19425001472, it gets a struct btrfs_bio ('bbio') that matches
the current block group Y. However block group Y has a raid0 profile
and not a raid1 profile like X had, so the following call returns 1:
scrub_nr_raid_mirrors(bbio)
And as a result scrub_setup_recheck_block() only initializes the
first (index 0) scrub_block structure in 'sblocks_for_recheck'.
Then scrub_recheck_block() is called by scrub_handle_errored_block()
with the second (index 1) scrub_block structure as the argument,
because 'failed_mirror_index' was previously set to 1.
This scrub_block was not initialized by scrub_setup_recheck_block(),
so it has zero pages, its 'page_count' member is 0 and its 'pagev'
page array has all members pointing to NULL.
Finally when scrub_recheck_block() calls scrub_recheck_block_checksum()
we have a NULL pointer dereference when accessing the flags of the first
page, as pavev[0] is NULL:
static void scrub_recheck_block_checksum(struct scrub_block *sblock)
{
(...)
if (sblock->pagev[0]->flags & BTRFS_EXTENT_FLAG_DATA)
scrub_checksum_data(sblock);
(...)
}
Producing a stack trace like the following:
[542998.008985] BUG: kernel NULL pointer dereference, address: 0000000000000028
[542998.010238] #PF: supervisor read access in kernel mode
[542998.010878] #PF: error_code(0x0000) - not-present page
[542998.011516] PGD 0 P4D 0
[542998.011929] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
[542998.012786] CPU: 3 PID: 4846 Comm: kworker/u8:1 Tainted: G B W 5.6.0-rc7-btrfs-next-58 #1
[542998.014524] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
[542998.016065] Workqueue: btrfs-scrub btrfs_work_helper [btrfs]
[542998.017255] RIP: 0010:scrub_recheck_block_checksum+0xf/0x20 [btrfs]
[542998.018474] Code: 4c 89 e6 ...
[542998.021419] RSP: 0018:ffffa7af0375fbd8 EFLAGS: 00010202
[542998.022120] RAX: 0000000000000000 RBX: ffff9792e674d120 RCX: 0000000000000000
[542998.023178] RDX: 0000000000000001 RSI: ffff9792e674d120 RDI: ffff9792e674d120
[542998.024465] RBP: 0000000000000000 R08: 0000000000000067 R09: 0000000000000001
[542998.025462] R10: ffffa7af0375fa50 R11: 0000000000000000 R12: ffff9791f61fe800
[542998.026357] R13: ffff9792e674d120 R14: 0000000000000001 R15: ffffffffc0e3dfc0
[542998.027237] FS: 0000000000000000(0000) GS:ffff9792fb200000(0000) knlGS:0000000000000000
[542998.028327] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[542998.029261] CR2: 0000000000000028 CR3: 00000000b3b18003 CR4: 00000000003606e0
[542998.030301] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[542998.031316] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[542998.032380] Call Trace:
[542998.032752] scrub_recheck_block+0x162/0x400 [btrfs]
[542998.033500] ? __alloc_pages_nodemask+0x31e/0x460
[542998.034228] scrub_handle_errored_block+0x6f8/0x1920 [btrfs]
[542998.035170] scrub_bio_end_io_worker+0x100/0x520 [btrfs]
[542998.035991] btrfs_work_helper+0xaa/0x720 [btrfs]
[542998.036735] process_one_work+0x26d/0x6a0
[542998.037275] worker_thread+0x4f/0x3e0
[542998.037740] ? process_one_work+0x6a0/0x6a0
[542998.038378] kthread+0x103/0x140
[542998.038789] ? kthread_create_worker_on_cpu+0x70/0x70
[542998.039419] ret_from_fork+0x3a/0x50
[542998.039875] Modules linked in: dm_snapshot dm_thin_pool ...
[542998.047288] CR2: 0000000000000028
[542998.047724] ---[ end trace bde186e176c7f96a ]---
This issue has been around for a long time, possibly since scrub exists.
The last time I ran into it was over 2 years ago. After recently fixing
fstests to pass the "--full-balance" command line option to btrfs-progs
when doing balance, several tests started to more heavily exercise balance
with fsstress, scrub and other operations in parallel, and therefore
started to hit this issue again (with btrfs/061 for example).
Fix this by having scrub increment the 'trimming' counter of the block
group, which pins the block group in such a way that it guarantees neither
its logical address nor device extents can be reused by future block group
allocations until we decrement the 'trimming' counter. Also make sure that
on each iteration of scrub_stripe() we stop scrubbing the block group if
it was removed already.
A later patch in the series will rename the block group's 'trimming'
counter and its helpers to a more generic name, since now it is not used
exclusively for pinning while trimming anymore.
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The extent references v0 have been superseded long time go, there are
some unused declarations of access helpers. We can safely remove them
now. The struct btrfs_extent_ref_v0 is not used anywhere, but struct
btrfs_extent_item_v0 is still part of a backward compatibility check in
relocation.c and thus not removed.
Signed-off-by: David Sterba <dsterba@suse.com>
There's no callers in-tree anymore since
commit d24ee97b96 ("btrfs: use new helpers to set uuids in eb")
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
For the following operation, qgroup is guaranteed to be screwed up due
to snapshot adding to a new qgroup:
# mkfs.btrfs -f $dev
# mount $dev $mnt
# btrfs qgroup en $mnt
# btrfs subv create $mnt/src
# xfs_io -f -c "pwrite 0 1m" $mnt/src/file
# sync
# btrfs qgroup create 1/0 $mnt/src
# btrfs subv snapshot -i 1/0 $mnt/src $mnt/snapshot
# btrfs qgroup show -prce $mnt/src
qgroupid rfer excl max_rfer max_excl parent child
-------- ---- ---- -------- -------- ------ -----
0/5 16.00KiB 16.00KiB none none --- ---
0/257 1.02MiB 16.00KiB none none --- ---
0/258 1.02MiB 16.00KiB none none 1/0 ---
1/0 0.00B 0.00B none none --- 0/258
^^^^^^^^^^^^^^^^^^^^
[CAUSE]
The problem is in btrfs_qgroup_inherit(), we don't have good enough
check to determine if the new relation would break the existing
accounting.
Unlike btrfs_add_qgroup_relation(), which has proper check to determine
if we can do quick update without a rescan, in btrfs_qgroup_inherit() we
can even assign a snapshot to multiple qgroups.
[FIX]
Fix it by manually marking qgroup inconsistent for snapshot inheritance.
For subvolume creation, since all its extents are exclusively owned, we
don't need to rescan.
In theory, we should call relation check like quick_update_accounting()
when doing qgroup inheritance and inform user about qgroup accounting
inconsistency.
But we don't have good mechanism to relay that back to the user in the
snapshot creation context, thus we can only silently mark the qgroup
inconsistent.
Anyway, user shouldn't use qgroup inheritance during snapshot creation,
and should add qgroup relationship after snapshot creation by 'btrfs
qgroup assign', which has a much better UI to inform user about qgroup
inconsistent and kick in rescan automatically.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When mounting, we handle deleted subvolume and orphan items. First,
find add orphan roots, then add them to fs_root radix tree. Second, in
tree-root, process each orphan item, skip if it is dead root.
The original algorithm is based on the list of dead_roots, one by one to
visit and check whether the objectid is consistent, the time complexity
is O (n ^ 2). When processing 50000 deleted subvols, it takes about
120s.
Because btrfs_find_orphan_roots has already ran before us, and added
deleted subvol to fs_roots radix tree.
The fs root will only be removed from the fs_roots radix tree after the
cleaner process is started, and the cleaner will only start execution
after the mount is complete.
btrfs_orphan_cleanup can be called during the whole filesystem mount
lifetime, but only "tree root" will be used in this section of code, and
only mount time will be brought into tree root.
So we can quickly check whether the orphan item is dead root through the
fs_roots radix tree.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: David Sterba <dsterba@suse.com>
I've grepped logs for 'errno=.*unknown' and found -95, -117 and -122,
now added to the table. The wording is adjusted so it makes sense in
context of filesystem.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When an old device has new fsid through 'btrfs device add -f <dev>' our
fs_devices list has an alien device in one of the fs_devices lists.
By having an alien device in fs_devices, we have two issues so far
1. missing device does not not show as missing in the userland
2. degraded mount will fail
Both issues are caused by the fact that there's an alien device in the
fs_devices list. (Alien means that it does not belong to the filesystem,
identified by fsid, or does not contain btrfs filesystem at all, eg. due
to overwrite).
A device can be scanned/added through the control device ioctls
SCAN_DEV, DEVICES_READY or by ADD_DEV.
And device coming through the control device is checked against the all
other devices in the lists, but this was not the case for ADD_DEV.
This patch fixes both issues above by removing the alien device.
CC: stable@vger.kernel.org # 5.4+
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_free_extra_devids() updates fs_devices::latest_bdev to point to
the bdev with greatest device::generation number. For a typical-missing
device the generation number is zero so fs_devices::latest_bdev will
never point to it.
But if the missing device is due to alienation [1], then
device::generation is not zero and if it is greater or equal to the rest
of device generations in the list, then fs_devices::latest_bdev ends up
pointing to the missing device and reports the error like [2].
[1] We maintain devices of a fsid (as in fs_device::fsid) in the
fs_devices::devices list, a device is considered as an alien device
if its fsid does not match with the fs_device::fsid
Consider a working filesystem with raid1:
$ mkfs.btrfs -f -d raid1 -m raid1 /dev/sda /dev/sdb
$ mount /dev/sda /mnt-raid1
$ umount /mnt-raid1
While mnt-raid1 was unmounted the user force-adds one of its devices to
another btrfs filesystem:
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt-single
$ btrfs dev add -f /dev/sda /mnt-single
Now the original mnt-raid1 fails to mount in degraded mode, because
fs_devices::latest_bdev is pointing to the alien device.
$ mount -o degraded /dev/sdb /mnt-raid1
[2]
mount: wrong fs type, bad option, bad superblock on /dev/sdb,
missing codepage or helper program, or other error
In some cases useful info is found in syslog - try
dmesg | tail or so.
kernel: BTRFS warning (device sdb): devid 1 uuid 072a0192-675b-4d5a-8640-a5cf2b2c704d is missing
kernel: BTRFS error (device sdb): failed to read devices
kernel: BTRFS error (device sdb): open_ctree failed
Fix the root cause by checking if the device is not missing before it
can be considered for the fs_devices::latest_bdev.
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Use crypto_shash_digest() instead of crypto_shash_init() +
crypto_shash_update() + crypto_shash_final(). This is more efficient.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is no need of goto out in open_fs_devices() as there is nothing
special done there.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At btrfs_log_prealloc_extents() we are checking if copy_items() returns a
value greater than 0. That used to happen in the past to signal the caller
that the path given to it was released and reused for other searches, but
as of commit 0e56315ca1 ("Btrfs: fix missing hole after hole punching
and fsync when using NO_HOLES"), the copy_items() function does not have
that behaviour anymore and always returns 0 or a negative value. So just
remove that check at btrfs_log_prealloc_extents(), which the previously
mentioned commit forgot to remove.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently, direct I/O has its own versions of bio_readpage_error() and
btrfs_check_repairable() (dio_read_error() and
btrfs_check_dio_repairable(), respectively). The main difference is that
the direct I/O version doesn't do read validation. The rework of direct
I/O repair makes it possible to do validation, so we can get rid of
btrfs_check_dio_repairable() and combine bio_readpage_error() and
dio_read_error() into a new helper, btrfs_submit_read_repair().
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This was originally added in commit 8b110e393c ("Btrfs: implement
repair function when direct read fails") to avoid a deadlock. In that
commit, the direct I/O read endio executes on the endio_workers
workqueue, submits a repair bio, and waits for it to complete. The
repair bio endio must execute on a different workqueue, otherwise it
could block on the endio_workers workqueue becoming available, which
won't happen because the original endio is blocked on the repair bio.
As of the previous commit, the original endio doesn't wait for the
repair bio, so this separate workqueue is unnecessary.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Direct I/O read repair was originally implemented in commit 8b110e393c
("Btrfs: implement repair function when direct read fails"). This
implementation is unnecessarily complicated. There is major code
duplication between __btrfs_subio_endio_read() (checks checksums and
handles I/O errors for files with checksums),
__btrfs_correct_data_nocsum() (handles I/O errors for files without
checksums), btrfs_retry_endio() (checks checksums and handles I/O errors
for retries of files with checksums), and btrfs_retry_endio_nocsum()
(handles I/O errors for retries of files without checksum). If it sounds
like these should be one function, that's because they should.
Additionally, these functions are very hard to follow due to their
excessive use of goto.
This commit replaces the original implementation. After the previous
commit getting rid of orig_bio, we can reuse the same endio callback for
repair I/O and the original I/O, we just need to track the file offset
and original iterator in the repair bio. We can also unify the handling
of files with and without checksums and simplify the control flow. We
also no longer have to wait for each repair I/O to complete one by one.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In the worst case, there are _4_ layers of bios in the Btrfs direct I/O
path:
1. The bio created by the generic direct I/O code (dio_bio).
2. A clone of dio_bio we create in btrfs_submit_direct() to represent
the entire direct I/O range (orig_bio).
3. A partial clone of orig_bio limited to the size of a RAID stripe that
we create in btrfs_submit_direct_hook().
4. Clones of each of those split bios for each RAID stripe that we
create in btrfs_map_bio().
As of the previous commit, the second layer (orig_bio) is no longer
needed for anything: we can split dio_bio instead, and complete dio_bio
directly when all of the cloned bios complete. This lets us clean up a
bunch of cruft, including dip->subio_endio and dip->errors (we can use
dio_bio->bi_status instead). It also enables the next big cleanup of
direct I/O read repair.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The next commit will get rid of btrfs_dio_private->orig_bio. The only
thing we really need it for is containing all of the checksums, but we
can easily put the checksum array in btrfs_dio_private and have the
submitted bios reference the array. We can also look the checksums up
while we're setting up instead of the current awkward logic that looks
them up for orig_bio when the first split bio is submitted.
(Interestingly, btrfs_dio_private did contain the
checksums before commit 23ea8e5a07 ("Btrfs: load checksum data once
when submitting a direct read io"), but it didn't look them up up
front.)
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is really a reference count now, so convert it to refcount_t and
rename it to refs.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We haven't used this since commit 9be3395bcd ("Btrfs: use a btrfs
bioset instead of abusing bio internals").
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since its introduction in commit 2fe6303e7c ("Btrfs: split
bio_readpage_error into several functions"), btrfs_check_repairable()
has only been used from extent_io.c where it is defined.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
__readpage_endio_check() is also used from the direct I/O read code, so
give it a more descriptive name.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Fix a couple of issues in the btrfs_lookup_bio_sums documentation:
* The bio doesn't need to be a btrfs_io_bio if dst was provided. Move
the declaration in the code to make that clear, too.
* dst must be large enough to hold nblocks * csum_size, not just
csum_size.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The purpose of the validation step is to distinguish between good and
bad sectors in a failed multi-sector read. If a multi-sector read
succeeded but some of those sectors had checksum errors, we don't need
to validate anything; we know the sectors with bad checksums need to be
repaired.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Read repair does two things: it finds a good copy of data to return to
the reader, and it corrects the bad copy on disk. If a read of multiple
sectors has an I/O error, repair does an extra "validation" step that
issues a separate read for each sector. This allows us to find the exact
failing sectors and only rewrite those.
This heuristic is implemented in
bio_readpage_error()/btrfs_check_repairable() as:
failed_bio_pages = failed_bio->bi_iter.bi_size >> PAGE_SHIFT;
if (failed_bio_pages > 1)
do validation
However, at this point, bi_iter may have already been advanced. This
means that we'll skip the validation step and rewrite the entire failed
read.
Fix it by getting the actual size from the biovec (which we can do
because this is only called for non-cloned bios, although that will
change in a later commit).
Fixes: 8a2ee44a37 ("btrfs: look at bi_size for repair decisions")
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In btrfs_submit_direct(), if we fail to allocate the btrfs_dio_private,
we complete the ordered extent range. However, we don't mark that the
range doesn't need to be cleaned up from btrfs_direct_IO() until later.
Therefore, if we fail to allocate the btrfs_dio_private, we complete the
ordered extent range twice. We could fix this by updating
unsubmitted_oe_range earlier, but it's cleaner to reorganize the code so
that creating the btrfs_dio_private and submitting the bios are
separate, and once the btrfs_dio_private is created, cleanup always
happens through the btrfs_dio_private.
The logic around unsubmitted_oe_range_end and unsubmitted_oe_range_start
is really subtle. We have the following:
1. btrfs_direct_IO sets those two to the same value.
2. When we call __blockdev_direct_IO unless
btrfs_get_blocks_direct->btrfs_get_blocks_direct_write is called to
modify unsubmitted_oe_range_start so that start < end. Cleanup
won't happen.
3. We come into btrfs_submit_direct - if it dip allocation fails we'd
return with oe_range_end now modified so cleanup will happen.
4. If we manage to allocate the dip we reset the unsubmitted range
members to be equal so that cleanup happens from
btrfs_endio_direct_write.
This 4-step logic is not really obvious, especially given it's scattered
across 3 functions.
Fixes: f28a492878 ("Btrfs: fix leaking of ordered extents after direct IO write error")
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
[ add range start/end logic explanation from Nikolay ]
Signed-off-by: David Sterba <dsterba@suse.com>
In btrfs_submit_direct_hook(), if a direct I/O write doesn't span a RAID
stripe or chunk, we submit orig_bio without cloning it. In this case, we
don't increment pending_bios. Then, if btrfs_submit_dio_bio() fails, we
decrement pending_bios to -1, and we never complete orig_bio. Fix it by
initializing pending_bios to 1 instead of incrementing later.
Fixing this exposes another bug: we put orig_bio prematurely and then
put it again from end_io. Fix it by not putting orig_bio.
After this change, pending_bios is really more of a reference count, but
I'll leave that cleanup separate to keep the fix small.
Fixes: e65e153554 ("btrfs: fix panic caused by direct IO")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At clean_pinned_extents(), whether we end up returning success or failure,
we pretty much have to do the same things:
1) unlock unused_bg_unpin_mutex
2) decrement reference count on the previous transaction
We also call btrfs_dec_block_group_ro() in case of failure, but that is
better done in its caller, btrfs_delete_unused_bgs(), since its the
caller that calls inc_block_group_ro(), so it should be responsible for
the decrement operation, as it is in case any of the other functions it
calls fail.
So move the call to btrfs_dec_block_group_ro() from clean_pinned_extents()
into btrfs_delete_unused_bgs() and unify the error and success return
paths for clean_pinned_extents(), reducing duplicated code and making it
simpler.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All callers pass the eb::level so we can get read it directly inside the
btrfs_bin_search and key_search.
This is inspired by the work of Marek in U-boot.
CC: Marek Behun <marek.behun@nic.cz>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of returning both the page and the super block structure, make
btrfs_read_disk_super just return a pointer to struct btrfs_disk_super.
As a result the function signature is simplified. Also,
read_cache_page_gfp can never return NULL so check its return value only
for IS_ERR.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function always works on a local copy of the reloc root list, which
cannot be modified outside of it so using list_for_each_entry is fine.
Additionally the macro handles empty lists so drop list_empty checks of
callers. No semantic changes.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Deleting a subvolume on a full filesystem leads to ENOSPC followed by a
forced read-only. This is not a transaction abort and the filesystem is
otherwise ok, so the error should be just propagated to the callers.
This is caused by unnecessary call to btrfs_handle_fs_error for all
errors, except EAGAIN. This does not make sense as the standard
transaction abort mechanism is in btrfs_drop_snapshot so all relevant
failures are handled.
Originally in commit cb1b69f450 ("Btrfs: forced readonly when
btrfs_drop_snapshot() fails") there was no return value at all, so the
btrfs_std_error made some sense but once the error handling and
propagation has been implemented we don't need it anymore.
Signed-off-by: David Sterba <dsterba@suse.com>
The reclaim_size counter of a space_info object is unsigned. So its value
can never be negative, it's pointless to have an assertion that checks
its value is >= 0, therefore remove it.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Remove the duplicate definition of 'inode_item_err' in the file
tree-checker.c that got there by accident in c23c77b097 ("btrfs:
tree-checker: Refactor inode key check into seperate function").
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Zheng Wei <wei.zheng@vivo.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay noticed a bunch of test failures with my global rsv steal
patches. At first he thought they were introduced by them, but they've
been failing for a while with 64k nodes.
The problem is with 64k nodes we have a global reserve that calculates
out to 13MiB on a freshly made file system, which only has 8MiB of
metadata space. Because of changes I previously made we no longer
account for the global reserve in the overcommit logic, which means we
correctly allow overcommit to happen even though we are already
overcommitted.
However in some corner cases, for example btrfs/170, we will allocate
the entire file system up with data chunks before we have enough space
pressure to allocate a metadata chunk. Then once the fs is full we
ENOSPC out because we cannot overcommit and the global reserve is taking
up all of the available space.
The most ideal way to deal with this is to change our space reservation
stuff to take into account the height of the tree's that we're
modifying, so that our global reserve calculation does not end up so
obscenely large.
However that is a huge undertaking. Instead fix this by forcing a chunk
allocation if the global reserve is larger than the total metadata
space. This gives us essentially the same behavior that happened
before, we get a chunk allocated and these tests can pass.
This is meant to be a stop-gap measure until we can tackle the "tree
height only" project.
Fixes: 0096420adb ("btrfs: do not account global reserve in can_overcommit")
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
With normal tickets we could have a large reservation at the front of
the list that is unable to be satisfied, but a smaller ticket later on
that can be satisfied. The way we handle this is to run
btrfs_try_granting_tickets() in maybe_fail_all_tickets().
However no such protection exists for priority tickets. Fix this by
handling it in handle_reserve_ticket(). If we've returned after
attempting to flush space in a priority related way, we'll still be on
the priority list and need to be removed.
We rely on the flushing to free up space and wake the ticket, but if
there is not enough space to reclaim _but_ there's enough space in the
space_info to handle subsequent reservations then we would have gotten
an ENOSPC erroneously.
Address this by catching where we are still on the list, meaning we were
a priority ticket, and removing ourselves and then running
btrfs_try_granting_tickets(). This will handle this particular corner
case.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In debugging a generic/320 failure on ppc64, Nikolay noticed that
sometimes we'd ENOSPC out with plenty of space to reclaim if we had
committed the transaction. He further discovered that this was because
there was a priority ticket that was small enough to fit in the free
space currently in the space_info.
Consider the following scenario. There is no more space to reclaim in
the fs without committing the transaction. Assume there's 1MiB of space
free in the space info, but there are pending normal tickets with 2MiB
reservations.
Now a priority ticket comes in with a .5MiB reservation. Because we
have normal tickets pending we add ourselves to the priority list,
despite the fact that we could satisfy this reservation.
The flushing machinery now gets to the point where it wants to commit
the transaction, but because there's a .5MiB ticket on the priority list
and we have 1MiB of free space we assume the ticket will be granted
soon, so we bail without committing the transaction.
Meanwhile the priority flushing does not commit the transaction, and
eventually fails with an ENOSPC. Then all other tickets are failed with
ENOSPC because we were never able to actually commit the transaction.
The fix for this is we should have simply granted the priority flusher
his reservation, because there was space to make the reservation.
Priority flushers by definition take priority, so they are allowed to
make their reservations before any previous normal tickets. By not
adding this priority ticket to the list the normal flushing mechanisms
will then commit the transaction and everything will continue normally.
We still need to serialize ourselves with other priority tickets, so if
there are any tickets on the priority list then we need to add ourselves
to that list in order to maintain the serialization between priority
tickets.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
On ppc64le with 64k page size (respectively 64k block size) generic/320
was failing and debug output showed we were getting a premature ENOSPC
with a bunch of space in btrfs_fs_info::trans_block_rsv.
This meant there were still open transaction handles holding space, yet
the flusher didn't commit the transaction because it deemed the freed
space won't be enough to satisfy the current reserve ticket. Fix this
by accounting for space in trans_block_rsv when deciding whether the
current transaction should be committed or not.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We previously had a limit of stealing 50% of the global reserve for
unlink. This was from a time when the global reserve was used for the
delayed refs as well. However now those reservations are kept separate,
so the global reserve can be depleted much more to allow us to make
progress for space restoring operations like unlink. Change the minimum
amount of space required to be left in the global reserve to 10%.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For unlink transactions and block group removal
btrfs_start_transaction_fallback_global_rsv will first try to start an
ordinary transaction and if it fails it will fall back to reserving the
required amount by stealing from the global reserve. This is problematic
because of all the same reasons we had with previous iterations of the
ENOSPC handling, thundering herd. We get a bunch of failures all at
once, everybody tries to allocate from the global reserve, some win and
some lose, we get an ENSOPC.
Fix this behavior by introducing BTRFS_RESERVE_FLUSH_ALL_STEAL. It's
used to mark unlink reservation. To fix this we need to integrate this
logic into the normal ENOSPC infrastructure. We still go through all of
the normal flushing work, and at the moment we begin to fail all the
tickets we try to satisfy any tickets that are allowed to steal by
stealing from the global reserve. If this works we start the flushing
system over again just like we would with a normal ticket satisfaction.
This serializes our global reserve stealing, so we don't have the
thundering herd problem.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For relocation tree detection, relocation backref cache uses
btrfs_should_ignore_reloc_root() which uses relocation-specific checks
like checking the DEAD_RELOC_ROOT bit.
However for general purpose backref cache, we can rely on that check, as
it's possible that relocation is also running.
For generic purposed backref cache, we detect reloc root by
SHARED_BLOCK_REF item. Only reloc root node has its parent bytenr
pointing back to itself.
And in that case, backref cache will mark the reloc root node useless,
dropping any child orphan nodes.
So only call btrfs_should_ignore_reloc_root() if the backref cache is
for relocation.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The error cleanup will be extracted as a new function,
btrfs_backref_error_cleanup(), and moved to backref.c and exported for
later usage.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This the the 2nd major part of generic backref cache. Move it to
backref.c so we can reuse it.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function is the major part of backref cache build process, move it
to backref.c so we can reuse it later.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The backref code is going to be moved to backref.c, and read_fs_root()
is just a simple wrapper, open-code it to prepare to the incoming code
move.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function is mostly single purpose to relocation backref cache, but
since we're moving the main part of backref cache to backref.c, we need
to export such function.
And to avoid confusion, rename the function to
btrfs_should_ignore_reloc_root() make the name a little more clear.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Also change the parameter, since all callers can easily grab an fs_info,
there is no need for all the pointer chasing.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since we're releasing all existing nodes/edges, other than cleanup the
mess after error, "release" is a more proper naming here.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Also add comment explaining the cleanup progress, to differ it from
btrfs_backref_drop_node().
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
With extra comment for drop_backref_node() as it has some similarity
with remove_backref_node(), thus we need extra comment explaining the
difference.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Structure tree_entry provides a very simple rb_tree which only uses
bytenr as search index.
That tree_entry is used in 3 structures: backref_node, mapping_node and
tree_block.
Since we're going to make backref_node independnt from relocation, it's
a good time to extract the tree_entry into rb_simple_node, and export it
into misc.h.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
These 3 structures are the main part of btrfs backref cache, move them
to backref.h to build the basis for later reuse.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Those three structures are the main elements of backref cache. Add the
"btrfs_" prefix for later export.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This patch will also add some comment for the cleanup.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
After handle_one_tree_backref(), all newly added (not cached) edges and
nodes have the following features:
- Only backref_edge::list[LOWER] is linked.
This means, we can only iterate from botton to top, not the other
direction.
- Newly added nodes are not added to cache rb_tree yet
So to finish the backref cache, we still need to finish the links and
add all nodes into backref cache rb_tree.
This patch will refactor the existing code into finish_upper_links(),
add more comments of each branch, and why we need to do all the work.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
build_backref_tree() uses "goto again;" to implement a breadth-first
search to build backref cache.
This patch will extract most of its work into a wrapper,
handle_one_tree_block(), and use a do {} while() loop to implement the
same thing.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Bytenr and level are essential parameters for backref_node, thus it
makes sense to initialize them at allocation time.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since backref_edge is used to connect upper and lower backref nodes, and
needs to access both nodes, some code can look pretty nasty:
list_add_tail(&edge->list[LOWER], &cur->upper);
The above code will link @cur to the LOWER side of the edge, while both
"LOWER" and "upper" words show up. This can sometimes be very confusing
for reader to grasp.
This patch introduces a new wrapper, link_backref_edge(), to handle the
linking behavior. Which also has extra ASSERT() to ensure caller won't
pass wrong nodes.
Also, this updates the comment of related lists of backref_node and
backref_edge, to make it more clear that each list points to what.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The processing of indirect tree backref (TREE_BLOCK_REF) is the most
complex work.
We need to grab the fs root, do a tree search to locate all its parent
nodes, link all needed edges, and put all uncached edges to pending edge
list.
This is definitely worth a helper function.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For BTRFS_SHARED_BLOCK_REF_KEY, its processing is straightforward, as we
now the parent node bytenr directly.
If the parent is already cached, or a root, call it a day.
If the parent is not cached, add it pending list.
This patch will just refactor this part into its own function,
handle_direct_tree_backref() and add some comment explaining the
@ref_key parameter.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
find_reloc_root() searches reloc_control::reloc_root_tree to find the
reloc root. This behavior is only useful for relocation backref cache.
For the incoming more generic purpose backref cache, we don't care
about who owns the reloc root, but only care if it's a reloc root.
So this patch makes the following modifications to make the reloc root
search more specific to relocation backref:
- Add backref_node::is_reloc_root
This will be an extra indicator for generic purposed backref cache.
User doesn't need to read root key from backref_node::root to
determine if it's a reloc root.
Also for reloc tree root, it's useless and will be queued to useless
list.
- Add backref_cache::is_reloc
This will allow backref cache code to do different behavior for
generic purpose backref cache and relocation backref cache.
- Pass fs_info to find_reloc_root()
- Export find_reloc_root()
So backref.c can utilize this function.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add this member so that we can grab fs_info without the help from
reloc_control.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
These two new members will act the same as the existing local lists,
@useless and @list in build_backref_tree().
Currently build_backref_tree() is only executed serially, thus moving
such local list into backref_cache is still safe.
Also since we're here, use list_first_entry() to replace a lot of
list_entry() calls after !list_empty().
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
These two functions are weirdly named, mark_block_processed() in fact
just marks a range dirty unconditionally, while __mark_block_processed()
does extra check before doing the marking.
This patch will open code old mark_block_processed, and rename
__mark_block_processed() to remove the "__" prefix.
Since we're here, also kill the forward declaration, which could also
kill in_block_group() with in_range() macro.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In the core function of relocation, build_backref_tree, it needs to
iterate all backref items of one tree block.
Use btrfs_backref_iter infrastructure to do the loop and make the code
more readable.
The backref items look would be much more easier to read:
ret = btrfs_backref_iter_start(iter, cur->bytenr);
for (; ret == 0; ret = btrfs_backref_iter_next(iter)) {
/* The really important work */
}
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function will go to the next inline/keyed backref for
btrfs_backref_iter infrastructure.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Due to the complex nature of btrfs extent tree, when we want to iterate
all backrefs of one extent, this involves quite a lot of work, like
searching the EXTENT_ITEM/METADATA_ITEM, iteration through inline and keyed
backrefs.
Normally this would result in a complex code, something like:
btrfs_search_slot()
/* Ensure we are at EXTENT_ITEM/METADATA_ITEM */
while (1) { /* Loop for extent tree items */
while (ptr < end) { /* Loop for inlined items */
/* Real work here */
}
next:
ret = btrfs_next_item()
/* Ensure we're still at keyed item for specified bytenr */
}
The idea of btrfs_backref_iter is to avoid such complex and hard to
read code structure, but something like the following:
iter = btrfs_backref_iter_alloc();
ret = btrfs_backref_iter_start(iter, bytenr);
if (ret < 0)
goto out;
for (; ; ret = btrfs_backref_iter_next(iter)) {
/* Real work here */
}
out:
btrfs_backref_iter_free(iter);
This patch is just the skeleton + btrfs_backref_iter_start() code.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Sparse reports a warning at btrfs_tree_lock()
warning: context imbalance in btrfs_tree_lock() - wrong count at exit
The root cause is the missing annotation at btrfs_tree_lock()
Add the missing __acquires(&eb->lock) annotation
Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Sparse reports a warning at btrfs_lock_cluster()
warning: context imbalance in btrfs_lock_cluster()
- wrong count
The root cause is the missing annotation at btrfs_lock_cluster()
Add the missing __acquires(&cluster->refill_lock) annotation.
Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The comments in fanotify_group_event_mask() say:
"If the event is on dir/child and this mark doesn't care about
events on dir/child, don't send it!"
Specifically, mount and filesystem marks do not care about events
on child, but they can still specify an ignore mask for those events.
For example, a group that has:
- A mount mark with mask 0 and ignore_mask FAN_OPEN
- An inode mark on a directory with mask FAN_OPEN | FAN_OPEN_EXEC
with flag FAN_EVENT_ON_CHILD
A child file open for exec would be reported to group with the FAN_OPEN
event despite the fact that FAN_OPEN is in ignore mask of mount mark,
because the mark iteration loop skips over non-inode marks for events
on child when calculating the ignore mask.
Move ignore mask calculation to the top of the iteration loop block
before excluding marks for events on dir/child.
Link: https://lore.kernel.org/r/20200524072441.18258-1-amir73il@gmail.com
Reported-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/linux-fsdevel/20200521162443.GA26052@quack2.suse.cz/
Fixes: 55bf882c7f "fanotify: fix merging marks masks with FAN_ONDIR"
Fixes: b469e7e47c "fanotify: fix handling of events on child..."
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Shutdown test is somtimes hung, since it keeps trying to flush dirty node pages
in an inifinite loop. Let's drop dirty pages at umount in that case.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The MSCC bug fix in 'net' had to be slightly adjusted because the
register accesses are done slightly differently in net-next.
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking fixes from David Miller:
1) Fix RCU warnings in ipv6 multicast router code, from Madhuparna
Bhowmik.
2) Nexthop attributes aren't being checked properly because of
mis-initialized iterator, from David Ahern.
3) Revert iop_idents_reserve() change as it caused performance
regressions and was just working around what is really a UBSAN bug
in the compiler. From Yuqi Jin.
4) Read MAC address properly from ROM in bmac driver (double iteration
proceeds past end of address array), from Jeremy Kerr.
5) Add Microsoft Surface device IDs to r8152, from Marc Payne.
6) Prevent reference to freed SKB in __netif_receive_skb_core(), from
Boris Sukholitko.
7) Fix ACK discard behavior in rxrpc, from David Howells.
8) Preserve flow hash across packet scrubbing in wireguard, from Jason
A. Donenfeld.
9) Cap option length properly for SO_BINDTODEVICE in AX25, from Eric
Dumazet.
10) Fix encryption error checking in kTLS code, from Vadim Fedorenko.
11) Missing BPF prog ref release in flow dissector, from Jakub Sitnicki.
12) dst_cache must be used with BH disabled in tipc, from Eric Dumazet.
13) Fix use after free in mlxsw driver, from Jiri Pirko.
14) Order kTLS key destruction properly in mlx5 driver, from Tariq
Toukan.
15) Check devm_platform_ioremap_resource() return value properly in
several drivers, from Tiezhu Yang.
* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (71 commits)
net: smsc911x: Fix runtime PM imbalance on error
net/mlx4_core: fix a memory leak bug.
net: ethernet: ti: cpsw: fix ASSERT_RTNL() warning during suspend
net: phy: mscc: fix initialization of the MACsec protocol mode
net: stmmac: don't attach interface until resume finishes
net: Fix return value about devm_platform_ioremap_resource()
net/mlx5: Fix error flow in case of function_setup failure
net/mlx5e: CT: Correctly get flow rule
net/mlx5e: Update netdev txq on completions during closure
net/mlx5: Annotate mutex destroy for root ns
net/mlx5: Don't maintain a case of del_sw_func being null
net/mlx5: Fix cleaning unmanaged flow tables
net/mlx5: Fix memory leak in mlx5_events_init
net/mlx5e: Fix inner tirs handling
net/mlx5e: kTLS, Destroy key object after destroying the TIS
net/mlx5e: Fix allowed tc redirect merged eswitch offload cases
net/mlx5: Avoid processing commands before cmdif is ready
net/mlx5: Fix a race when moving command interface to events mode
net/mlx5: Add command entry handling completion
rxrpc: Fix a memory leak in rxkad_verify_response()
...
Fix a warning due to an uninitialised variable.
le included from ../fs/afs/fs_probe.c:11:
../fs/afs/fs_probe.c: In function 'afs_fileserver_probe_result':
../fs/afs/internal.h:1453:2: warning: 'rtt_us' may be used uninitialized in this function [-Wmaybe-uninitialized]
1453 | printk("[%-6.6s] "FMT"\n", current->comm ,##__VA_ARGS__)
| ^~~~~~
../fs/afs/fs_probe.c:35:15: note: 'rtt_us' was declared here
Signed-off-by: David Howells <dhowells@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEqG5UsNXhtOCrfGQP+7dXa6fLC2sFAl7FQH4ACgkQ+7dXa6fL
C2u05BAAlilXCYGZU23/yQmMxxkmcG+jW+oDV9ySNDl6iJT+OtKxHEGReDD4D2f0
rhaBivgGcOnZy89AGjNrSVROGlwSBXl7ArJcjsfsx8AuNzxHUHQKWlW/k8n87qEt
NCTze7f65IT6NowgYAFgJn5kIpY/9iKuNiCf6NGL3Z35wqxPvwNs6AQSGM495uvB
el/ddkr8QzzjI9Ejsgzj94x4DAOjk4T4WzfWMAgyr1OEqz6vKNKkCwSKPySOsQAK
72JRaGhWA9rfAOkA7nAZpnjHdfFYnkFBOVQzmswOJYRYe3D/QY5D9PUlGIQ5OSjL
yV5YOi/+AUrSif79NfEYXga0r/NFJMFqBg2zo/eiSrhfZZFZMDcagnGhzpGjbYF1
IaeIu4q/MQOQybi8m1GJhvFfPOhdKRn731jlsUvEoxK0TonSu/u64eus+qelQxOd
uiIcu/kLxfPZSznUd8cXZ+Pffce0uBIRWq0nRQZ703TyHY+/gYo7ZGHr/FZNKaK4
lRNP4Nu3goLQCI40R7y7USnpX+kWfd4mYC9zl+VBSXG1JymYbOezXYrNATBCqpo6
9VoYtqDdo8ESksFUBqM7fGRDZ20nah6KdRGmnrPU+rpODHZEZmNN7D/rayU31wua
VIbVw1WluvSVnQ8+b1BwJwvhJQ4CazyGcnbxDqx1zd07EjUiiJI=
=jJvy
-----END PGP SIGNATURE-----
Merge tag 'rxrpc-fixes-20200520' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
David Howells says:
====================
rxrpc: Fix retransmission timeout and ACK discard
Here are a couple of fixes and an extra tracepoint for AF_RXRPC:
(1) Calculate the RTO pretty much as TCP does, rather than making
something up, including an initial 4s timeout (which causes return
probes from the fileserver to fail if a packet goes missing), and add
backoff.
(2) Fix the discarding of out-of-order received ACKs. We mustn't let the
hard-ACK point regress, nor do we want to do unnecessary
retransmission because the soft-ACK list regresses. This is not
trivial, however, due to some loose wording in various old protocol
specs, the ACK field that should be used for this sometimes has the
wrong information in it.
(3) Add a tracepoint to log a discarded ACK.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl7IByoQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpigmEACWJoK7zk5OK2RhavpzOsb2SDu8nz/YAbUe
+R6tXAjwe4Z7lVVa+FW/fmN9/mQcjyYRIbbG564IFs5fe6hPUoOjUzHqvGTOFLHd
Fw8mjVKgWjWAE5GdoX6ATauLhVwwjnImej1PNfO/J5y29o0SQksP8MbM0eGuuNx1
piqxBj0/3h3YyPn1GeJmqxwwcsFhzHDqk7fbkfbQokZk+7SPiKpqWgJBa7AKSlNC
N0WTluT4UOummQZw1RFynPfA4cCuX6XHVgWAa9h7vrJHXigvuMWqLaHG+MBFqeKu
xD6PPnaCnMwcLRe4T2sJvtjxmNSdyr15Q2kGkIi/RhohSIn4u/y8jEA6wTprCP48
rDi30dn1o2LwUj2S1NO3YCOV8jIKWUguztEvKiAXmjf4KDZIDd4/OwrFsJdb4vg9
EuK86SEwXbvFHf9nu1M7pHlGThKfQi0CiK6C6M7Qb/kOthio72wwZ46gGkwLDk5z
DZWHymHBhQw/z1c20loX7pBvFIzLzbuYUThf23UegPzXVqqQfBkqs4BGFcOGuqy6
yfEYF/MAX/O/TQgm2dDQHrhl05AevLu/UQXMXZ8Ha6OrmlC4C2qu3Te/iZO8FUew
YIx5H5XmBh93McjpmJ8VCn7CjE+y/ufNTMdvm8WzCyAIfH40gfcyLangpre26QoJ
CCAARffXrQ==
=ZYUy
-----END PGP SIGNATURE-----
Merge tag 'io_uring-5.7-2020-05-22' of git://git.kernel.dk/linux-block
Pull io_uring fixes from Jens Axboe:
"A small collection of small fixes that should go into this release:
- Two fixes for async request preparation (Pavel)
- Busy clear fix for SQPOLL (Xiaoguang)
- Don't use kiocb->private for O_DIRECT buf index, some file systems
use it (Bijan)
- Kill dead check in io_splice()
- Ensure sqo_wait is initialized early
- Cancel task_work if we fail adding to original process
- Only add (IO)pollable requests to iopoll list, fixing a regression
in this merge window"
* tag 'io_uring-5.7-2020-05-22' of git://git.kernel.dk/linux-block:
io_uring: reset -EBUSY error when io sq thread is waken up
io_uring: don't add non-IO requests to iopoll pending list
io_uring: don't use kiocb.private to store buf_index
io_uring: cancel work if task_work_add() fails
io_uring: remove dead check in io_splice()
io_uring: fix FORCE_ASYNC req preparation
io_uring: don't prepare DRAIN reqs twice
io_uring: initialize ctx->sqo_wait earlier
The argument isn't used by any caller, and drivers don't fill out
bi_sector for flush requests either.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Define ext2_listxattr to NULL when CONFIG_EROFS_FS_XATTR
is not enabled, then we can remove many ugly ifdef macros
in the code.
Link: https://lore.kernel.org/r/20200522044035.24190-2-cgxu519@mykernel.net
Signed-off-by: Chengguang Xu <cgxu519@mykernel.net>
Signed-off-by: Jan Kara <jack@suse.cz>
Let's always set special inode i_op to &ext2_special_inode_operations
regardless of CONFIG_EXT2_FS_XATTR setting. It makes sence to be able to
query extended inode flags (needing ->setattr and ->getattr callbacks)
even when CONFIG_EXT2_FS_XATTR is not set.
Link: https://lore.kernel.org/r/20200522044035.24190-1-cgxu519@mykernel.net
Signed-off-by: Chengguang Xu <cgxu519@mykernel.net>
Signed-off-by: Jan Kara <jack@suse.cz>
As Ubuntu and Fedora release new version used kernel version equal to or
higher than v5.4, They started to support kernel exfat filesystem.
Linus reported a mount error with new version of exfat on Fedora:
exfat: Unknown parameter 'namecase'
This is because there is a difference in mount option between old
staging/exfat and new exfat. And utf8, debug, and codepage options as
well as namecase have been removed from new exfat.
This patch add the dummy mount options as deprecated option to be
backward compatible with old one.
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-----BEGIN PGP SIGNATURE-----
iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAl7BzV8eHHRvcnZhbGRz
QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGg8EH/A2pXMTxtc96RI4S
sttEsUQqbakFS0Z/2tQPpMGr/qW2e5eHgsTX/a3SiUeZiIXk6f4lMFkMuctzBf7p
X77cNEDwGOEdbtCXTsMcmKSde7sP2zCXsPB8xTWLyE6rnaFRgikwwkeqgkIKhp1h
bvOQV0t9HNGvxGAM0iZeOvQAvFl4vd7nS123/MYbir9cugfQUSJRueQ4BiCiJqVE
6cNA7/vFzDJuFGszzIrJ7HXn/IdQMMWHkvTDjgBw0GZw1mDbGFbfbZwOeTz1ojCt
smUQ4tIFxBa/VA5zx7dOy2P2keHbSVf4VLkZRPcceT7OqVS65ETmFDp+qt5NdWM5
vZ8+7/0=
=CyYH
-----END PGP SIGNATURE-----
Merge tag 'v5.7-rc6' into rdma.git for-next
Linux 5.7-rc6
Conflict in drivers/net/ethernet/mellanox/mlx5/core/steering/dr_send.c
resolved by deleting dr_cq_event, matching how netdev resolved it.
Required for dependencies in the following patches.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Recursion in kernel code is generally a bad idea as it can overflow
the kernel stack. Recursion in exec also hides that the code is
looping and that the loop changes bprm->file.
Instead of recursing in search_binary_handler have the methods that
would recurse set bprm->interpreter and return 0. Modify exec_binprm
to loop when bprm->interpreter is set. Consolidate all of the
reassignments of bprm->file in that loop to make it clear what is
going on.
The structure of the new loop in exec_binprm is that all errors return
immediately, while successful completion (ret == 0 &&
!bprm->interpreter) just breaks out of the loop and runs what
exec_bprm has always run upon successful completion.
Fail if the an interpreter is being call after execfd has been set.
The code has never properly handled an interpreter being called with
execfd being set and with reassignments of bprm->file and the
assignment of bprm->executable in generic code it has finally become
possible to test and fail when if this problematic condition happens.
With the reassignments of bprm->file and the assignment of
bprm->executable moved into the generic code add a test to see if
bprm->executable is being reassigned.
In search_binary_handler remove the test for !bprm->file. With all
reassignments of bprm->file moved to exec_binprm bprm->file can never
be NULL in search_binary_handler.
Link: https://lkml.kernel.org/r/87sgfwyd84.fsf_-_@x220.int.ebiederm.org
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Most of the support for passing the file descriptor of an executable
to an interpreter already lives in the generic code and in binfmt_elf.
Rework the fields in binfmt_elf that deal with executable file
descriptor passing to make executable file descriptor passing a first
class concept.
Move the fd_install from binfmt_misc into begin_new_exec after the new
creds have been installed. This means that accessing the file through
/proc/<pid>/fd/N is able to see the creds for the new executable
before allowing access to the new executables files.
Performing the install of the executables file descriptor after
the point of no return also means that nothing special needs to
be done on error. The exiting of the process will close all
of it's open files.
Move the would_dump from binfmt_misc into begin_new_exec right
after would_dump is called on the bprm->file. This makes it
obvious this case exists and that no nesting of bprm->file is
currently supported.
In binfmt_misc the movement of fd_install into generic code means
that it's special error exit path is no longer needed.
Link: https://lkml.kernel.org/r/87y2poyd91.fsf_-_@x220.int.ebiederm.org
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
The return code -ENOEXEC serves to tell search_binary_handler that it
should continue searching for the binfmt to handle a given file. This
makes return -ENOEXEC with a bprm->buf that is needed to continue the
search problematic.
The current binfmt_script manages to escape problems as it closes and
clears bprm->file before return -ENOEXEC with bprm->buf modified.
This prevents search_binary_handler from looping as it explicitly
handles a NULL bprm->file.
I plan on moving all of the bprm->file managment into fs/exec.c and out
of the binary handlers so this will become a problem.
Move closing bprm->file and the test for BINPRM_PATH_INACCESSIBLE
down below the last return of -ENOEXEC.
Introduce i_sep and i_end to track the end of the first argument and
the end of the parameters respectively. Using those, constification
of all char * pointers, and the helpers next_terminator and
next_non_spacetab guarantee the parameter parsing will not modify
bprm->buf.
Only modify bprm->buf to terminate the strings i_arg and i_name with
'\0' for passing to copy_strings_kernel.
When replacing loops with next_non_spacetab and next_terminator care
has been take that the logic of the parsing code (short of replacing
characters by '\0') remains the same.
Link: https://lkml.kernel.org/r/874ksczru6.fsf_-_@x220.int.ebiederm.org
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
The code in prepare_binary_handler needs to be run every time
search_binary_handler is called so move the call into search_binary_handler
itself to make the code simpler and easier to understand.
Link: https://lkml.kernel.org/r/87d070zrvx.fsf_-_@x220.int.ebiederm.org
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: James Morris <jamorris@linux.microsoft.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Add a flag preserve_creds that binfmt_misc can set to prevent
credentials from being updated. This allows binfmt_misc to always
call prepare_binprm. Allowing the credential computation logic to be
consolidated.
Not replacing the credentials with the interpreters credentials is
safe because because an open file descriptor to the executable is
passed to the interpreter. As the interpreter does not need to
reopen the executable it is guaranteed to see the same file that
exec sees.
Ref: c407c033de84 ("[PATCH] binfmt_misc: improve calculation of interpreter's credentials")
Link: https://lkml.kernel.org/r/87imgszrwo.fsf_-_@x220.int.ebiederm.org
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Rename bprm->cap_elevated to bprm->active_secureexec and initialize it
in prepare_binprm instead of in cap_bprm_set_creds. Initializing
bprm->active_secureexec in prepare_binprm allows multiple
implementations of security_bprm_repopulate_creds to play nicely with
each other.
Rename security_bprm_set_creds to security_bprm_reopulate_creds to
emphasize that this path recomputes part of bprm->cred. This
recomputation avoids the time of check vs time of use problems that
are inherent in unix #! interpreters.
In short two renames and a move in the location of initializing
bprm->active_secureexec.
Link: https://lkml.kernel.org/r/87o8qkzrxp.fsf_-_@x220.int.ebiederm.org
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Highlights of this series:
* Remove serialization of sending RPC/RDMA Replies
* Convert the TCP socket send path to use xdr_buf::bvecs (pre-requisite for
RPC-on-TLS)
* Fix svcrdma backchannel sendto return code
* Convert a number of dprintk call sites to use tracepoints
* Fix the "suggest braces around empty body in an 'else' statement" warning
fs/nfsd/nfsctl.c:256: warning: Function parameter or member 'file' not described in 'write_unlock_ip'
fs/nfsd/nfsctl.c:256: warning: Function parameter or member 'buf' not described in 'write_unlock_ip'
fs/nfsd/nfsctl.c:256: warning: Function parameter or member 'size' not described in 'write_unlock_ip'
fs/nfsd/nfsctl.c:295: warning: Function parameter or member 'file' not described in 'write_unlock_fs'
fs/nfsd/nfsctl.c:295: warning: Function parameter or member 'buf' not described in 'write_unlock_fs'
fs/nfsd/nfsctl.c:295: warning: Function parameter or member 'size' not described in 'write_unlock_fs'
fs/nfsd/nfsctl.c:352: warning: Function parameter or member 'file' not described in 'write_filehandle'
fs/nfsd/nfsctl.c:352: warning: Function parameter or member 'buf' not described in 'write_filehandle'
fs/nfsd/nfsctl.c:352: warning: Function parameter or member 'size' not described in 'write_filehandle'
fs/nfsd/nfsctl.c:434: warning: Function parameter or member 'file' not described in 'write_threads'
fs/nfsd/nfsctl.c:434: warning: Function parameter or member 'buf' not described in 'write_threads'
fs/nfsd/nfsctl.c:434: warning: Function parameter or member 'size' not described in 'write_threads'
fs/nfsd/nfsctl.c:478: warning: Function parameter or member 'file' not described in 'write_pool_threads'
fs/nfsd/nfsctl.c:478: warning: Function parameter or member 'buf' not described in 'write_pool_threads'
fs/nfsd/nfsctl.c:478: warning: Function parameter or member 'size' not described in 'write_pool_threads'
fs/nfsd/nfsctl.c:697: warning: Function parameter or member 'file' not described in 'write_versions'
fs/nfsd/nfsctl.c:697: warning: Function parameter or member 'buf' not described in 'write_versions'
fs/nfsd/nfsctl.c:697: warning: Function parameter or member 'size' not described in 'write_versions'
fs/nfsd/nfsctl.c:858: warning: Function parameter or member 'file' not described in 'write_ports'
fs/nfsd/nfsctl.c:858: warning: Function parameter or member 'buf' not described in 'write_ports'
fs/nfsd/nfsctl.c:858: warning: Function parameter or member 'size' not described in 'write_ports'
fs/nfsd/nfsctl.c:892: warning: Function parameter or member 'file' not described in 'write_maxblksize'
fs/nfsd/nfsctl.c:892: warning: Function parameter or member 'buf' not described in 'write_maxblksize'
fs/nfsd/nfsctl.c:892: warning: Function parameter or member 'size' not described in 'write_maxblksize'
fs/nfsd/nfsctl.c:941: warning: Function parameter or member 'file' not described in 'write_maxconn'
fs/nfsd/nfsctl.c:941: warning: Function parameter or member 'buf' not described in 'write_maxconn'
fs/nfsd/nfsctl.c:941: warning: Function parameter or member 'size' not described in 'write_maxconn'
fs/nfsd/nfsctl.c:1023: warning: Function parameter or member 'file' not described in 'write_leasetime'
fs/nfsd/nfsctl.c:1023: warning: Function parameter or member 'buf' not described in 'write_leasetime'
fs/nfsd/nfsctl.c:1023: warning: Function parameter or member 'size' not described in 'write_leasetime'
fs/nfsd/nfsctl.c:1039: warning: Function parameter or member 'file' not described in 'write_gracetime'
fs/nfsd/nfsctl.c:1039: warning: Function parameter or member 'buf' not described in 'write_gracetime'
fs/nfsd/nfsctl.c:1039: warning: Function parameter or member 'size' not described in 'write_gracetime'
fs/nfsd/nfsctl.c:1094: warning: Function parameter or member 'file' not described in 'write_recoverydir'
fs/nfsd/nfsctl.c:1094: warning: Function parameter or member 'buf' not described in 'write_recoverydir'
fs/nfsd/nfsctl.c:1094: warning: Function parameter or member 'size' not described in 'write_recoverydir'
fs/nfsd/nfsctl.c:1125: warning: Function parameter or member 'file' not described in 'write_v4_end_grace'
fs/nfsd/nfsctl.c:1125: warning: Function parameter or member 'buf' not described in 'write_v4_end_grace'
fs/nfsd/nfsctl.c:1125: warning: Function parameter or member 'size' not described in 'write_v4_end_grace'
fs/nfsd/nfs4proc.c:1164: warning: Function parameter or member 'nss' not described in 'nfsd4_interssc_connect'
fs/nfsd/nfs4proc.c:1164: warning: Function parameter or member 'rqstp' not described in 'nfsd4_interssc_connect'
fs/nfsd/nfs4proc.c:1164: warning: Function parameter or member 'mount' not described in 'nfsd4_interssc_connect'
fs/nfsd/nfs4proc.c:1262: warning: Function parameter or member 'rqstp' not described in 'nfsd4_setup_inter_ssc'
fs/nfsd/nfs4proc.c:1262: warning: Function parameter or member 'cstate' not described in 'nfsd4_setup_inter_ssc'
fs/nfsd/nfs4proc.c:1262: warning: Function parameter or member 'copy' not described in 'nfsd4_setup_inter_ssc'
fs/nfsd/nfs4proc.c:1262: warning: Function parameter or member 'mount' not described in 'nfsd4_setup_inter_ssc'
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Clean up: Fix gcc empty-body warning when -Wextra is used.
../fs/nfsd/nfs4state.c:3898:3: warning: suggest braces around empty body in an ‘else’ statement [-Wempty-body]
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Capture obvious events and replace dprintk() call sites. Introduce
infrastructure so that adding more tracepoints in this code later
is simplified.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Try to capture DRC failures.
Two additional clean-ups:
- Introduce Doxygen-style comments for the main entry points
- Remove a dprintk that fires for an allocation failure. This was
the only dprintk in the REPCACHE class.
Reported-by: kbuild test robot <lkp@intel.com>
[ cel: force typecast for display of checksum values ]
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
- Correctly set next cursor for detailed_erase_block_info debugfs file
- Don't use crypto_shash_descsize() for digest size in UBIFS
- Remove broken lazytime support from UBIFS
-----BEGIN PGP SIGNATURE-----
iQJKBAABCAA0FiEEdgfidid8lnn52cLTZvlZhesYu8EFAl7Fh08WHHJpY2hhcmRA
c2lnbWEtc3Rhci5hdAAKCRBm+VmF6xi7wW2WD/428LjXh+24Y3rekfnCRXG5w+es
yITAfhOmNuzn2vjS1UvCD0HsoBaS/LYbjuaceoyfXF9BG5mcrRTjFH7dVEEWFGDZ
YeRvBFkyt4xBEJtrY/6MW35KPRtnCp4Jau9HR9M5RCcQ5xzOeGtw0r/JMdZe56Av
zc2mLnZag1x5NyS4TvS30nCgj5pxVbO2bdAkyULJwBfPYs0C3TKeIul/4vjRi+57
PjyIUSR7CxpsOJde0tMjDvf23ewn1IUEW+YnewP1qk36ijRw1M6C90ERr4CU9BM5
YTEfjsxAheCItSf8r+BC70gaPBQPADtvHzPFqs9yNMSsLHYdOkkvqT8Bpwisj76d
1zL45DjZZ8UxC3HfSMFPl/dYDWvfddpffNwrimeltoAzzejI/Wk8AX0VqH1IQ3Z1
zDbz0ixP21ADATvrHUxr7UsoeEU9havGV+2sg+4wSU1aLtKIZUTjceizjkTN+9oB
ntHLuv6cS2iop22iSbJGClOv2TjpBlGQNwMDQ7TdD1a0QqxTSPRiguMmf/mDpQa/
MgQGAO6xS5NKRNiEbifniiCugLqpUQBHBPyn+q+4unmfK5sPzzLdpb3vpc0XNmbm
WgwfuMZdfmK0jO27P1/MRG6LUGxXKh5arsi6JrUJVIsdxzV3bdc2xBjkUFOOS/tH
W7fn4QS+WmbPVm09Jg==
=eCh7
-----END PGP SIGNATURE-----
Merge tag 'for-linus-5.7-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs
Pull UBI and UBIFS fixes from Richard Weinberger:
- Correctly set next cursor for detailed_erase_block_info debugfs file
- Don't use crypto_shash_descsize() for digest size in UBIFS
- Remove broken lazytime support from UBIFS
* tag 'for-linus-5.7-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs:
ubi: Fix seq_file usage in detailed_erase_block_info debugfs file
ubifs: fix wrong use of crypto_shash_descsize()
ubifs: remove broken lazytime support
Today security_bprm_set_creds has several implementations:
apparmor_bprm_set_creds, cap_bprm_set_creds, selinux_bprm_set_creds,
smack_bprm_set_creds, and tomoyo_bprm_set_creds.
Except for cap_bprm_set_creds they all test bprm->called_set_creds and
return immediately if it is true. The function cap_bprm_set_creds
ignores bprm->calld_sed_creds entirely.
Create a new LSM hook security_bprm_creds_for_exec that is called just
before prepare_binprm in __do_execve_file, resulting in a LSM hook
that is called exactly once for the entire of exec. Modify the bits
of security_bprm_set_creds that only want to be called once per exec
into security_bprm_creds_for_exec, leaving only cap_bprm_set_creds
behind.
Remove bprm->called_set_creds all of it's former users have been moved
to security_bprm_creds_for_exec.
Add or upate comments a appropriate to bring them up to date and
to reflect this change.
Link: https://lkml.kernel.org/r/87v9kszrzh.fsf_-_@x220.int.ebiederm.org
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Casey Schaufler <casey@schaufler-ca.com> # For the LSM and Smack bits
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCXsU6SgAKCRDh3BK/laaZ
PABRAP9MCZz/CLH2sEqHqH9KQHScNc4uf4bReiCU1hrLs7PbYwD/Y+vbRMffki7I
B/gt0Dg4kGxG5CV+ckeZK0+p2NWUUgQ=
=PPLW
-----END PGP SIGNATURE-----
Merge tag 'ovl-fixes-5.7-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs
Pull overlayfs fixes from Miklos Szeredi:
"Fix two bugs introduced in this cycle and one introduced in v5.5"
* tag 'ovl-fixes-5.7-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
ovl: potential crash in ovl_fid_to_fh()
ovl: clear ATTR_OPEN from attr->ia_valid
ovl: clear ATTR_FILE from attr->ia_valid
And replace the arcane return value convention with a simple bool
where true means success and false means failure.
[AV: braino fix folded in]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
All the op vectors are exactly the same, they are just used to encode
packet or nomerge behavior. There already is a flag for the packet
behavior, so just add a new one to allow for merging. Inverting it vs
the previous nomerge special casing actually allows for much nicer code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
No need for a local function pointer when we can trivial branch on the
->splice_write presence.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
No need for a local function pointer when we can trivial branch on the
->splice_read presence.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
When IORING_SETUP_SQPOLL is enabled, io_ring_ctx_wait_and_kill() will wait
for sq thread to idle by busy loop:
while (ctx->sqo_thread && !wq_has_sleeper(&ctx->sqo_wait))
cond_resched();
Above loop isn't very CPU friendly, it may introduce a short cpu burst on
the current cpu.
If ctx->refs is dying, we forbid sq_thread from submitting any further
SQEs. Instead they just get discarded when we exit.
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In io_sq_thread(), currently if we get an -EBUSY error and go to sleep,
we will won't clear it again, which will result in io_sq_thread() will
never have a chance to submit sqes again. Below test program test.c
can reveal this bug:
int main(int argc, char *argv[])
{
struct io_uring ring;
int i, fd, ret;
struct io_uring_sqe *sqe;
struct io_uring_cqe *cqe;
struct iovec *iovecs;
void *buf;
struct io_uring_params p;
if (argc < 2) {
printf("%s: file\n", argv[0]);
return 1;
}
memset(&p, 0, sizeof(p));
p.flags = IORING_SETUP_SQPOLL;
ret = io_uring_queue_init_params(4, &ring, &p);
if (ret < 0) {
fprintf(stderr, "queue_init: %s\n", strerror(-ret));
return 1;
}
fd = open(argv[1], O_RDONLY | O_DIRECT);
if (fd < 0) {
perror("open");
return 1;
}
iovecs = calloc(10, sizeof(struct iovec));
for (i = 0; i < 10; i++) {
if (posix_memalign(&buf, 4096, 4096))
return 1;
iovecs[i].iov_base = buf;
iovecs[i].iov_len = 4096;
}
ret = io_uring_register_files(&ring, &fd, 1);
if (ret < 0) {
fprintf(stderr, "%s: register %d\n", __FUNCTION__, ret);
return ret;
}
for (i = 0; i < 10; i++) {
sqe = io_uring_get_sqe(&ring);
if (!sqe)
break;
io_uring_prep_readv(sqe, 0, &iovecs[i], 1, 0);
sqe->flags |= IOSQE_FIXED_FILE;
ret = io_uring_submit(&ring);
sleep(1);
printf("submit %d\n", i);
}
for (i = 0; i < 10; i++) {
io_uring_wait_cqe(&ring, &cqe);
printf("receive: %d\n", i);
if (cqe->res != 4096) {
fprintf(stderr, "ret=%d, wanted 4096\n", cqe->res);
ret = 1;
}
io_uring_cqe_seen(&ring, cqe);
}
close(fd);
io_uring_queue_exit(&ring);
return 0;
}
sudo ./test testfile
above command will hang on the tenth request, to fix this bug, when io
sq_thread is waken up, we reset the variable 'ret' to be zero.
Suggested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
After the copy operation completes the cache is not up-to-date. Truncate
all pages in the interval that has successfully been copied.
Truncating completely copied dirty pages is okay, since the data has been
overwritten anyway. Truncating partially copied dirty pages is not okay;
add a comment for now.
Fixes: 88bc7d5097 ("fuse: add support for copy_file_range()")
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
a) Dirty cache needs to be written back not just in the writeback_cache
case, since the dirty pages may come from memory maps.
b) The fuse_writeback_range() helper takes an inclusive interval, so the
end position needs to be pos+len-1 instead of pos+len.
Fixes: 88bc7d5097 ("fuse: add support for copy_file_range()")
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
We normally disable any commands that aren't specifically poll commands
for a ring that is setup for polling, but we do allow buffer provide and
remove commands to support buffer selection for polled IO. Once a
request is issued, we add it to the poll list to poll for completion. But
we should not do that for non-IO commands, as those request complete
inline immediately and aren't pollable. If we do, we can leave requests
on the iopoll list after they are freed.
Fixes: ddf0322db7 ("io_uring: add IORING_OP_PROVIDE_BUFFERS")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull vfs fix from Al Viro:
"Stable fodder fix: copy_fdtable() would get screwed on 64bit boxen
with sysctl_nr_open raised to 512M or higher, which became possible
since 2.6.25.
Nobody sane would set the things up that way, but..."
* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fix multiplication overflow in copy_fdtable()
cpy and set really should be size_t; we won't get an overflow on that,
since sysctl_nr_open can't be set above ~(size_t)0 / sizeof(void *),
so nr that would've managed to overflow size_t on that multiplication
won't get anywhere near copy_fdtable() - we'll fail with EMFILE
before that.
Cc: stable@kernel.org # v2.6.25+
Fixes: 9cfe015aa4 (get rid of NR_OPEN and introduce a sysctl_nr_open)
Reported-by: Thiago Macieira <thiago.macieira@intel.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
kiocb.private is used in iomap_dio_rw() so store buf_index separately.
Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
Move 'buf_index' to a hole in io_kiocb.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add an extra validation of the len parameter, as for ext4 some files
might have smaller file size limits than others. This also means the
redundant size check in ext4_ioctl_get_es_cache can go away, as all
size checking is done in the shared fiemap handler.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20200505154324.3226743-3-hch@lst.de
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
ext4 supports max number of logical blocks in a file to be 0xffffffff.
(This is since ext4_extent's ee_block is __le32).
This means that EXT4_MAX_LOGICAL_BLOCK should be 0xfffffffe (starting
from 0 logical offset). This patch fixes this.
The issue was seen when ext4 moved to iomap_fiemap API and when
overlayfs was mounted on top of ext4. Since overlayfs was missing
filemap_check_ranges(), so it could pass a arbitrary huge length which
lead to overflow of map.m_len logic.
This patch fixes that.
Fixes: d3b6f23f71 ("ext4: move ext4_fiemap to use iomap framework")
Reported-by: syzbot+77fa5bdb65cc39711820@syzkaller.appspotmail.com
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20200505154324.3226743-2-hch@lst.de
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Move freeing the dynamically allocated attr and COW fork, as well
as zeroing the pointers where actually needed into the callers, and
just pass the xfs_ifork structure to xfs_idestroy_fork. Also simplify
the kmem_free calls by not checking for NULL first.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Both the data and attr fork have a format that is stored in the legacy
idinode. Move it into the xfs_ifork structure instead, where it uses
up padding.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
There are there are three extents counters per inode, one for each of
the forks. Two are in the legacy icdinode and one is directly in
struct xfs_inode. Switch to a single counter in the xfs_ifork structure
where it uses up padding at the end of the structure. This simplifies
various bits of code that just wants the number of extents counter and
can now directly dereference it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
xfs_ifree only need to free inline data in the data fork, as we've
already taken care of the attr fork before (and in fact freed the
fork structure). Just open code the freeing of the inline data.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Just checking di_forkoff directly is a little easier to follow.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
XFS_IFORK_Q is supposed to be a predicate, not a function returning a
value. Its usage is in xchk_bmap_check_rmaps is incorrect, but that
function only cares about whether or not the "size" of the data is zero
or not. Convert that logic to use a proper boolean, and teach the
caller to skip the call entirely if the end result would be that we'd do
nothing anyway. This avoids a crash later in this series.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
[hch: generalized the NULL ifor check]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Now that we fully verify the inode forks before they are added to the
inode cache, the crash reported in
https://bugzilla.kernel.org/show_bug.cgi?id=204031
can't happen anymore, as we'll never let an inode that has inconsistent
nextents counts vs the presence of an in-core attr fork leak into the
inactivate code path. So remove the work around to try to handle the
case, and just return an error and warn if the fork is not present.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
We don't call xfs_bmapi_read for the COW fork anymore, so remove the
special casing.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Call the data/attr local fork verifiers as soon as we are ready for them.
This keeps them close to the code setting up the forks, and avoids a
few branches later on. Also open code xfs_inode_verify_forks in the
only remaining caller.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The split between xfs_inode_verify_forks and the two helpers
implementing the actual functionality is a little strange. Reshuffle
it so that xfs_inode_verify_forks verifies if the data and attr forks
are actually in local format and only call the low-level helpers if
that is the case. Handle the actual error reporting in the low-level
handlers to streamline the caller.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
xfs_ifork_ops add up to two indirect calls per inode read and flush,
despite just having a single instance in the kernel. In xfsprogs
phase6 in xfs_repair overrides the verify_dir method to deal with inodes
that do not have a valid parent, but that can be fixed pretty easily
by ensuring they always have a valid looking parent.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
There is not much point in the xfs_iread function, as it has a single
caller and not a whole lot of code. Move it into the only caller,
and trim down the overdocumentation to just documenting the important
"why" instead of a lot of redundant "what".
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
i_delayed_blks is set to 0 in xfs_inode_alloc and can't have anything
assigned to it until the inode is visible to the VFS.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Keep the code dealing with the dinode together, and also ensure we verify
the dinode in the owner change log recovery case as well.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Handle inodes with a 0 di_mode in xfs_inode_from_disk, instead of partially
duplicating inode reading in xfs_iread.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
xfs_iformat_fork is a weird catchall. Split it into one helper for
the data fork and one for the attr fork, and then call both helper
as well as the COW fork initialization from xfs_inode_from_disk. Order
the COW fork initialization after the attr fork initialization given
that it can't fail to simplify the error handling.
Note that the newly split helpers are moved down the file in
xfs_inode_fork.c to avoid the need for forward declarations.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
We always need to fill out the fork structures when reading the inode,
so call xfs_iformat_fork from the tail of xfs_inode_from_disk.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The last argument to xfs_bmapi_raad contains XFS_BMAPI_* flags, not the
fork. Given that XFS_DATA_FORK evaluates to 0 no real harm is done,
but let's fix this anyway.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Fix this error message to complain about project and group quota flag
bits instead of "PUOTA" and "QUOTA".
Signed-off-by: Kaixu Xia <kaixuxia@tencent.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Since the old SWAPEXT ioctl doesn't know how to adjust quota ids,
bail out of the ids don't match and quotas are enabled.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
While QAing the new xfs_repair quotacheck code, I uncovered a quota
corruption bug resulting from a bad interaction between dquot buffer
initialization and quotacheck. The bug can be reproduced with the
following sequence:
# mkfs.xfs -f /dev/sdf
# mount /dev/sdf /opt -o usrquota
# su nobody -s /bin/bash -c 'touch /opt/barf'
# sync
# xfs_quota -x -c 'report -ahi' /opt
User quota on /opt (/dev/sdf)
Inodes
User ID Used Soft Hard Warn/Grace
---------- ---------------------------------
root 3 0 0 00 [------]
nobody 1 0 0 00 [------]
# xfs_io -x -c 'shutdown' /opt
# umount /opt
# mount /dev/sdf /opt -o usrquota
# touch /opt/man2
# xfs_quota -x -c 'report -ahi' /opt
User quota on /opt (/dev/sdf)
Inodes
User ID Used Soft Hard Warn/Grace
---------- ---------------------------------
root 1 0 0 00 [------]
nobody 1 0 0 00 [------]
# umount /opt
Notice how the initial quotacheck set the root dquot icount to 3
(rootino, rbmino, rsumino), but after shutdown -> remount -> recovery,
xfs_quota reports that the root dquot has only 1 icount. We haven't
deleted anything from the filesystem, which means that quota is now
under-counting. This behavior is not limited to icount or the root
dquot, but this is the shortest reproducer.
I traced the cause of this discrepancy to the way that we handle ondisk
dquot updates during quotacheck vs. regular fs activity. Normally, when
we allocate a disk block for a dquot, we log the buffer as a regular
(dquot) buffer. Subsequent updates to the dquots backed by that block
are done via separate dquot log item updates, which means that they
depend on the logged buffer update being written to disk before the
dquot items. Because individual dquots have their own LSN fields, that
initial dquot buffer must always be recovered.
However, the story changes for quotacheck, which can cause dquot block
allocations but persists the final dquot counter values via a delwri
list. Because recovery doesn't gate dquot buffer replay on an LSN, this
means that the initial dquot buffer can be replayed over the (newer)
contents that were delwritten at the end of quotacheck. In effect, this
re-initializes the dquot counters after they've been updated. If the
log does not contain any other dquot items to recover, the obsolete
dquot contents will not be corrected by log recovery.
Because quotacheck uses a transaction to log the setting of the CHKD
flags in the superblock, we skip quotacheck during the second mount
call, which allows the incorrect icount to remain.
Fix this by changing the ondisk dquot initialization function to use
ordered buffers to write out fresh dquot blocks if it detects that we're
running quotacheck. If the system goes down before quotacheck can
complete, the CHKD flags will not be set in the superblock and the next
mount will run quotacheck again, which can fix uninitialized dquot
buffers. This requires amending the defer code to maintaine ordered
buffer state across defer rolls for the sake of the dquot allocation
code.
For regular operations we preserve the current behavior since the dquot
items require properly initialized ondisk dquot records.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
The eMMC inline crypto standard will only specify 32 DUN bits (a.k.a. IV
bits), unlike UFS's 64. IV_INO_LBLK_64 is therefore not applicable, but
an encryption format which uses one key per policy and permits the
moving of encrypted file contents (as f2fs's garbage collector requires)
is still desirable.
To support such hardware, add a new encryption format IV_INO_LBLK_32
that makes the best use of the 32 bits: the IV is set to
'SipHash-2-4(inode_number) + file_logical_block_number mod 2^32', where
the SipHash key is derived from the fscrypt master key. We hash only
the inode number and not also the block number, because we need to
maintain contiguity of DUNs to merge bios.
Unlike with IV_INO_LBLK_64, with this format IV reuse is possible; this
is unavoidable given the size of the DUN. This means this format should
only be used where the requirements of the first paragraph apply.
However, the hash spreads out the IVs in the whole usable range, and the
use of a keyed hash makes it difficult for an attacker to determine
which files use which IVs.
Besides the above differences, this flag works like IV_INO_LBLK_64 in
that on ext4 it is only allowed if the stable_inodes feature has been
enabled to prevent inode numbers and the filesystem UUID from changing.
Link: https://lore.kernel.org/r/20200515204141.251098-1-ebiggers@kernel.org
Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Paul Crowley <paulcrowley@google.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
The attr fork can transition from shortform to leaf format while
empty if the first xattr doesn't fit in shortform. While this empty
leaf block state is intended to be transient, it is technically not
due to the transactional implementation of the xattr set operation.
We historically have a couple of bandaids to work around this
problem. The first is to hold the buffer after the format conversion
to prevent premature writeback of the empty leaf buffer and the
second is to bypass the xattr count check in the verifier during
recovery. The latter assumes that the xattr set is also in the log
and will be recovered into the buffer soon after the empty leaf
buffer is reconstructed. This is not guaranteed, however.
If the filesystem crashes after the format conversion but before the
xattr set that induced it, only the format conversion may exist in
the log. When recovered, this creates a latent corrupted state on
the inode as any subsequent attempts to read the buffer fail due to
verifier failure. This includes further attempts to set xattrs on
the inode or attempts to destroy the attr fork, which prevents the
inode from ever being removed from the unlinked list.
To avoid this condition, accept that an empty attr leaf block is a
valid state and remove the count check from the verifier. This means
that on rare occasions an attr fork might exist in an unexpected
state, but is otherwise consistent and functional. Note that we
retain the logic to avoid racing with metadata writeback to reduce
the window where this can occur.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Add handling for loss of notifications by having read() insert a
loss-notification message after it has read the pipe buffer that was last
in the ring when the loss occurred.
Lossage can come about either by running out of notification descriptors or
by running out of space in the pipe ring.
Signed-off-by: David Howells <dhowells@redhat.com>
Allow a buffer to be marked such that read() must return the entire buffer
in one go or return ENOBUFS. Multiple buffers can be amalgamated into a
single read, but a short read will occur if the next "whole" buffer won't
fit.
This is useful for watch queue notifications to make sure we don't split a
notification across multiple reads, especially given that we need to
fabricate an overrun record under some circumstances - and that isn't in
the buffers.
Signed-off-by: David Howells <dhowells@redhat.com>
Make it possible to have a general notification queue built on top of a
standard pipe. Notifications are 'spliced' into the pipe and then read
out. splice(), vmsplice() and sendfile() are forbidden on pipes used for
notifications as post_one_notification() cannot take pipe->mutex. This
means that notifications could be posted in between individual pipe
buffers, making iov_iter_revert() difficult to effect.
The way the notification queue is used is:
(1) An application opens a pipe with a special flag and indicates the
number of messages it wishes to be able to queue at once (this can
only be set once):
pipe2(fds, O_NOTIFICATION_PIPE);
ioctl(fds[0], IOC_WATCH_QUEUE_SET_SIZE, queue_depth);
(2) The application then uses poll() and read() as normal to extract data
from the pipe. read() will return multiple notifications if the
buffer is big enough, but it will not split a notification across
buffers - rather it will return a short read or EMSGSIZE.
Notification messages include a length in the header so that the
caller can split them up.
Each message has a header that describes it:
struct watch_notification {
__u32 type:24;
__u32 subtype:8;
__u32 info;
};
The type indicates the source (eg. mount tree changes, superblock events,
keyring changes, block layer events) and the subtype indicates the event
type (eg. mount, unmount; EIO, EDQUOT; link, unlink). The info field
indicates a number of things, including the entry length, an ID assigned to
a watchpoint contributing to this buffer and type-specific flags.
Supplementary data, such as the key ID that generated an event, can be
attached in additional slots. The maximum message size is 127 bytes.
Messages may not be padded or aligned, so there is no guarantee, for
example, that the notification type will be on a 4-byte bounary.
Signed-off-by: David Howells <dhowells@redhat.com>
A GETATTR request can race with FUSE_NOTIFY_INVAL_INODE, resulting in the
attribute cache being updated with stale information after the
invalidation.
Fix this by bumping the attribute version in fuse_reverse_inval_inode().
Reported-by: Krzysztof Rusek <rusek@9livesdata.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
page_count() is unstable. Unless there has been an RCU grace period
between when the page was removed from the page cache and now, a
speculative reference may exist from the page cache.
Reported-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
When PageWaiters was added, updating this check was missed.
Reported-by: Nikolaus Rath <Nikolaus@rath.org>
Reported-by: Hugh Dickins <hughd@google.com>
Fixes: 6290602709 ("mm: add PageWaiters indicating tasks are waiting for a page bit")
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Instead of custom page dumping, use the standard helper.
Reported-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
fuse_fill_super_common() allocates and installs one fuse_device. Hence
virtiofs allocates and install all fuse devices by itself except one.
This makes logic little twisted. There does not seem to be any real need
that why virtiofs can't allocate and install all fuse devices itself.
So opt out of fuse device allocation and installation while calling
fuse_fill_super_common().
Regular fuse still wants fuse_fill_super_common() to install fuse_device.
It needs to prevent against races where two mounters are trying to mount
fuse using same fd. In that case one will succeed while other will get
-EINVAL.
virtiofs does not have this issue because sget_fc() resolves the race
w.r.t multiple mounters and only one instance of virtio_fs_fill_super()
should be in progress for same filesystem.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fuse mounts without "allow_other" are off-limits to all non-owners. Yet it
makes sense to allow querying st_dev on the root, since this value is
provided by the kernel, not the userspace filesystem.
Allow statx(2) with a zero request mask to succeed on a fuse mounts for all
users.
Reported-by: Nikolaus Rath <Nikolaus@rath.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
We want cached data to synced with the userspace filesystem on close(), for
example to allow getting correct st_blocks value. Do this regardless of
whether the userspace filesystem implements a FLUSH method or not.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Under writeback mode, inode->i_blocks is not updated, making utils du
read st.blocks as 0.
For example, when using virtiofs (cache=always & nondax mode) with
writeback_cache enabled, writing a new file and check its disk usage
with du, du reports 0 usage.
# uname -r
5.6.0-rc6+
# mount -t virtiofs virtiofs /mnt/virtiofs
# rm -f /mnt/virtiofs/testfile
# create new file and do extend write
# xfs_io -fc "pwrite 0 4k" /mnt/virtiofs/testfile
wrote 4096/4096 bytes at offset 0
4 KiB, 1 ops; 0.0001 sec (28.103 MiB/sec and 7194.2446 ops/sec)
# du -k /mnt/virtiofs/testfile
0 <==== disk usage is 0
# stat -c %s,%b /mnt/virtiofs/testfile
4096,0 <==== i_size is correct, but st_blocks is 0
Fix it by invalidating attr in fuse_flush(), so we get up-to-date attr
from server on next getattr.
Signed-off-by: Eryu Guan <eguan@linux.alibaba.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
syzbot found that
touch /proc/testfile
causes NULL pointer dereference at tomoyo_get_local_path()
because inode of the dentry is NULL.
Before c59f415a7c, Tomoyo received pid_ns from proc's s_fs_info
directly. Since proc_pid_ns() can only work with inode, using it in
the tomoyo_get_local_path() was wrong.
To avoid creating more functions for getting proc_ns, change the
argument type of the proc_pid_ns() function. Then, Tomoyo can use
the existing super_block to get pid_ns.
Link: https://lkml.kernel.org/r/0000000000002f0c7505a5b0e04c@google.com
Link: https://lkml.kernel.org/r/20200518180738.2939611-1-gladkov.alexey@gmail.com
Reported-by: syzbot+c1af344512918c61362c@syzkaller.appspotmail.com
Fixes: c59f415a7c ("Use proc_pid_ns() to get pid_namespace from the proc superblock")
Signed-off-by: Alexey Gladkov <gladkov.alexey@gmail.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Since v1 encryption policies are deprecated, make test_dummy_encryption
test v2 policies by default.
Note that this causes ext4/023 and ext4/028 to start failing due to
known bugs in those tests (see previous commit).
Link: https://lore.kernel.org/r/20200512233251.118314-5-ebiggers@kernel.org
Reviewed-by: Jaegeuk Kim <jaegeuk@kernel.org>
Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Eric Biggers <ebiggers@google.com>
v1 encryption policies are deprecated in favor of v2, and some new
features (e.g. encryption+casefolding) are only being added for v2.
Therefore, the "test_dummy_encryption" mount option (which is used for
encryption I/O testing with xfstests) needs to support v2 policies.
To do this, extend its syntax to be "test_dummy_encryption=v1" or
"test_dummy_encryption=v2". The existing "test_dummy_encryption" (no
argument) also continues to be accepted, to specify the default setting
-- currently v1, but the next patch changes it to v2.
To cleanly support both v1 and v2 while also making it easy to support
specifying other encryption settings in the future (say, accepting
"$contents_mode:$filenames_mode:v2"), make ext4 and f2fs maintain a
pointer to the dummy fscrypt_context rather than using mount flags.
To avoid concurrency issues, don't allow test_dummy_encryption to be set
or changed during a remount. (The former restriction is new, but
xfstests doesn't run into it, so no one should notice.)
Tested with 'gce-xfstests -c {ext4,f2fs}/encrypt -g auto'. On ext4,
there are two regressions, both of which are test bugs: ext4/023 and
ext4/028 fail because they set an xattr and expect it to be stored
inline, but the increase in size of the fscrypt_context from
24 to 40 bytes causes this xattr to be spilled into an external block.
Link: https://lore.kernel.org/r/20200512233251.118314-4-ebiggers@kernel.org
Acked-by: Jaegeuk Kim <jaegeuk@kernel.org>
Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Eric Biggers <ebiggers@google.com>
When parsing the mount option, we don't have sbi->user_block_count.
Should do it after getting it.
Cc: <stable@vger.kernel.org>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
- Fix potential memory leak in exfat_find.
- Set exfat's splice_write to iter_file_splice_write to fix the splice
failure on direct-opened file
-----BEGIN PGP SIGNATURE-----
iQJMBAABCgA2FiEE6NzKS6Uv/XAAGHgyZwv7A1FEIQgFAl7CCAkYHG5hbWphZS5q
ZW9uQHNhbXN1bmcuY29tAAoJEGcL+wNRRCEIX3AQAM7cV9GZecl6YfQu5AIeFbHT
uvSnvuW5O5JS9qdra4knSTthHYJ8eUucjcPlxUtHhs4oznm+erjZc9A0tRwDQyjy
EjoZZGEBOphWFLCY28K9LdJZD89JhNh9v5XUD9dId3XFnznaRjvZRHlbCVzqAWG1
DUcRedNEderpkg0FySEBIx6EHhKX6+YgkKOWlGG8r8bqdRrgZbjyAyduRdKlyX31
7XIeS4qFMDWLrqcbJdmL9pljx4VH2MswNIXK6kA2pydMwItGhod2yRWzFMYPeTDm
fTRDKzHvfA3J30h3wMI5FJu/ikfuVqsmp8i5rND7v/eRP13uuxZCSI2MfnUzHEj2
ciWxGfr5kFGg/1eAjNtOy3AnS5wsaEQ0ixYFGgKb8ENvToyT4cHa+9X2y0PrVnRu
bOyqJTBwlSisqp3DiK8aAhklHHbX1/CheGOLMj1B48H42eREUHFn/yPYroOb+Ot/
CiRH4feACSCMRGn8HdlgnguOs4zwZIWtLQWpfqhu4CJSNFa3IW6PSl53U1vPzuXG
v2Cdxn6D1gCqxsFbSmzmMJVkNfILrY7sLSU9lqrXWCQ4T6I8FpBxIvU8CCi1boQD
7hpdXstL/0xhb/gTFQL2uJ2MasQdSzVQgl6dmGK5riJkqwgaWz4FDro+IF3JxdQT
qtUZ5nd6e33pl6PwK3nt
=JN5f
-----END PGP SIGNATURE-----
Merge tag 'for-5.7-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat
Pull exfat fixes from Namjae Jeon:
- Fix potential memory leak in exfat_find
- Set exfat's splice_write to iter_file_splice_write to fix a splice
failure on direct-opened files
* tag 'for-5.7-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat:
exfat: fix possible memory leak in exfat_find()
exfat: use iter_file_splice_write
Don't call req->page_done() on each page as we finish filling it with
the data coming from the network. Whilst this might speed up the
application a bit, it's a problem if there's a network failure and the
operation has to be reissued.
If this happens, an oops occurs because afs_readpages_page_done() clears
the pointer to each page it unlocks and when a retry happens, the
pointers to the pages it wants to fill are now NULL (and the pages have
been unlocked anyway).
Instead, wait till the operation completes successfully and only then
release all the pages after clearing any terminal gap (the server can
give us less data than we requested as we're allowed to ask for more
than is available).
KASAN produces a bug like the following, and even without KASAN, it can
oops and panic.
BUG: KASAN: wild-memory-access in _copy_to_iter+0x323/0x5f4
Write of size 1404 at addr 0005088000000000 by task md5sum/5235
CPU: 0 PID: 5235 Comm: md5sum Not tainted 5.7.0-rc3-fscache+ #250
Hardware name: ASUS All Series/H97-PLUS, BIOS 2306 10/09/2014
Call Trace:
memcpy+0x39/0x58
_copy_to_iter+0x323/0x5f4
__skb_datagram_iter+0x89/0x2a6
skb_copy_datagram_iter+0x129/0x135
rxrpc_recvmsg_data.isra.0+0x615/0xd42
rxrpc_kernel_recv_data+0x1e9/0x3ae
afs_extract_data+0x139/0x33a
yfs_deliver_fs_fetch_data64+0x47a/0x91b
afs_deliver_to_call+0x304/0x709
afs_wait_for_call_to_complete+0x1cc/0x4ad
yfs_fs_fetch_data+0x279/0x288
afs_fetch_data+0x1e1/0x38d
afs_readpages+0x593/0x72e
read_pages+0xf5/0x21e
__do_page_cache_readahead+0x128/0x23f
ondemand_readahead+0x36e/0x37f
generic_file_buffered_read+0x234/0x680
new_sync_read+0x109/0x17e
vfs_read+0xe6/0x138
ksys_read+0xd8/0x14d
do_syscall_64+0x6e/0x8a
entry_SYSCALL_64_after_hwframe+0x49/0xb3
Fixes: 196ee9cd2d ("afs: Make afs_fs_fetch_data() take a list of pages")
Fixes: 30062bd13e ("afs: Implement YFS support in the fs client")
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We currently move it to the io_wqe_manager for execution, but we cannot
safely do so as we may lack some of the state to execute it out of
context. As we cancel work anyway when the ring/task exits, just mark
this request as canceled and io_async_task_func() will do the right
thing.
Fixes: aa96bf8a9e ("io_uring: use io-wq manager as backup task if task is exiting")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The change to exec is relevant to the cleanup work I have been doing.
Merge it here so that I can build on top of it, and so hopefully
that other merge logic can pick up on this and see how to deal
with the conflict between that change and my exec cleanup work.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
'es' is malloced from exfat_get_dentry_set() in exfat_find() and should
be freed before leaving from the error handling cases, otherwise it will
cause memory leak.
Fixes: 5f2aa07507 ("exfat: add inode operations")
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Doing copy_file_range() on exfat with a file opened for direct IO leads
to an -EFAULT:
# xfs_io -f -d -c "truncate 32768" \
-c "copy_range -d 16384 -l 16384 -f 0" /mnt/test/junk
copy_range: Bad address
and the reason seems to be that we go through:
default_file_splice_write
splice_from_pipe
__splice_from_pipe
write_pipe_buf
__kernel_write
new_sync_write
generic_file_write_iter
generic_file_direct_write
exfat_direct_IO
do_blockdev_direct_IO
iov_iter_get_pages
and land in iterate_all_kinds(), which does "return -EFAULT" for our kvec
iter.
Setting exfat's splice_write to iter_file_splice_write fixes this and lets
fsx (which originally detected the problem) run to success from
the xfstests harness.
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
If the request is still hashed in io_async_task_func(), then it cannot
have been canceled and it's pointless to check. So save that check.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Function is going to be used in transport over RDMA module in subsequent
patches, so export it to GPL modules.
Link: https://lore.kernel.org/r/20200511135131.27580-2-danil.kipnis@cloud.ionos.com
Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: linux-kernel@vger.kernel.org
[jwang: extend the commit message]
Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
crypto_shash_descsize() returns the size of the shash_desc context
needed to compute the hash, not the size of the hash itself.
crypto_shash_digestsize() would be correct, or alternatively using
c->hash_len and c->hmac_desc_len which already store the correct values.
But actually it's simpler to just use stack arrays, so do that instead.
Fixes: 49525e5eec ("ubifs: Add helper functions for authentication support")
Fixes: da8ef65f95 ("ubifs: Authenticate replayed journal")
Cc: <stable@vger.kernel.org> # v4.20+
Cc: Sascha Hauer <s.hauer@pengutronix.de>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Acked-by: Sascha Hauer <s.hauer@pengutronix.de>
Signed-off-by: Richard Weinberger <richard@nod.at>
We checked for 'force_nonblock' higher up, so it's definitely false
at this point. Kill the check, it's a remnant of when we tried to do
inline splice without always punting to async context.
Fixes: 2fb3e82284 ("io_uring: punt splice async because of inode mutex")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add IORING_OP_TEE implementing tee(2) support. Almost identical to
splice bits, but without offsets.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
req->flags stores all sqe->flags. After checking that sqe->flags are
valid set if IOSQE* flags, no need to double check it, just forward them
all.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_file_put() deals with flushing state's file refs, adding "state" to
its name makes it a bit clearer. Also, avoid double check of
state->file in __io_file_get() in some cases.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
A submission is "async" IIF it's done by SQPOLL thread. Instead of
passing @async flag into io_submit_sqes(), deduce it from ctx->flags.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We only need apoll in the one section, do the juggling with the work
restoration there. This removes a special case further down as well.
No functional changes in this patch.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull execve fix from Eric Biederman:
"While working on my exec cleanups I found a bug in exec that I
introduced by accident a couple of years ago. I apparently missed the
fact that bprm->file can change.
Now I have a very personal motive to clean up exec and make it more
approachable.
The change is just moving woud_dump to where it acts on the final
bprm->file not the initial bprm->file. I have been careful and tested
and verify this fix works"
* 'exec-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
exec: Move would_dump into flush_old_exec
I goofed when I added mm->user_ns support to would_dump. I missed the
fact that in the case of binfmt_loader, binfmt_em86, binfmt_misc, and
binfmt_script bprm->file is reassigned. Which made the move of
would_dump from setup_new_exec to __do_execve_file before exec_binprm
incorrect as it can result in would_dump running on the script instead
of the interpreter of the script.
The net result is that the code stopped making unreadable interpreters
undumpable. Which allows them to be ptraced and written to disk
without special permissions. Oops.
The move was necessary because the call in set_new_exec was after
bprm->mm was no longer valid.
To correct this mistake move the misplaced would_dump from
__do_execve_file into flos_old_exec, before exec_mmap is called.
I tested and confirmed that without this fix I can attach with gdb to
a script with an unreadable interpreter, and with this fix I can not.
Cc: stable@vger.kernel.org
Fixes: f84df2a6f2 ("exec: Ensure mm->user_ns contains the execed files")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
As for other not inlined requests, alloc req->io for FORCE_ASYNC reqs,
so they can be prepared properly.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If req->io is not NULL, it's already prepared. Don't do it again,
it's dangerous.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl6/cDAQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpo1DEADMUm23Ha/X5YSXujM55hzV6MKQQ+BjNwl1
WIU0qdDkzmoiqSpcengjHPo/2CZvjlavXGaLDWKqfz1pojld9iBhCVgFaOm03Aiq
yKbG5gu5UcKNGtlE7cObhkaX8WFDtyZBuoyx+MB6F2eOdO2sbuOEV4o4bRB0w6Gt
hn91zHtQRS5WPBxxLGbj8xDOWrrhbiHhfM2QkUX9kd10D9UcOdHsdPt2hPs2L6OF
sKkjwcVEd8oR4A5l1pKEwrbm1+7Gn1H0fR8HuBtaZ3ZU65ugaPB6cj6Brn8GDBuT
JZ4ThKPqb/6jYVpxWVKCMYGLydSDZYxsQWJP7FgICqeqQgb6q+wHYScgJTPx8v4/
rJgfgWSPbxVf/kKhjFj2JYo8QDcR5R7IUlO4TxcRaIockPdRdc4kTc7EFtzodUoN
/5wWAuVuIRBicu+KBshrB3RNTT7NCcYfSom+eqMPIvWTvw1vIpK/5ULV6WZjqFbV
bPt3mUjEYV3AfQ9l39pkoclQ3YWWq718l1iuyHdD+1u/41D85UTfGHa/m9NibO5g
TP6Jj3FWLeD0UCpUiTEuUfYfdXxnP7R5SNnF+F15fucuHtRILh7rhTCqGbG4crLw
0a1fOx0wmf1JUxQmQtJLQTQ4LdUfwqBbSekKrviClc8HizvFroQyxaFMIjgRXn8X
F37nPAzswA==
=kqDx
-----END PGP SIGNATURE-----
Merge tag 'io_uring-5.7-2020-05-15' of git://git.kernel.dk/linux-block
Pull io_uring fixes from Jens Axboe:
"Two small fixes that should go into this release:
- Check and handle zero length splice (Pavel)
- Fix a regression in this merge window for fixed files used with
polled block IO"
* tag 'io_uring-5.7-2020-05-15' of git://git.kernel.dk/linux-block:
io_uring: polled fixed file must go through free iteration
io_uring: fix zero len do_splice()
Highlights include:
Stable fixes:
- nfs: fix NULL deference in nfs4_get_valid_delegation
Bugfixes:
- Fix corruption of the return value in cachefiles_read_or_alloc_pages()
- Fix several fscache cookie issues
- Fix a fscache queuing race that can trigger a BUG_ON
- NFS: Fix 2 use-after-free regressions due to the RPC_TASK_CRED_NOREF flag
- SUNRPC: Fix a use-after-free regression in rpc_free_client_work()
- SUNRPC: Fix a race when tearing down the rpc client debugfs directory
- SUNRPC: Signalled ASYNC tasks need to exit
- NFSv3: fix rpc receive buffer size for MOUNT call
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEESQctxSBg8JpV8KqEZwvnipYKAPIFAl6/AkgACgkQZwvnipYK
APICLQ/9ENY+mQmVMSbtw2VGlBphS+GLN44k70NgmjAaI9n8f3ILZyLl0/iHEPFU
ZaCWhPWEs76olLfCWwoQFzISIUTHWVUow8mJn7gTh4mq8n9UFv8zgUxtJ3HTfEHt
rtujG9xsK0Oa351UPJWE5yC7PvFzMohcc1vVD8WQeGJQ3sMBVHTuOuPatIv3vK6s
8MflKTbtz/gUkcbWMCk9ljMPXr6/Ksgu9GZDnDFAYZBfFkwx//RNmq6K+z1Ru15s
tkmPPZGNMCfKblrnUXUmPt78wxSExmWrXroSMNas2fyeOXgPL0ogNx8vfdFcFsxs
sHpMkF97+npntNj1y5om6GrdU4SRYUFqXT7pNqV4wuGguOfFELIXrJIQuCDaKrGD
ApEoo9UDisGCLdqs738ascZFHZiTQoy6drbpR8moalqhYkTI7Al/pPxHnozGDYsJ
+wElaFXZX2hlPc6ih1q54RcB+D4qswDC9QudArKc9hJEKPv+SsmiVhBG/f+X+Jca
M19UJGWZvRtY8L+0yJdG22O9Hwo0zSK917gtOZNgkwtkkKgzjj2kcNHTK2jGBEuR
pqEIQCreUH8Le0WR9cPeJeYc2/HeCEWDHrf/q+gFRClaZMe+0Sfu3z3pBH3v0T9q
qoZ1VvCDUhggsZyJZebcPTCL+ghqOXFJajHLlcA6BSdGJ/iXq2M=
=lqFu
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-5.7-5' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client bugfixes from Trond Myklebust:
"Highlights include:
Stable fixes:
- nfs: fix NULL deference in nfs4_get_valid_delegation
Bugfixes:
- Fix corruption of the return value in cachefiles_read_or_alloc_pages()
- Fix several fscache cookie issues
- Fix a fscache queuing race that can trigger a BUG_ON
- NFS: Fix two use-after-free regressions due to the RPC_TASK_CRED_NOREF flag
- SUNRPC: Fix a use-after-free regression in rpc_free_client_work()
- SUNRPC: Fix a race when tearing down the rpc client debugfs directory
- SUNRPC: Signalled ASYNC tasks need to exit
- NFSv3: fix rpc receive buffer size for MOUNT call"
* tag 'nfs-for-5.7-5' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
NFSv3: fix rpc receive buffer size for MOUNT call
SUNRPC: 'Directory with parent 'rpc_clnt' already present!'
NFS/pnfs: Don't use RPC_TASK_CRED_NOREF with pnfs
NFS: Don't use RPC_TASK_CRED_NOREF with delegreturn
SUNRPC: Signalled ASYNC tasks need to exit
nfs: fix NULL deference in nfs4_get_valid_delegation
SUNRPC: fix use-after-free in rpc_free_client_work()
cachefiles: Fix race between read_waiter and read_copier involving op->to_do
NFSv4: Fix fscache cookie aux_data to ensure change_attr is included
NFS: Fix fscache super_cookie allocation
NFS: Fix fscache super_cookie index_key from changing after umount
cachefiles: Fix corruption of the return value in cachefiles_read_or_alloc_pages()
Currently, the test_dummy_encryption mount option (which is used for
encryption I/O testing with xfstests) uses v1 encryption policies, and
it relies on userspace inserting a test key into the session keyring.
We need test_dummy_encryption to support v2 encryption policies too.
Requiring userspace to add the test key doesn't work well with v2
policies, since v2 policies only support the filesystem keyring (not the
session keyring), and keys in the filesystem keyring are lost when the
filesystem is unmounted. Hooking all test code that unmounts and
re-mounts the filesystem would be difficult.
Instead, let's make the filesystem automatically add the test key to its
keyring when test_dummy_encryption is enabled.
That puts the responsibility for choosing the test key on the kernel.
We could just hard-code a key. But out of paranoia, let's first try
using a per-boot random key, to prevent this code from being misused.
A per-boot key will work as long as no one expects dummy-encrypted files
to remain accessible after a reboot. (gce-xfstests doesn't.)
Therefore, this patch adds a function fscrypt_add_test_dummy_key() which
implements the above. The next patch will use it.
Link: https://lore.kernel.org/r/20200512233251.118314-3-ebiggers@kernel.org
Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Move the bpf verifier trace check into the new switch statement in
HEAD.
Resolve the overlapping changes in hinic, where bug fixes overlap
the addition of VF support.
Signed-off-by: David S. Miller <davem@davemloft.net>
There's no point in using list_del_init() on entries that are going
away, and the associated lock is always used in process context so
let's not use the IRQ disabling+saving variant of the spinlock.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This new flag should be set/clear from the application to
disable/enable eventfd notifications when a request is completed
and queued to the CQ ring.
Before this patch, notifications were always sent if an eventfd is
registered, so IORING_CQ_EVENTFD_DISABLED is not set during the
initialization.
It will be up to the application to set the flag after initialization
if no notifications are required at the beginning.
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This patch adds the new 'cq_flags' field that should be written by
the application and read by the kernel.
This new field is available to the userspace application through
'cq_off.flags'.
We are using 4-bytes previously reserved and set to zero. This means
that if the application finds this field to zero, then the new
functionality is not supported.
In the next patch we will introduce the first flag available.
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Some file descriptors use separate waitqueues for their f_ops->poll()
handler, most commonly one for read and one for write. The io_uring
poll implementation doesn't work with that, as the 2nd poll_wait()
call will cause the io_uring poll request to -EINVAL.
This affects (at least) tty devices and /dev/random as well. This is a
big problem for event loops where some file descriptors work, and others
don't.
With this fix, io_uring handles multiple waitqueues.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We currently embed and queue a work item per fixed_file_ref_node that
we update, but if the workload does a lot of these, then the associated
kworker-events overhead can become quite noticeable.
Since we rarely need to wait on these, batch them at 1 second intervals
instead. If we do need to wait for them, we just flush the pending
delayed work.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This change adds accounting for the memory allocated for shadow stacks.
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Will Deacon <will@kernel.org>
Signed-off-by: Will Deacon <will@kernel.org>
Alexei Starovoitov says:
====================
pull-request: bpf-next 2020-05-14
The following pull-request contains BPF updates for your *net-next* tree.
The main changes are:
1) Merged tag 'perf-for-bpf-2020-05-06' from tip tree that includes CAP_PERFMON.
2) support for narrow loads in bpf_sock_addr progs and additional
helpers in cg-skb progs, from Andrey.
3) bpf benchmark runner, from Andrii.
4) arm and riscv JIT optimizations, from Luke.
5) bpf iterator infrastructure, from Yonghong.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
We used to have three completions, now we just have two. With the two,
let's not allocate them dynamically, just embed then in the ctx and
name them appropriately.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Failed async writes that are requeued may not clean up a refcount
on the file, which can result in a leaked open. This scenario arises
very reliably when using persistent handles and a reconnect occurs
while writing.
cifs_writev_requeue only releases the reference if the write fails
(rc != 0). The server->ops->async_writev operation will take its own
reference, so the initial reference can always be released.
Signed-off-by: Adam McCoy <adam@forsedomani.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
CC: Stable <stable@vger.kernel.org>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Prior to commit e3d3ab64dd66 ("SUNRPC: Use au_rslack when
computing reply buffer size"), there was enough slack in the reply
buffer to commodate filehandles of size 60bytes. However, the real
problem was that the reply buffer size for the MOUNT operation was
not correctly calculated. Received buffer size used the filehandle
size for NFSv2 (32bytes) which is much smaller than the allowed
filehandle size for the v3 mounts.
Fix the reply buffer size (decode arguments size) for the MNT command.
Fixes: 2c94b8eca1 ("SUNRPC: Use au_rslack when computing reply buffer size")
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
There is a possible race when ep_scan_ready_list() leaves ->rdllist and
->obflist empty for a short period of time although some events are
pending. It is quite likely that ep_events_available() observes empty
lists and goes to sleep.
Since commit 339ddb53d3 ("fs/epoll: remove unnecessary wakeups of
nested epoll") we are conservative in wakeups (there is only one place
for wakeup and this is ep_poll_callback()), thus ep_events_available()
must always observe correct state of two lists.
The easiest and correct way is to do the final check under the lock.
This does not impact the performance, since lock is taken anyway for
adding a wait entry to the wait queue.
The discussion of the problem can be found here:
https://lore.kernel.org/linux-fsdevel/a2f22c3c-c25a-4bda-8339-a7bdaf17849e@akamai.com/
In this patch barrierless __set_current_state() is used. This is safe
since waitqueue_active() is called under the same lock on wakeup side.
Short-circuit for fatal signals (i.e. fatal_signal_pending() check) is
moved to the line just before actual events harvesting routine. This is
fully compliant to what is said in the comment of the patch where the
actual fatal_signal_pending() check was added: c257a340ed ("fs, epoll:
short circuit fetching events if thread has been killed").
Fixes: 339ddb53d3 ("fs/epoll: remove unnecessary wakeups of nested epoll")
Reported-by: Jason Baron <jbaron@akamai.com>
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jason Baron <jbaron@akamai.com>
Cc: Khazhismel Kumykov <khazhy@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/20200505145609.1865152-1-rpenyaev@suse.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
POSIX defines faccessat() as having a fourth "flags" argument, while the
linux syscall doesn't have it. Glibc tries to emulate AT_EACCESS and
AT_SYMLINK_NOFOLLOW, but AT_EACCESS emulation is broken.
Add a new faccessat(2) syscall with the added flags argument and implement
both flags.
The value of AT_EACCESS is defined in glibc headers to be the same as
AT_REMOVEDIR. Use this value for the kernel interface as well, together
with the explanatory comment.
Also add AT_EMPTY_PATH support, which is not documented by POSIX, but can
be useful and is trivial to implement.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Parsing "silent" and clearing SB_SILENT makes zero sense.
Parsing "silent" and setting SB_SILENT would make a bit more sense, but
apparently nobody cares.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Unlike the others, this is _not_ a standard option accepted by mount(8).
In fact SB_POSIXACL is an internal flag, and accepting MS_POSIXACL on the
mount(2) interface is possibly a bug.
The only filesystem that apparently wants to handle the "posixacl" option
is 9p, but it has special handling of that option besides setting
SB_POSIXACL.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Makes little sense to keep this blacklist synced with what mount(8) parses
and what it doesn't. E.g. it has various forms of "*atime" options, but
not "atime"...
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Determining whether a path or file descriptor refers to a mountpoint (or
more precisely a mount root) is not trivial using current tools.
Add a flag to statx that indicates whether the path or fd refers to the
root of a mount or not.
Cc: linux-api@vger.kernel.org
Cc: linux-man@vger.kernel.org
Reported-by: Lennart Poettering <mzxreary@0pointer.de>
Reported-by: J. Bruce Fields <bfields@fieldses.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Systemd is hacking around to get it and it's trivial to add to statx, so...
Cc: linux-api@vger.kernel.org
Cc: linux-man@vger.kernel.org
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
IS_NOATIME(inode) is defined as __IS_FLG(inode, SB_RDONLY|SB_NOATIME), so
generic_fillattr() will clear STATX_ATIME from the result_mask if the super
block is marked read only.
This was probably not the intention, so fix to only clear STATX_ATIME if
the fs doesn't support atime at all.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Acked-by: David Howells <dhowells@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Constants of the *_ALL type can be actively harmful due to the fact that
developers will usually fail to consider the possible effects of future
changes to the definition.
Deprecate STATX_ALL in the uapi, while no damage has been done yet.
We could keep something like this around in the kernel, but there's
actually no point, since all filesystems should be explicitly checking
flags that they support and not rely on the VFS masking unknown ones out: a
flag could be known to the VFS, yet not known to the filesystem.
Cc: David Howells <dhowells@redhat.com>
Cc: linux-api@vger.kernel.org
Cc: linux-man@vger.kernel.org
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
This makes it possible to use utimensat on an O_PATH file (including
symlinks).
It supersedes the nonstandard utimensat(fd, NULL, ...) form.
Cc: linux-api@vger.kernel.org
Cc: linux-man@vger.kernel.org
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Split out a helper that overrides the credentials in preparation for
actually doing the access check.
This prepares for the next patch that optionally disables the creds
override.
Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
If mounts are deleted after a read(2) call on /proc/self/mounts (or its
kin), the subsequent read(2) could miss a mount that comes after the
deleted one in the list. This is because the file position is interpreted
as the number mount entries from the start of the list.
E.g. first read gets entries #0 to #9; the seq file index will be 10. Then
entry #5 is deleted, resulting in #10 becoming #9 and #11 becoming #10,
etc... The next read will continue from entry #10, and #9 is missed.
Solve this by adding a cursor entry for each open instance. Taking the
global namespace_sem for write seems excessive, since we are only dealing
with a per-namespace list. Instead add a per-namespace spinlock and use
that together with namespace_sem taken for read to protect against
concurrent modification of the mount list. This may reduce parallelism of
is_local_mountpoint(), but it's hardly a big contention point. We could
also use RCU freeing of cursors to make traversal not need additional
locks, if that turns out to be neceesary.
Only move the cursor once for each read (cursor is not added on open) to
minimize cacheline invalidation. When EOF is reached, the cursor is taken
off the list, in order to prevent an excessive number of cursors due to
inactive open file descriptors.
Reported-by: Karel Zak <kzak@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Avi Kivity reports that on fuse filesystems running in a user namespace
asyncronous fsync fails with EOVERFLOW.
The reason is that f_ops->fsync() is called with the creds of the kthread
performing aio work instead of the creds of the process originally
submitting IOCB_CMD_FSYNC.
Fuse sends the creds of the caller in the request header and it needs to
translate the uid and gid into the server's user namespace. Since the
kthread is running in init_user_ns, the translation will fail and the
operation returns an error.
It can be argued that fsync doesn't actually need any creds, but just
zeroing out those fields in the header (as with requests that currently
don't take creds) is a backward compatibility risk.
Instead of working around this issue in fuse, solve the core of the problem
by calling the filesystem with the proper creds.
Reported-by: Avi Kivity <avi@scylladb.com>
Tested-by: Giuseppe Scrivano <gscrivan@redhat.com>
Fixes: c9582eb0ff ("fuse: Fail all requests with invalid uids or gids")
Cc: stable@vger.kernel.org # 4.18+
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Whiteouts, unlike real device node should not require privileges to create.
The general concern with device nodes is that opening them can have side
effects. The kernel already avoids zero major (see
Documentation/admin-guide/devices.txt). To be on the safe side the patch
explicitly forbids registering a char device with 0/0 number (see
cdev_add()).
This guarantees that a non-O_PATH open on a whiteout will fail with ENODEV;
i.e. it won't have any side effect.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
We can use the ext4_has_feature_bigalloc() function directly to check
bigalloc feature and the variable has_bigalloc is reduncant, so remove
it.
Signed-off-by: Kaixu Xia <kaixuxia@tencent.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/1586935542-29588-1-git-send-email-kaixuxia@tencent.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
This patch corrects the SPDX License Identifier style in header files
related to XFS File System support. For C header files
Documentation/process/license-rules.rst mandates C-like comments.
(opposed to C source files where C++ style should be used).
Changes made by using a script provided by Joe Perches here:
https://lkml.org/lkml/2019/2/7/46.
Suggested-by: Joe Perches <joe@perches.com>
Signed-off-by: Nishad Kamdar <nishadkamdar@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:
struct foo {
int stuff;
struct boo array[];
};
By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.
Also, notice that, dynamic memory allocations won't be affected by
this change:
"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]
sizeof(flexible-array-member) triggers a warning because flexible array
members have incomplete type[1]. There are some instances of code in
which the sizeof operator is being incorrectly/erroneously applied to
zero-length arrays and the result is zero. Such instances may be hiding
some bugs. So, this work (flexible-array member conversions) will also
help to get completely rid of those sorts of issues.
This issue was found with the help of Coccinelle.
[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
When we changed the file registration handling, it became important to
iterate the bulk request freeing list for fixed files as well, or we
miss dropping the fixed file reference. If not, we're leaking references,
and we'll get a kworker stuck waiting for file references to disappear.
This also means we can remove the special casing of fixed vs non-fixed
files, we need to iterate for both and we can just rely on
__io_req_aux_free() doing io_put_file() instead of doing it manually.
Fixes: 0558955373 ("io_uring: refactor file register/unregister/update handling")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
DCACHE_DONTCACHE indicates a dentry should not be cached on final
dput().
Also add a helper function to mark DCACHE_DONTCACHE on all dentries
pointing to a specific inode when that inode is being set I_DONTCACHE.
This facilitates dropping dentry references to inodes sooner which
require eviction to swap S_DAX mode.
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
DAX effective mode (S_DAX) changes requires inode eviction.
XFS has an advisory flag (XFS_IDONTCACHE) to prevent caching of the
inode if no other additional references are taken. We lift this flag to
the VFS layer and change the behavior slightly by allowing the flag to
remain even if multiple references are taken.
This will expedite the eviction of inodes to change S_DAX.
Cc: Al Viro <viro@zeniv.linux.org.uk>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
fanotify_write() only aligned copy_from_user size to sizeof(response)
for higher values. This patch avoids all values below as suggested
by Amir Goldstein and set to response size unconditionally.
Link: https://lore.kernel.org/r/20200512181921.405973-1-fabf@skynet.be
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
fill_event_metadata() was removed in commit bb2f7b4542
("fanotify: open code fill_event_metadata()")
Link: https://lore.kernel.org/r/20200512181836.405879-1-fabf@skynet.be
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Call mutex_destroy() before freeing notification group. This only adds
some additional debug checks when mutex debugging is enabled but still
it may be useful.
Link: https://lore.kernel.org/r/20200512181803.405832-1-fabf@skynet.be
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
When we're doing pnfs then the credential being used for the RPC call
is not necessarily the same as the one used in the open context, so
don't use RPC_TASK_CRED_NOREF.
Fixes: 6129650720 ("NFSv4: Avoid referencing the cred unnecessarily during NFSv4 I/O")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
For quite a while we have been thinking about using pidfds to attach to
namespaces. This patchset has existed for about a year already but we've
wanted to wait to see how the general api would be received and adopted.
Now that more and more programs in userspace have started using pidfds
for process management it's time to send this one out.
This patch makes it possible to use pidfds to attach to the namespaces
of another process, i.e. they can be passed as the first argument to the
setns() syscall. When only a single namespace type is specified the
semantics are equivalent to passing an nsfd. That means
setns(nsfd, CLONE_NEWNET) equals setns(pidfd, CLONE_NEWNET). However,
when a pidfd is passed, multiple namespace flags can be specified in the
second setns() argument and setns() will attach the caller to all the
specified namespaces all at once or to none of them. Specifying 0 is not
valid together with a pidfd.
Here are just two obvious examples:
setns(pidfd, CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWNET);
setns(pidfd, CLONE_NEWUSER);
Allowing to also attach subsets of namespaces supports various use-cases
where callers setns to a subset of namespaces to retain privilege, perform
an action and then re-attach another subset of namespaces.
If the need arises, as Eric suggested, we can extend this patchset to
assume even more context than just attaching all namespaces. His suggestion
specifically was about assuming the process' root directory when
setns(pidfd, 0) or setns(pidfd, SETNS_PIDFD) is specified. For now, just
keep it flexible in terms of supporting subsets of namespaces but let's
wait until we have users asking for even more context to be assumed. At
that point we can add an extension.
The obvious example where this is useful is a standard container
manager interacting with a running container: pushing and pulling files
or directories, injecting mounts, attaching/execing any kind of process,
managing network devices all these operations require attaching to all
or at least multiple namespaces at the same time. Given that nowadays
most containers are spawned with all namespaces enabled we're currently
looking at at least 14 syscalls, 7 to open the /proc/<pid>/ns/<ns>
nsfds, another 7 to actually perform the namespace switch. With time
namespaces we're looking at about 16 syscalls.
(We could amortize the first 7 or 8 syscalls for opening the nsfds by
stashing them in each container's monitor process but that would mean
we need to send around those file descriptors through unix sockets
everytime we want to interact with the container or keep on-disk
state. Even in scenarios where a caller wants to join a particular
namespace in a particular order callers still profit from batching
other namespaces. That mostly applies to the user namespace but
all container runtimes I found join the user namespace first no matter
if it privileges or deprivileges the container similar to how unshare
behaves.)
With pidfds this becomes a single syscall no matter how many namespaces
are supposed to be attached to.
A decently designed, large-scale container manager usually isn't the
parent of any of the containers it spawns so the containers don't die
when it crashes or needs to update or reinitialize. This means that
for the manager to interact with containers through pids is inherently
racy especially on systems where the maximum pid number is not
significicantly bumped. This is even more problematic since we often spawn
and manage thousands or ten-thousands of containers. Interacting with a
container through a pid thus can become risky quite quickly. Especially
since we allow for an administrator to enable advanced features such as
syscall interception where we're performing syscalls in lieu of the
container. In all of those cases we use pidfds if they are available and
we pass them around as stable references. Using them to setns() to the
target process' namespaces is as reliable as using nsfds. Either the
target process is already dead and we get ESRCH or we manage to attach
to its namespaces but we can't accidently attach to another process'
namespaces. So pidfds lend themselves to be used with this api.
The other main advantage is that with this change the pidfd becomes the
only relevant token for most container interactions and it's the only
token we need to create and send around.
Apart from significiantly reducing the number of syscalls from double
digit to single digit which is a decent reason post-spectre/meltdown
this also allows to switch to a set of namespaces atomically, i.e.
either attaching to all the specified namespaces succeeds or we fail. If
we fail we haven't changed a single namespace. There are currently three
namespaces that can fail (other than for ENOMEM which really is not
very interesting since we then have other problems anyway) for
non-trivial reasons, user, mount, and pid namespaces. We can fail to
attach to a pid namespace if it is not our current active pid namespace
or a descendant of it. We can fail to attach to a user namespace because
we are multi-threaded or because our current mount namespace shares
filesystem state with other tasks, or because we're trying to setns()
to the same user namespace, i.e. the target task has the same user
namespace as we do. We can fail to attach to a mount namespace because
it shares filesystem state with other tasks or because we fail to lookup
the new root for the new mount namespace. In most non-pathological
scenarios these issues can be somewhat mitigated. But there are cases where
we're half-attached to some namespace and failing to attach to another one.
I've talked about some of these problem during the hallway track (something
only the pre-COVID-19 generation will remember) of Plumbers in Los Angeles
in 2018(?). Even if all these issues could be avoided with super careful
userspace coding it would be nicer to have this done in-kernel. Pidfds seem
to lend themselves nicely for this.
The other neat thing about this is that setns() becomes an actual
counterpart to the namespace bits of unshare().
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: Serge Hallyn <serge@hallyn.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Serge Hallyn <serge@hallyn.com>
Cc: Jann Horn <jannh@google.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Link: https://lore.kernel.org/r/20200505140432.181565-3-christian.brauner@ubuntu.com
Overlayfs doesn't work well with the fanotify mechanism.
Fanotify first probes for the required buffer size for the file handle,
but overlayfs currently bails out without passing the size back.
That results in errors in the kernel log, such as:
[527944.485384] overlayfs: failed to encode file handle (/, err=-75, buflen=0, len=29, type=1)
[527944.485386] fanotify: failed to encode fid (fsid=ae521e68.a434d95f, type=255, bytes=0, err=-2)
Signed-off-by: Lubos Dolezel <lubos@dolezel.info>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
sync_filesystem() does not sync dirty data for readonly filesystem during
umount, so before changing to readonly filesystem we should sync dirty data
for data integrity.
Signed-off-by: Chengguang Xu <cgxu519@mykernel.net>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Share inode with different whiteout files for saving inode and speeding up
delete operation.
If EMLINK is encountered when linking a shared whiteout, create a new one.
In case of any other error, disable sharing for this super block.
Note: ofs->whiteout is protected by inode lock on workdir.
Signed-off-by: Chengguang Xu <cgxu519@mykernel.net>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Since the stacking of regular file operations [1], the overlayfs edition of
write_iter() is called when writing regular files.
Since then, xattr lookup is needed on every write since file_remove_privs()
is called from ovl_write_iter(), which would become the performance
bottleneck when writing small chunks of data. In my test case,
file_remove_privs() would consume ~15% CPU when running fstime of unixbench
(the workload is repeadly writing 1 KB to the same file) [2].
Inherit the SB_NOSEC flag from upperdir. Since then xattr lookup would be
done only once on the first write. Unixbench fstime gets a ~20% performance
gain with this patch.
[1] https://lore.kernel.org/lkml/20180606150905.GC9426@magnolia/T/
[2] https://www.spinics.net/lists/linux-unionfs/msg07153.html
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Stacked filesystems like overlayfs has no own writeback, but they have to
forward syncfs() requests to backend for keeping data integrity.
During global sync() each overlayfs instance calls method ->sync_fs() for
backend although it itself is in global list of superblocks too. As a
result one syscall sync() could write one superblock several times and send
multiple disk barriers.
This patch adds flag SB_I_SKIP_SYNC into sb->sb_iflags to avoid that.
Reported-by: Dmitry Monakhov <dmtrmonakhov@yandex-team.ru>
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
With index=on, let index dir act as the work dir for copy up and cleanups.
This will help implementing whiteout inode sharing.
We still create the "work" dir on mount regardless of index=on and it is
used to test the features supported by upper fs. One reason is that before
the feature tests, we do not know if index could be enabled or not.
The reason we do not use "index" directory also as workdir with index=off
is because the existence of the "index" directory acts as a simple
persistent signal that index was enabled on this filesystem and tools may
want to use that signal.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
With index=on, we copy up lower hardlinks to work dir and move them into
index dir. Fix locking to allow work dir and index dir to be the same
directory.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Teach ovl_indexdir_cleanup() to remove temp directories containing
whiteouts to prepare for using index dir instead of work dir for removing
merge directories.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Similar to the way that a conflict between metacopy=on,redirect_dir=off is
resolved, also resolve conflicts between nfs_export=on,index=off and
nfs_export=on,metacopy=on.
An explicit mount option wins over a default config value. Both explicit
mount options result in an error.
Without this change the xfstests group overlay/exportfs are skipped if
metacopy is enabled by default.
Reported-by: Chengguang Xu <cgxu519@mykernel.net>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
The "buflen" value comes from the user and there is a potential that it
could be zero. In do_handle_to_path() we know that "handle->handle_bytes"
is non-zero and we do:
handle_dwords = handle->handle_bytes >> 2;
So values 1-3 become zero. Then in ovl_fh_to_dentry() we do:
int len = fh_len << 2;
So now len is in the "0,4-128" range and a multiple of 4. But if
"buflen" is zero it will try to copy negative bytes when we do the
memcpy in ovl_fid_to_fh().
memcpy(&fh->fb, fid, buflen - OVL_FH_WIRE_OFFSET);
And that will lead to a crash. Thanks to Amir Goldstein for his help
with this patch.
Fixes: cbe7fba8ed ("ovl: make sure that real fid is 32bit aligned in memory")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Cc: <stable@vger.kernel.org> # v5.5
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Synchronous direct I/O to a sequential write only zone can be issued using
the new REQ_OP_ZONE_APPEND request operation. As dispatching multiple
BIOs can potentially result in reordering, we cannot support asynchronous
IO via this interface.
We also can only dispatch up to queue_max_zone_append_sectors() via the
new zone-append method and have to return a short write back to user-space
in case an IO larger than queue_max_zone_append_sectors() has been issued.
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Sync dio could be big, or may take long time in discard or in case of
IO failure.
We have prevented task hung in submit_bio_wait() and blk_execute_rq(),
so apply the same trick for prevent task hung from happening in sync dio.
Add helper of blk_io_schedule() and use io_schedule_timeout() to prevent
task hung warning.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Cc: Salman Qazi <sqazi@google.com>
Cc: Jesse Barnes <jsbarnes@google.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Remove the unnecessary 'extern' keywords from function declarations.
This makes it so that we don't have a mix of both styles, so it won't be
ambiguous what to use in new fs-verity patches. This also makes the
code shorter and matches the 'checkpatch --strict' expectation.
Link: https://lore.kernel.org/r/20200511192118.71427-3-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Fix all kerneldoc warnings in fs/verity/ and include/linux/fsverity.h.
Most of these were due to missing documentation for function parameters.
Detected with:
scripts/kernel-doc -v -none fs/verity/*.{c,h} include/linux/fsverity.h
This cleanup makes it possible to check new patches for kerneldoc
warnings without having to filter out all the existing ones.
Link: https://lore.kernel.org/r/20200511192118.71427-2-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Remove the unnecessary 'extern' keywords from function declarations.
This makes it so that we don't have a mix of both styles, so it won't be
ambiguous what to use in new fscrypt patches. This also makes the code
shorter and matches the 'checkpatch --strict' expectation.
Link: https://lore.kernel.org/r/20200511191358.53096-4-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Fix all kerneldoc warnings in fs/crypto/ and include/linux/fscrypt.h.
Most of these were due to missing documentation for function parameters.
Detected with:
scripts/kernel-doc -v -none fs/crypto/*.{c,h} include/linux/fscrypt.h
This cleanup makes it possible to check new patches for kerneldoc
warnings without having to filter out all the existing ones.
For consistency, also adjust some function "brief descriptions" to
include the parentheses and to wrap at 80 characters. (The latter
matches the checkpatch expectation.)
Link: https://lore.kernel.org/r/20200511191358.53096-2-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Building a kernel with clang sometimes fails with an objtool error in dlm:
fs/dlm/lock.o: warning: objtool: revert_lock_pc()+0xbd: can't find jump dest instruction at .text+0xd7fc
The problem is that BUG() never returns and the compiler knows
that anything after it is unreachable, however the panic still
emits some code that does not get fully eliminated.
Having both BUG() and panic() is really pointless as the BUG()
kills the current process and the subsequent panic() never hits.
In most cases, we probably don't really want either and should
replace the DLM_ASSERT() statements with WARN_ON(), as has
been done for some of them.
Remove the BUG() here so the user at least sees the panic message
and we can reliably build randconfig kernels.
Fixes: e7fd41792f ("[DLM] The core of the DLM for GFS2/CLVM")
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: clang-built-linux@googlegroups.com
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David Teigland <teigland@redhat.com>
We saw an issue in a production server on a customer deployment where
DLM 4.0.7 gets "stuck" and unable to join new lockspaces.
There is no useful response for the dlm in do_event() if
wait_event_interruptible() is interrupted, so switch to
wait_event().
Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com>
Signed-off-by: David Teigland <teigland@redhat.com>
Fix the following coccicheck warning:
fs/dlm/rcom.c:566:2-3: Unneeded semicolon
Signed-off-by: Wu Bo <wubo40@huawei.com>
Signed-off-by: David Teigland <teigland@redhat.com>
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:
struct foo {
int stuff;
struct boo array[];
};
By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.
Also, notice that, dynamic memory allocations won't be affected by
this change:
"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]
This issue was found with the help of Coccinelle.
[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: David Teigland <teigland@redhat.com>
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:
struct foo {
int stuff;
struct boo array[];
};
By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.
Also, notice that, dynamic memory allocations won't be affected by
this change:
"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]
This issue was found with the help of Coccinelle.
[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: David Teigland <teigland@redhat.com>
Fixes for bugs prior to v5.7-rc1:
- Fix random block reads when reading fragmented journals (v5.2).
- Fix a possible random memory access in gfs2_walk_metadata (v5.3).
Fixes for v5.7-rc1:
- Fix several overlooked gfs2_qa_get / gfs2_qa_put imbalances.
- Fix several bugs in the new filesystem withdraw logic.
-----BEGIN PGP SIGNATURE-----
iQJIBAABCAAyFiEEJZs3krPW0xkhLMTc1b+f6wMTZToFAl66togUHGFncnVlbmJh
QHJlZGhhdC5jb20ACgkQ1b+f6wMTZTruhxAAttUMbVZaxny7zB+gXc7fqvM3T6BE
m613knneGkkQIRQqzLXKictUkWiTItkNaM7HFwO9MJfDZO1xMett941RpIlW4oa5
d42EWRKEwAZZOx+Yz9tE9G/fPoh0Rz16Svl/0EJ6NG5QfTyyuSQoH+MbRibVlYy2
XVnfMKZAEyOsIJ8lu3xRzjLTwkRK/8X+QpF/syanEq9oaFMYtB7j1TOgimVUMV3m
5va4+PXARx1/Dsgn/21zgsZgQ4IW7ZYXzjxZuX9CwbKaszz+f77pyxkea5fDvVFo
16OaFXtl+dzBJ4vIdZr9OfQTvMfSCxWiXgjxj+6W152qXEQkyKDWGETH7A3yVZ4n
9G3N+Cdpp09gM8tmI9140uTDNXLg8M34fTtHntqckPKpNZ9IvzoXTp3ebSe92pwJ
+5K1//ifcTqbnHCwTCPPYEtIRGbm/I0en0H9A3tqFmKDNdarnVuZ5QHJVrSF9x8g
z+Go3NJlhevq64OGLXd8UlODRevpGPjQWdrcFjeuLhtcqbUVjcERoEcaBsJoKdus
NYn+yT5CqzMqLZzXLIAfm9TfCry9/NF7D/7acsZZ05BEyz+WOwHVMTcTdsAfT6Ft
1ytU7tufdM/Zw/8t6lI89rC/XcDwAm/vEpQLd27xUvMKOKaEQYKs8geh9du4Q+fN
yaQOvgDhmPVIKwI=
=4Sbr
-----END PGP SIGNATURE-----
Merge tag 'gfs2-v5.7-rc1.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2
Pull gfs2 fixes from Andreas Gruenbacher:
"Various gfs2 fixes.
Fixes for bugs prior to v5.7:
- Fix random block reads when reading fragmented journals (v5.2)
- Fix a possible random memory access in gfs2_walk_metadata (v5.3)
Fixes for v5.7:
- Fix several overlooked gfs2_qa_get / gfs2_qa_put imbalances
- Fix several bugs in the new filesystem withdraw logic"
* tag 'gfs2-v5.7-rc1.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
Revert "gfs2: Don't demote a glock until its revokes are written"
gfs2: If go_sync returns error, withdraw but skip invalidate
gfs2: Grab glock reference sooner in gfs2_add_revoke
gfs2: don't call quota_unhold if quotas are not locked
gfs2: move privileged user check to gfs2_quota_lock_check
gfs2: remove check for quotas on in gfs2_quota_check
gfs2: Change BUG_ON to an assert_withdraw in gfs2_quota_change
gfs2: Fix problems regarding gfs2_qa_get and _put
gfs2: More gfs2_find_jhead fixes
gfs2: Another gfs2_walk_metadata fix
gfs2: Fix use-after-free in gfs2_logd after withdraw
gfs2: Fix BUG during unmount after file system withdraw
gfs2: Fix error exit in do_xmote
gfs2: fix withdraw sequence deadlock
The "unlink" handling should perform list removal (which can also make
sure records don't get double-erased), and the "evict" handling should
be responsible only for memory freeing.
Link: https://lore.kernel.org/lkml/20200506152114.50375-8-keescook@chromium.org/
Signed-off-by: Kees Cook <keescook@chromium.org>
The pstorefs internal list lock doesn't need to be a spinlock and will
create problems when trying to access the list in the subsequent patch
that will walk the pstorefs records during pstore_unregister(). Change
this to a mutex to avoid may_sleep() warnings when unregistering devices.
Link: https://lore.kernel.org/lkml/20200506152114.50375-6-keescook@chromium.org/
Signed-off-by: Kees Cook <keescook@chromium.org>
The name "allpstore" doesn't carry much meaning, so rename it to what it
actually is: the list of all records present in the filesystem. The lock
is also renamed accordingly.
Link: https://lore.kernel.org/lkml/20200506152114.50375-5-keescook@chromium.org/
Signed-off-by: Kees Cook <keescook@chromium.org>
Currently pstore can only have a single backend attached at a time, and it
tracks the active backend via "psinfo", under a lock. The locking for this
does not need to be a spinlock, and in order to avoid may_sleep() issues
during future changes to pstore_unregister(), switch to a mutex instead.
Link: https://lore.kernel.org/lkml/20200506152114.50375-4-keescook@chromium.org/
Signed-off-by: Kees Cook <keescook@chromium.org>
There is no reason to be doing a module get/put in pstore_register(),
since the module calling pstore_register() cannot be unloaded since it
hasn't finished its initialization. Remove it so there is no confusion
about how registration ordering works.
Link: https://lore.kernel.org/lkml/20200506152114.50375-2-keescook@chromium.org/
Signed-off-by: Kees Cook <keescook@chromium.org>
During zstd compression, ZSTD_endStream() may return non-zero value
because distination buffer is full, but there is still compressed data
remained in intermediate buffer, it means that zstd algorithm can not
save at last one block space, let's just writeback raw data instead of
compressed one, this can fix data corruption when decompressing
incomplete stored compression data.
Fixes: 50cfa66f0d ("f2fs: compress: support zstd compress algorithm")
Signed-off-by: Daeho Jeong <daehojeong@google.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In error path of f2fs_read_multi_pages(), it should let last referrer
release decompress io context memory, otherwise, other referrer will
cause use-after-free issue.
Fixes: 4c8ff7095b ("f2fs: support data compression")
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If bio has no compressed data, we don't need to handle end_io work in
workqueue, instead, it should just let interrupter handle it directly
to speed up IO response.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The variable err is being assigned with a value that is never read
and it is being updated later with a new value. The initialization is
redundant and can be removed.
Addresses-Coverity: ("Unused value")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Sahitya raised an issue:
- prevent meta updates while checkpoint is in progress
allocate_segment_for_resize() can cause metapage updates if
it requires to change the current node/data segments for resizing.
Stop these meta updates when there is a checkpoint already
in progress to prevent inconsistent CP data.
Signed-off-by: Sahitya Tummala <stummala@codeaurora.org>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
.i_cluster_size should be power of 2, so we can use round_up() instead
of roundup() to enhance the calculation.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch introduces a new ioctl to rollback all compress inode
status:
- add reserved blocks in dnode blocks
- increase i_compr_blocks, i_blocks, total_valid_block_count
- remove immutable flag
Then compress inode can be restored to support overwrite
functionality again.
Signee-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
There could be a scenario where f2fs_sync_node_pages gets
called during checkpoint, which in turn tries to flush
inline data and calls iput(). This results in deadlock as
iput() tries to hold cp_rwsem, which is already held at the
beginning by checkpoint->block_operations().
Call stack :
Thread A Thread B
f2fs_write_checkpoint()
- block_operations(sbi)
- f2fs_lock_all(sbi);
- down_write(&sbi->cp_rwsem);
- open()
- igrab()
- write() write inline data
- unlink()
- f2fs_sync_node_pages()
- if (is_inline_node(page))
- flush_inline_data()
- ilookup()
page = f2fs_pagecache_get_page()
if (!page)
goto iput_out;
iput_out:
-close()
-iput()
iput(inode);
- f2fs_evict_inode()
- f2fs_truncate_blocks()
- f2fs_lock_op()
- down_read(&sbi->cp_rwsem);
Fixes: 2049d4fcb0 ("f2fs: avoid multiple node page writes due to inline_data")
Signed-off-by: Sayali Lokhande <sayalil@codeaurora.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This reserved space isn't committed yet but cannot be used for
allocations. For userspace it has no difference from used space.
See the same fix in ext4 commit f06925c739 ("ext4: report delalloc
reserve as non-free in statfs for project quota").
Fixes: ddc34e328d ("f2fs: introduce f2fs_statfs_project")
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
update_sit_info should be f2fs_update_sit_info,
otherwise build fails while no CONFIG_F2FS_STAT_FS.
Fixes: fc7100ea2a ("f2fs: Add f2fs stats to sysfs")
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Commonly, in order to handle lz4 worst compress case, caller should
allocate buffer with size of LZ4_compressBound(inputsize) for target
compressed data storing, however in this case, if caller didn't
allocate enough space, lz4 compressor still can handle output buffer
budget properly, and end up compressing when left space in output
buffer is not enough.
So we don't have to allocate buffer with size for worst case, then
we can avoid 2 * 4KB size intermediate buffer allocation when
log_cluster_size is 2, and avoid unnecessary compressing work of
compressor if we can not save at least 4KB space.
Suggested-by: Daeho Jeong <daehojeong@google.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
There are still reserved blocks on compressed inode, this patch
introduce a new ioctl to help release reserved blocks back to
filesystem, so that userspace can reuse those freed space.
----
Daeho fixed a bug like below.
Now, if writing pages and releasing compress blocks occur
simultaneously, and releasing cblocks is executed more than one time
to a file, then total block count of filesystem and block count of the
file could be incorrect and damaged.
We have to execute releasing compress blocks only one time for a file
without being interfered by writepages path.
---
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Daeho Jeong <daehojeong@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In f2fs_try_to_free_nids(), .nid_list_lock spinlock critical region will
increase as expected shrink number increase, to avoid spining other CPUs
for long time, we change to release nid caches with small batch each time
under .nid_list_lock coverage.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
fsync() may be called on a deleted file that's still open. So when
fsync() tries to set the parent inode number when the inode has
LOST_PINO and i_nlink == 1 (to avoid later checkpoints), it needs to
make sure to get the parent directory via a non-deleted alias.
Also remove the unnecessary igrab() and iput(), as the caller already
holds a reference to the inode.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Rework f2fs's handling of filenames to use a new 'struct f2fs_filename'.
Similar to 'struct ext4_filename', this stores the usr_fname, disk_name,
dirhash, crypto_buf, and casefolded name. Some of these names can be
NULL in some cases. 'struct f2fs_filename' differs from
'struct fscrypt_name' mainly in that the casefolded name is included.
For user-initiated directory operations like lookup() and create(),
initialize the f2fs_filename by translating the corresponding
fscrypt_name, then computing the dirhash and casefolded name if needed.
This makes the dirhash and casefolded name be cached for each syscall,
so we don't have to recompute them repeatedly. (Previously, f2fs
computed the dirhash once per directory level, and the casefolded name
once per directory block.) This improves performance.
This rework also makes it much easier to correctly handle all
combinations of normal, encrypted, casefolded, and encrypted+casefolded
directories. (The fourth isn't supported yet but is being worked on.)
The only other cases where an f2fs_filename gets initialized are for two
filesystem-internal operations: (1) when converting an inline directory
to a regular one, we grab the needed disk_name and hash from an existing
f2fs_dir_entry; and (2) when roll-forward recovering a new dentry, we
grab the needed disk_name from f2fs_inode::i_name and compute the hash.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Sharing f2fs_ci_compare() between comparing cached dentries
(f2fs_d_compare()) and comparing on-disk dentries (f2fs_match_name())
doesn't work as well as intended, as these actions fundamentally differ
in several ways (e.g. whether the task may sleep, whether the directory
is stable, whether the casefolded name was precomputed, whether the
dentry will need to be decrypted once we allow casefold+encrypt, etc.)
Just make f2fs_d_compare() implement what it needs directly, and rework
f2fs_ci_compare() to be specialized for f2fs_match_name().
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
We need to call fscrypt_free_filename() to free the memory allocated by
fscrypt_setup_filename().
Fixes: b06af2aff2 ("f2fs: convert inline_dir early before starting rename")
Cc: <stable@vger.kernel.org> # v5.6+
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
LZO-RLE extension (run length encoding) was introduced to improve
performance of LZO algorithm in scenario of data contains many zeros,
zram has changed to use this extended algorithm by default, this
patch adds to support this algorithm extension, to enable this
extension, it needs to enable F2FS_FS_LZO and F2FS_FS_LZORLE config,
and specifies "compress_algorithm=lzo-rle" mountoption.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If compression feature is on, in scenario of no enough free memory,
page refault ratio is higher than before, the root cause is:
- {,de}compression flow needs to allocate intermediate pages to store
compressed data in cluster, so during their allocation, vm may reclaim
mmaped pages.
- if above reclaimed pages belong to compressed cluster, during its
refault, it may cause more intermediate pages allocation, result in
reclaiming more mmaped pages.
So this patch introduces a mempool for intermediate page allocation,
in order to avoid high refault ratio, by default, number of
preallocated page in pool is 512, user can change the number by
assigning 'num_compress_pages' parameter during module initialization.
Ma Feng found warnings in the original patch and fixed like below.
Fix the following sparse warning:
fs/f2fs/compress.c:501:5: warning: symbol 'num_compress_pages' was not declared.
Should it be static?
fs/f2fs/compress.c:530:6: warning: symbol 'f2fs_compress_free_page' was not
declared. Should it be static?
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Ma Feng <mafeng.ma@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
We are not guaranteed that the credential will remain pinned.
Fixes: 6129650720 ("NFSv4: Avoid referencing the cred unnecessarily during NFSv4 I/O")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
(1) The reorganisation of bmap() use accidentally caused the return value
of cachefiles_read_or_alloc_pages() to get corrupted.
(2) The NFS superblock index key accidentally got changed to include a
number of kernel pointers - meaning that the key isn't matchable after
a reboot.
(3) A redundant check in nfs_fscache_get_super_cookie().
(4) The NFS change_attr sometimes set in the auxiliary data for the
caching of an file and sometimes not, which causes the cache to get
discarded when it shouldn't.
(5) There's a race between cachefiles_read_waiter() and
cachefiles_read_copier() that causes an occasional assertion failure.
We add the new state to the nfsi->open_states list, making it
potentially visible to other threads, before we've finished initializing
it.
That wasn't a problem when all the readers were also taking the i_lock
(as we do here), but since we switched to RCU, there's now a possibility
that a reader could see the partially initialized state.
Symptoms observed were a crash when another thread called
nfs4_get_valid_delegation() on a NULL inode, resulting in an oops like:
BUG: unable to handle page fault for address: ffffffffffffffb0 ...
RIP: 0010:nfs4_get_valid_delegation+0x6/0x30 [nfsv4] ...
Call Trace:
nfs4_open_prepare+0x80/0x1c0 [nfsv4]
__rpc_execute+0x75/0x390 [sunrpc]
? finish_task_switch+0x75/0x260
rpc_async_schedule+0x29/0x40 [sunrpc]
process_one_work+0x1ad/0x370
worker_thread+0x30/0x390
? create_worker+0x1a0/0x1a0
kthread+0x10c/0x130
? kthread_park+0x80/0x80
ret_from_fork+0x22/0x30
Fixes: 9ae075fdd1 "NFSv4: Convert open state lookup to use RCU"
Reviewed-by: Seiichi Ikarashi <s.ikarashi@fujitsu.com>
Tested-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com>
Tested-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Cc: stable@vger.kernel.org # v4.20+
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Make the code more robust by marking the point of no return sooner.
This ensures that future code changes don't need to worry about how
they return errors if they are past this point.
This results in no actual change in behavior as __do_execve_file does
not force SIGSEGV when there is a pending fatal signal pending past
the point of no return. Further the only error returns from de_thread
and exec_mmap that can occur result in fatal signals being pending.
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lkml.kernel.org/r/87sgga5klu.fsf_-_@x220.int.ebiederm.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Move the handing of the point of no return from search_binary_handler
into __do_execve_file so that it is easier to find, and to keep
things robust in the face of change.
Make it clear that an existing fatal signal will take precedence over
a forced SIGSEGV by not forcing SIGSEGV if a fatal signal is already
pending. This does not change the behavior but it saves a reader
of the code the tedium of reading and understanding force_sig
and the signal delivery code.
Update the comment in begin_new_exec about where SIGSEGV is forced.
Keep point_of_no_return from being a mystery by documenting
what the code is doing where it forces SIGSEGV if the
code is past the point of no return.
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lkml.kernel.org/r/87y2q25knl.fsf_-_@x220.int.ebiederm.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Like exec_mm_release sync_mm_rss is about flushing out the state of
the old_mm, which does not need to happen under exec_update_mutex.
Make this explicit by moving sync_mm_rss outside of exec_update_mutex.
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lkml.kernel.org/r/875zd66za3.fsf_-_@x220.int.ebiederm.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
rxrpc currently uses a fixed 4s retransmission timeout until the RTT is
sufficiently sampled. This can cause problems with some fileservers with
calls to the cache manager in the afs filesystem being dropped from the
fileserver because a packet goes missing and the retransmission timeout is
greater than the call expiry timeout.
Fix this by:
(1) Copying the RTT/RTO calculation code from Linux's TCP implementation
and altering it to fit rxrpc.
(2) Altering the various users of the RTT to make use of the new SRTT
value.
(3) Replacing the use of rxrpc_resend_timeout to use the calculated RTO
value instead (which is needed in jiffies), along with a backoff.
Notes:
(1) rxrpc provides RTT samples by matching the serial numbers on outgoing
DATA packets that have the RXRPC_REQUEST_ACK set and PING ACK packets
against the reference serial number in incoming REQUESTED ACK and
PING-RESPONSE ACK packets.
(2) Each packet that is transmitted on an rxrpc connection gets a new
per-connection serial number, even for retransmissions, so an ACK can
be cross-referenced to a specific trigger packet. This allows RTT
information to be drawn from retransmitted DATA packets also.
(3) rxrpc maintains the RTT/RTO state on the rxrpc_peer record rather than
on an rxrpc_call because many RPC calls won't live long enough to
generate more than one sample.
(4) The calculated SRTT value is in units of 8ths of a microsecond rather
than nanoseconds.
The (S)RTT and RTO values are displayed in /proc/net/rxrpc/peers.
Fixes: 17926a7932 ([AF_RXRPC]: Provide secure RxRPC sockets for use by userspace and kernel both"")
Signed-off-by: David Howells <dhowells@redhat.com>
Remove duplicate semicolon at the end of line in io_file_from_index()
Signed-off-by: Xiaoming Ni <nixiaoming@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Fix warning:
fs/nfsd/nfssvc.c:604:6: warning: old-style function definition [-Wold-style-definition]
bool i_am_nfsd()
^~~~~~~~~
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Ma Feng <mafeng.ma@huawei.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
We want the driver core fixes in here and this resolves a merge issue
with drivers/base/dd.c
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl63WVAQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpkXWD/9qJgqQpPkigCCwwPHZ+phthw6gHeAgBxPH
Cw6P9QB4QCdacZjQA6QH3zdxaDsCCitQRioWPgxngs1326TKYNzBi7U3eTEwiK12
cnRybLnkzei4yzYVUSJk637oOoQh3CiJLvYcJBppGFi7crpbvlQv68M2hu05vhwL
R/91H62X/5UaUlc1cJV63OBk8euWzF6XNbCQQrR4ayDvz+BsV5Fs72vYa1gx7qIt
as/67oTT6y4U4pd74nT4OGkxDIXbXfn2eTbh5sMNc4ilBkqMyNbf8aOHdWqXZIBd
18RKpNl6h/fiDMJ0jsGliReONLjfRBcJla68Kn1AFONMcyxcXidjptOwLOt2fYWf
YMguCVMhfgxVBslzLWoQ9AWSiNVh36ycORWlCOrnRaOaQCb9OaLZ2fwibfZ0JsMd
0259Z5vA7MIUoobCc5akXOYHbpByA9FSYkKudgTYLpdjkn05kxQyA12GgJjW3sVw
ZRjoUuDuZDDUct6JcLWdrlONT8st05g+qf6PCoD+Jac8HtbpqHfKJJUtYecUat75
4hGKhuvTzpuVY0wNHo3sgqKfsejQODTN6UhejNI11Zs/nx6O0ze/qoDuWZHncnKl
158le+K5rNS8SUNbDBTMWp3OX4SJm/Gsf30fOWkkt6z1iaEfKc5sCxBHvSOeBEvH
M9pzy56Vtw==
=73nU
-----END PGP SIGNATURE-----
Merge tag 'block-5.7-2020-05-09' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
- a small series fixing a use-after-free of bdi name (Christoph,Yufen)
- NVMe fix for a regression with the smaller CQ update (Alexey)
- NVMe fix for a hang at namespace scanning error recovery (Sagi)
- fix race with blk-iocost iocg->abs_vdebt updates (Tejun)
* tag 'block-5.7-2020-05-09' of git://git.kernel.dk/linux-block:
nvme: fix possible hang when ns scanning fails during error recovery
nvme-pci: fix "slimmer CQ head update"
bdi: add a ->dev_name field to struct backing_dev_info
bdi: use bdi_dev_name() to get device name
bdi: move bdi_dev_name out of line
vboxsf: don't use the source name in the bdi name
iocost: protect iocg->abs_vdebt with iocg->waitq.lock
This patch added netlink and ipv6_route targets, using
the same seq_ops (except show() and minor changes for stop())
for /proc/net/{netlink,ipv6_route}.
The net namespace for these targets are the current net
namespace at file open stage, similar to
/proc/net/{netlink,ipv6_route} reference counting
the net namespace at seq_file open stage.
Since module is not supported for now, ipv6_route is
supported only if the IPV6 is built-in, i.e., not compiled
as a module. The restriction can be lifted once module
is properly supported for bpf_iter.
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200509175910.2476329-1-yhs@fb.com
The name is only printed for a not registered bdi in writeback. Use the
device name there as is more useful anyway for the unlike case that the
warning triggers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Merge the _node vs normal version and drop the superflous gfp_t argument.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull in block-5.7 fixes for 5.8. Mostly to resolve a conflict with
the blk-iocost changes, but we also need the base of the bdi
use-after-free as well as we build on top of it.
* block-5.7:
nvme: fix possible hang when ns scanning fails during error recovery
nvme-pci: fix "slimmer CQ head update"
bdi: add a ->dev_name field to struct backing_dev_info
bdi: use bdi_dev_name() to get device name
bdi: move bdi_dev_name out of line
vboxsf: don't use the source name in the bdi name
iocost: protect iocg->abs_vdebt with iocg->waitq.lock
block: remove the bd_openers checks in blk_drop_partitions
nvme: prevent double free in nvme_alloc_ns() error handling
null_blk: Cleanup zoned device initialization
null_blk: Fix zoned command handling
block: remove unused header
blk-iocost: Fix error on iocost_ioc_vrate_adj
bdev: Reduce time holding bd_mutex in sync in blkdev_close()
buffer: remove useless comment and WB_REASON_FREE_MORE_MEM, reason.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use the common interface bdi_dev_name() to get device name.
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Add missing <linux/backing-dev.h> include BFQ
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl62HvYQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgptEAEACbuLfgFok0Vw8j7KNW0WNNKlS2o6nXQlW5
cl95JsqYdSL+toiDPQnJFtdoaxMhzL90kbWZzvPTBj+yTpLzRX0YnwFqXwFfmrga
gd/7SOM5C97F1LCPL+luhbgp5HUq+ZVH882KjMiOVLvjjAb4SeKSexQGoxeKvtcV
Pg3xm+zsbKKvclRDEqhnZB1X93WFAIrufuKBuV5xMZar7lkeRS9zwBUHySXa00xF
i7lbvDqtNn3itgNQd7VGSNCF5u4JxCUm73SumY3nDMFXBfvSNk0nUpFBpTYLjb7G
0XY71tfWrBlbk1sssqr1Dbs+pRuxJRj9FgtfNAMid7gcK0L9k6n7v08cFxkIz4Sv
XPHisD6QCOz7pZ5JwfdAp9Ea5g9z+QsN0G1Owr18fSgWwlgvhJ9rdd4H0Of7rWVj
mGyF5f+ZqoLD2UhaEmLgjQoSvzPlb6rsAUL9SxgpZkg/mk5l0j5tk32JS5bJL8h5
RTj0oeyqoVGKqnRy8heV/0z6TqcEtuNn/nOsht8adCgIUVpk95bkjTGBM900IK/X
HhdJMqPlTEDXQic+ZxVYNHDTZFhq4UOVJkoDfEwIN971LZfUaiz8XZ6uG5m4rFqj
iRmLN5XJNVNK52hNT1dLQyeQ4j3a5OnVGsvjZ33QLy2P6rCZd7yU6jKfsoL8JDEU
uAzkaWqLjA==
=YeXV
-----END PGP SIGNATURE-----
Merge tag 'io_uring-5.7-2020-05-08' of git://git.kernel.dk/linux-block
Pull io_uring fixes from Jens Axboe:
- Fix finish_wait() balancing in file cancelation (Xiaoguang)
- Ensure early cleanup of resources in ring map failure (Xiaoguang)
- Ensure IORING_OP_SLICE does the right file mode checks (Pavel)
- Remove file opening from openat/openat2/statx, it's not needed and
messes with O_PATH
* tag 'io_uring-5.7-2020-05-08' of git://git.kernel.dk/linux-block:
io_uring: don't use 'fd' for openat/openat2/statx
splice: move f_mode checks to do_{splice,tee}()
io_uring: handle -EFAULT properly in io_uring_setup()
io_uring: fix mismatched finish_wait() calls in io_uring_cancel_files()
do_splice() doesn't expect len to be 0. Just always return 0 in this
case as splice(2) does.
Fixes: 7d67af2c01 ("io_uring: add splice(2) support")
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The comment describes work that now happens in unshare_sighand so
move the comment where it makes sense.
Link: https://lkml.kernel.org/r/87mu6i6zcs.fsf_-_@x220.int.ebiederm.org
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Add a simple struct nsset. It holds all necessary pieces to switch to a new
set of namespaces without leaving a task in a half-switched state which we
will make use of in the next patch. This patch switches the existing setns
logic over without causing a change in setns() behavior. This brings
setns() closer to how unshare() works(). The prepare_ns() function is
responsible to prepare all necessary information. This has two reasons.
First it minimizes dependencies between individual namespaces, i.e. all
install handler can expect that all fields are properly initialized
independent in what order they are called in. Second, this makes the code
easier to maintain and easier to follow if it needs to be changed.
The prepare_ns() helper will only be switched over to use a flags argument
in the next patch. Here it will still use nstype as a simple integer
argument which was argued would be clearer. I'm not particularly
opinionated about this if it really helps or not. The struct nsset itself
already contains the flags field since its name already indicates that it
can contain information required by different namespaces. None of this
should have functional consequences.
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: Serge Hallyn <serge@hallyn.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Serge Hallyn <serge@hallyn.com>
Cc: Jann Horn <jannh@google.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Link: https://lore.kernel.org/r/20200505140432.181565-2-christian.brauner@ubuntu.com
The "struct io_submit_state *state" parameter is not used, remove it.
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The attempt protecting us from closing the ring itself wasn't really
complete, and we actually don't need it. The referencing of requests
themselve, and the references they hold on the ring, ensures that the
life time of the ring is sane. With the check removed, we can also
remove the need to have the close operation fget() the file.
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We currently revoke read delegations on any write open or any operation
that modifies file data or metadata (including rename, link, and
unlink). But if the delegation in question is the only read delegation
and is held by the client performing the operation, that's not really
necessary.
It's not always possible to prevent this in the NFSv4.0 case, because
there's not always a way to determine which client an NFSv4.0 delegation
came from. (In theory we could try to guess this from the transport
layer, e.g., by assuming all traffic on a given TCP connection comes
from the same client. But that's not really correct.)
In the NFSv4.1 case the session layer always tells us the client.
This patch should remove such self-conflicts in all cases where we can
reliably determine the client from the compound.
To do that we need to track "who" is performing a given (possibly
lease-breaking) file operation. We're doing that by storing the
information in the svc_rqst and using kthread_data() to map the current
task back to a svc_rqst.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Commit 402cb8dda9 ("fscache: Attach the index key and aux data to
the cookie") added the aux_data and aux_data_len to parameters to
fscache_acquire_cookie(), and updated the callers in the NFS client.
In the process it modified the aux_data to include the change_attr,
but missed adding change_attr to a couple places where aux_data was
used. Specifically, when opening a file and the change_attr is not
added, the following attempt to lookup an object will fail inside
cachefiles_check_object_xattr() = -116 due to
nfs_fscache_inode_check_aux() failing memcmp on auxdata and returning
FSCACHE_CHECKAUX_OBSOLETE.
Fix this by adding nfs_fscache_update_auxdata() to set the auxdata
from all relevant fields in the inode, including the change_attr.
Fixes: 402cb8dda9 ("fscache: Attach the index key and aux data to the cookie")
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Commit f2aedb713c ("NFS: Add fs_context support.") reworked
NFS mount code paths for fs_context support which included
super_block initialization. In the process there was an extra
return left in the code and so we never call
nfs_fscache_get_super_cookie even if 'fsc' is given on as mount
option. In addition, there is an extra check inside
nfs_fscache_get_super_cookie for the NFS_OPTION_FSCACHE which
is unnecessary since the only caller nfs_get_cache_cookie
checks this flag.
Fixes: f2aedb713c ("NFS: Add fs_context support.")
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Commit 402cb8dda9 ("fscache: Attach the index key and aux data to
the cookie") added the index_key and index_key_len parameters to
fscache_acquire_cookie(), and updated the callers in the NFS client.
One of the callers was inside nfs_fscache_get_super_cookie()
and was changed to use the full struct nfs_fscache_key as the
index_key. However, a couple members of this structure contain
pointers and thus will change each time the same NFS share is
remounted. Since index_key is used for fscache_cookie->key_hash
and this subsequently is used to compare cookies, the effectiveness
of fscache with NFS is reduced to the point at which a umount
occurs. Any subsequent remount of the same share will cause a
unique NFS super_block index_key and key_hash to be generated for
the same data, rendering any prior fscache data unable to be
found. A simple reproducer demonstrates the problem.
1. Mount share with 'fsc', create a file, drop page cache
systemctl start cachefilesd
mount -o vers=3,fsc 127.0.0.1:/export /mnt
dd if=/dev/zero of=/mnt/file1.bin bs=4096 count=1
echo 3 > /proc/sys/vm/drop_caches
2. Read file into page cache and fscache, then unmount
dd if=/mnt/file1.bin of=/dev/null bs=4096 count=1
umount /mnt
3. Remount and re-read which should come from fscache
mount -o vers=3,fsc 127.0.0.1:/export /mnt
echo 3 > /proc/sys/vm/drop_caches
dd if=/mnt/file1.bin of=/dev/null bs=4096 count=1
4. Check for READ ops in mountstats - there should be none
grep READ: /proc/self/mountstats
Looking at the history and the removed function, nfs_super_get_key(),
we should only use nfs_fscache_key.key plus any uniquifier, for
the fscache index_key.
Fixes: 402cb8dda9 ("fscache: Attach the index key and aux data to the cookie")
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
This reverts commit df5db5f9ee.
This patch fixes a regression: patch df5db5f9ee allowed function
run_queue() to bypass its call to do_xmote() if revokes were queued for
the glock. That's wrong because its call to do_xmote() is what is
responsible for calling the go_sync() glops functions to sync both
the ail list and any revokes queued for it. By bypassing the call,
gfs2 could get into a stand-off where the glock could not be demoted
until its revokes are written back, but the revokes would not be
written back because do_xmote() was never called.
It "sort of" works, however, because there are other mechanisms like
the log flush daemon (logd) that can sync the ail items and revokes,
if it deems it necessary. The problem is: without file system pressure,
it might never deem it necessary.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Before this patch, if the go_sync operation returned an error during
the do_xmote process (such as unable to sync metadata to the journal)
the code did goto out. That kept the glock locked, so it could not be
given away, which correctly avoids file system corruption. However,
it never set the withdraw bit or requeueing the glock work. So it would
hang forever, unable to ever demote the glock.
This patch changes to goto to a new label, skip_inval, so that errors
from go_sync are treated the same way as errors from go_inval:
The delayed withdraw bit is set and the work is requeued. That way,
the logd should eventually figure out there's a problem and withdraw
properly there.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
big-endian arches, a spammy log message and a couple error paths.
Also included a MAINTAINERS update.
-----BEGIN PGP SIGNATURE-----
iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAl61ktUTHGlkcnlvbW92
QGdtYWlsLmNvbQAKCRBKf944AhHzi3yKB/9s0kZ7fLYtGzqtuoIjualsaM0lsBBS
rWAN4BkIVsxp3eOd5Hdb+ngIY5ykLLcUd+4gKqUNHkB7/1upDq9ZURKlyTwel5Wy
889YEYESCVQQxPVY9KNvafaPeuR++2r9Thlp9hWyczrtvXtz80sFIrtO9TwDrj1P
ZXPN3lxppGlxQiVNQfKIw2Cs78OxaNu9BthXZ7jN2OGaMQ0NU6sZ4LRXz8rbY+od
AbfLEfwz4dPHQ/44k3rQg2IWNuOxRK+CNayxhuN0KWzock3MzGVYoYkPx0wNLiDx
rntMscBqh3kppILZPEIeIA5Nv0yDAf4tf2hcUDf7GoJT/L/f9v7Q2SHa
=75Ca
-----END PGP SIGNATURE-----
Merge tag 'ceph-for-5.7-rc5' of git://github.com/ceph/ceph-client
Pull ceph fixes from Ilya Dryomov:
"Fixes for an endianness handling bug that prevented mounts on
big-endian arches, a spammy log message and a couple error paths.
Also included a MAINTAINERS update"
* tag 'ceph-for-5.7-rc5' of git://github.com/ceph/ceph-client:
ceph: demote quotarealm lookup warning to a debug message
MAINTAINERS: remove myself as ceph co-maintainer
ceph: fix double unlock in handle_cap_export()
ceph: fix special error code in ceph_try_get_caps()
ceph: fix endianness bug when handling MDS session feature bits
This patch rearranges gfs2_add_revoke so that the extra glock
reference is added earlier on in the function to avoid races in which
the glock is freed before the new reference is taken.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Before this patch, function gfs2_quota_unlock checked if quotas are
turned off, and if so, it branched to label out, which called
gfs2_quota_unhold. With the new system of gfs2_qa_get and put, we
no longer want to call gfs2_quota_unhold or we won't balance our
gets and puts.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Before this patch, function gfs2_quota_lock checked if it was called
from a privileged user, and if so, it bypassed the quota check:
superuser can operate outside the quotas.
That's the wrong place for the check because the lock/unlock functions
are separate from the lock_check function, and you can do lock and
unlock without actually checking the quotas.
This patch moves the check to gfs2_quota_lock_check.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
This patch removes a check from gfs2_quota_check for whether quotas
are enabled by the superblock. There is a test just prior for the
GIF_QD_LOCKED bit in the inode, and that can only be set by functions
that already check that quotas are enabled in the superblock.
Therefore, the check is redundant.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Before this patch, gfs2_quota_change() would BUG_ON if the
qa_ref counter was not a positive number. This patch changes it to
be a withdraw instead. That way we can debug things more easily.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
This patch fixes a couple of places in which gfs2_qa_get and gfs2_qa_put are
not balanced: we now keep references around whenever a file is open for writing
(see gfs2_open_common and gfs2_release), so we need to put all references we
grab in function gfs2_create_inode. This was broken in the successful case and
on one error path.
This also means that we don't have a reference to put in gfs2_evict_inode.
In addition, gfs2_qa_put was called for the wrong inode in gfs2_link.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
A misconfigured cephx can easily result in having the kernel client
flooding the logs with:
ceph: Can't lookup inode 1 (err: -13)
Change this message to debug level.
Cc: stable@vger.kernel.org
URL: https://tracker.ceph.com/issues/44546
Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Here are a number of small driver core fixes for 5.7-rc5 to resolve a
bunch of reported issues with the current tree.
Biggest here are the reverts and patches from John Stultz to resolve a
bunch of deferred probe regressions we have been seeing in 5.7-rc right
now.
Along with those are some other smaller fixes:
- coredump crash fix
- devlink fix for when permissive mode was enabled
- amba and platform device dma_parms fixes
- component error silenced for when deferred probe happens
All of these have been in linux-next for a while with no reported
issues.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCXrVnyg8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+ylWBgCfbwjUbsDsHsrsVgWfOakIaoPUQ8IAmwetMKvS
ny1Kq7Cia+2y2e+7fDyo
=UKEM
-----END PGP SIGNATURE-----
Merge tag 'driver-core-5.7-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core fixes from Greg KH:
"Here are a number of small driver core fixes for 5.7-rc5 to resolve a
bunch of reported issues with the current tree.
Biggest here are the reverts and patches from John Stultz to resolve a
bunch of deferred probe regressions we have been seeing in 5.7-rc
right now.
Along with those are some other smaller fixes:
- coredump crash fix
- devlink fix for when permissive mode was enabled
- amba and platform device dma_parms fixes
- component error silenced for when deferred probe happens
All of these have been in linux-next for a while with no reported
issues"
* tag 'driver-core-5.7-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
regulator: Revert "Use driver_deferred_probe_timeout for regulator_init_complete_work"
driver core: Ensure wait_for_device_probe() waits until the deferred_probe_timeout fires
driver core: Use dev_warn() instead of dev_WARN() for deferred_probe_timeout warnings
driver core: Revert default driver_deferred_probe_timeout value to 0
component: Silence bind error on -EPROBE_DEFER
driver core: Fix handling of fw_devlink=permissive
coredump: fix crash when umh is disabled
amba: Initialize dma_parms for amba devices
driver core: platform: Initialize dma_parms for platform devices
Remove duplicate headers which are included twice.
Signed-off-by: Chen Zhou <chenzhou10@huawei.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The random buffer write failure errortag patch introduced a local
mount pointer variable for the test macro, but the macro is compiled
out on !DEBUG kernels. This results in an unused variable warning.
Access the mount structure through the buffer pointer and remove the
local mount pointer to address the warning.
Fixes: 7376d74547 ("xfs: random buffer write failure errortag")
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Remove unnecessary includes from the log recovery code.
Suggested-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Move the helpers that handle incore buffer cancellation records to
xfs_buf_item_recover.c since they're not directly related to the main
log recovery machinery. No functional changes.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
The only purpose of XFS_LI_RECOVERED is to prevent log recovery from
trying to replay recovered intents more than once. Therefore, we can
move the bit setting up to the ->iop_recover caller.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Now that we've made the recovered item tests all the same, we can hoist
the test and the ail locking code to the ->iop_recover caller and call
the recovery function directly.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Rename XFS_{EFI,BUI,RUI,CUI}_RECOVERED to XFS_LI_RECOVERED so that we
track recovery status in the log item, then get rid of the now unused
flags fields in each of those log item types.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
During recovery, every intent that we recover from the log has to be
added to the AIL. Replace the open-coded addition with a helper.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Replace the open-coded AIL item walking with a proper helper when we're
trying to release an intent item that has been finished. We add a new
->iop_match method to decide if an intent item matches a supplied ID.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Now that we've finished converting all types of log intent items to
provide an ->iop_recover function, we can convert the "is this an intent
item?" predicate to look for a non-null iop_recover pointer.
Move the predicate closer to the functions that use it.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Move the code that processes the log items created from the recovered
log items into the per-item source code files and use dispatch functions
to call them. No functional changes.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Move the code that processes the log items created from the recovered
log items into the per-item source code files and use dispatch functions
to call them. No functional changes.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Move the code that processes the log items created from the recovered
log items into the per-item source code files and use dispatch functions
to call them. No functional changes.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Move the code that processes the log items created from the recovered
log items into the per-item source code files and use dispatch functions
to call them. No functional changes.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Quotaoff doesn't actually do anything, so take advantage of the
commit_pass2 pointer being optional and get rid of the switch
statement clause.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Move the bmap update intent and intent-done pass2 commit code into the
per-item source code files and use dispatch functions to call them. We
do these one at a time because there's a lot of code to move. No
functional changes.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Move the refcount update intent and intent-done pass2 commit code into
the per-item source code files and use dispatch functions to call them.
We do these one at a time because there's a lot of code to move. No
functional changes.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Move the rmap update intent and intent-done pass2 commit code into the
per-item source code files and use dispatch functions to call them. We
do these one at a time because there's a lot of code to move. No
functional changes.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Move the extent free intent and intent-done pass2 commit code into the
per-item source code files and use dispatch functions to call them. We
do these one at a time because there's a lot of code to move. No
functional changes.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Move the log icreate item pass2 commit code into the per-item source code
files and use the dispatch function to call it. We do these one at a
time because there's a lot of code to move. No functional changes.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Move the log dquot item pass2 commit code into the per-item source code
files and use the dispatch function to call it. We do these one at a
time because there's a lot of code to move. No functional changes.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Move the log inode item pass2 commit code into the per-item source code
files and use the dispatch function to call it. We do these one at a
time because there's a lot of code to move. No functional changes.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Move the log buffer item pass2 commit code into the per-item source code
files and use the dispatch function to call it. We do these one at a
time because there's a lot of code to move. No functional changes.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Move the pass1 commit code into the per-item source code files and use
the dispatch function to call them.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Move the pass2 readhead code into the per-item source code files and use
the dispatch function to call them.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Create a generic dispatch structure to delegate recovery of different
log item types into various code modules. This will enable us to move
code specific to a particular log item type out of xfs_log_recover.c and
into the log item source.
The first operation we virtualize is the log item sorting.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Remove the old typedefs.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Merge misc fixes from Andrew Morton:
"14 fixes and one selftest to verify the ipc fixes herein"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
mm: limit boost_watermark on small zones
ubsan: disable UBSAN_ALIGNMENT under COMPILE_TEST
mm/vmscan: remove unnecessary argument description of isolate_lru_pages()
epoll: atomically remove wait entry on wake up
kselftests: introduce new epoll60 testcase for catching lost wakeups
percpu: make pcpu_alloc() aware of current gfp context
mm/slub: fix incorrect interpretation of s->offset
scripts/gdb: repair rb_first() and rb_last()
eventpoll: fix missing wakeup for ovflist in ep_poll_callback
arch/x86/kvm/svm/sev.c: change flag passed to GUP fast in sev_pin_memory()
scripts/decodecode: fix trapping instruction formatting
kernel/kcov.c: fix typos in kcov_remote_start documentation
mm/page_alloc: fix watchdog soft lockups during set_zone_contiguous()
mm, memcg: fix error return value of mem_cgroup_css_alloc()
ipc/mqueue.c: change __do_notify() to bypass check_kill_permission()
to support bmap() on compressed inode: if queried block locates in
non-compressed cluster, return its physical block address.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Map normal/compressed cluster of compressed inode correctly, and give
the right fiemap flag FIEMAP_EXTENT_ENCODED on mapped compressed extent.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Remove the pointless string length checks. Just use strcmp().
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch corrects the SPDX License Identifier style in
header files related to F2FS File System support.
For C header files Documentation/process/license-rules.rst
mandates C-like comments (opposed to C source files where
C++ style should be used).
Changes made by using a script provided by Joe Perches here:
https://lkml.org/lkml/2019/2/7/46.
Suggested-by: Joe Perches <joe@perches.com>
Signed-off-by: Nishad Kamdar <nishadkamdar@gmail.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
It turns out that when extending an existing bio, gfs2_find_jhead fails to
check if the block number is consecutive, which leads to incorrect reads for
fragmented journals.
In addition, limit the maximum bio size to an arbitrary value of 2 megabytes:
since commit 07173c3ec2 ("block: enable multipage bvecs"), if we just keep
adding pages until bio_add_page fails, bios will grow much larger than useful,
which pins more memory than necessary with barely any additional performance
gains.
Fixes: f4686c26ec ("gfs2: read journal in large chunks")
Cc: stable@vger.kernel.org # v5.2+
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Make sure we don't walk past the end of the metadata in gfs2_walk_metadata: the
inode holds fewer pointers than indirect blocks.
Slightly clean up gfs2_iomap_get.
Fixes: a27a0c9b6a ("gfs2: gfs2_walk_metadata fix")
Cc: stable@vger.kernel.org # v5.3+
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
When the gfs2_logd daemon withdrew, the withdraw sequence called
into make_fs_ro() to make the file system read-only. That caused the
journal descriptors to be freed. However, those journal descriptors
were used by gfs2_logd's call to gfs2_ail_flush_reqd(). This caused
a use-after free and NULL pointer dereference.
This patch changes function gfs2_logd() so that it stops all logd
work until the thread is told to stop. Once a withdraw is done,
it only does an interruptible sleep.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Before this patch, when the logd daemon was forced to withdraw, it
would try to request its journal be recovered by another cluster node.
However, in single-user cases with lock_nolock, there are no other
nodes to recover the journal. Function signal_our_withdraw() was
recognizing the lock_nolock situation, but not until after it had
evicted its journal inode. Since the journal descriptor that points
to the inode was never removed from the master list, when the unmount
occurred, it did another iput on the evicted inode, which resulted in
a BUG_ON(inode->i_state & I_CLEAR).
This patch moves the check for this situation earlier in function
signal_our_withdraw(), which avoids the extra iput, so the unmount
may happen normally.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Before this patch, if an error was detected from glock function go_sync
by function do_xmote, it would return. But the function had temporarily
unlocked the gl_lockref spin_lock, and it never re-locked it. When the
caller of do_xmote tried to unlock it again, it was already unlocked,
which resulted in a corrupted spin_lock value.
This patch makes sure the gl_lockref spin_lock is re-locked after it is
unlocked.
Thanks to Wu Bo <wubo40@huawei.com> for reporting this problem.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:
struct foo {
int stuff;
struct boo array[];
};
By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.
Also, notice that, dynamic memory allocations won't be affected by
this change:
"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]
sizeof(flexible-array-member) triggers a warning because flexible array
members have incomplete type[1]. There are some instances of code in
which the sizeof operator is being incorrectly/erroneously applied to
zero-length arrays and the result is zero. Such instances may be hiding
some bugs. So, this work (flexible-array member conversions) will also
help to get completely rid of those sorts of issues.
This issue was found with the help of Coccinelle.
[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")
Link: https://lore.kernel.org/r/20200507185230.GA14229@embeddedor
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
<linux/cryptohash.h> sounds very generic and important, like it's the
header to include if you're doing cryptographic hashing in the kernel.
But actually it only includes the library implementation of the SHA-1
compression function (not even the full SHA-1). This should basically
never be used anymore; SHA-1 is no longer considered secure, and there
are much better ways to do cryptographic hashing in the kernel.
Most files that include this header don't actually need it. So in
preparation for removing it, remove all these unneeded includes of it.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Instead of manually allocating a 'struct shash_desc' on the stack and
calling crypto_shash_digest(), switch to using the new helper function
crypto_shash_tfm_digest() which does this for us.
Cc: linux-mtd@lists.infradead.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Instead of manually allocating a 'struct shash_desc' on the stack and
calling crypto_shash_digest(), switch to using the new helper function
crypto_shash_tfm_digest() which does this for us.
Cc: linux-nfs@vger.kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Acked-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Instead of manually allocating a 'struct shash_desc' on the stack and
calling crypto_shash_digest(), switch to using the new helper function
crypto_shash_tfm_digest() which does this for us.
Cc: ecryptfs@vger.kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Instead of manually allocating a 'struct shash_desc' on the stack and
calling crypto_shash_digest(), switch to using the new helper function
crypto_shash_tfm_digest() which does this for us.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
This patch does two things:
- fixes a lost wakeup introduced by commit 339ddb53d3 ("fs/epoll:
remove unnecessary wakeups of nested epoll")
- improves performance for events delivery.
The description of the problem is the following: if N (>1) threads are
waiting on ep->wq for new events and M (>1) events come, it is quite
likely that >1 wakeups hit the same wait queue entry, because there is
quite a big window between __add_wait_queue_exclusive() and the
following __remove_wait_queue() calls in ep_poll() function.
This can lead to lost wakeups, because thread, which was woken up, can
handle not all the events in ->rdllist. (in better words the problem is
described here: https://lkml.org/lkml/2019/10/7/905)
The idea of the current patch is to use init_wait() instead of
init_waitqueue_entry().
Internally init_wait() sets autoremove_wake_function as a callback,
which removes the wait entry atomically (under the wq locks) from the
list, thus the next coming wakeup hits the next wait entry in the wait
queue, thus preventing lost wakeups.
Problem is very well reproduced by the epoll60 test case [1].
Wait entry removal on wakeup has also performance benefits, because
there is no need to take a ep->lock and remove wait entry from the queue
after the successful wakeup. Here is the timing output of the epoll60
test case:
With explicit wakeup from ep_scan_ready_list() (the state of the
code prior 339ddb53d3):
real 0m6.970s
user 0m49.786s
sys 0m0.113s
After this patch:
real 0m5.220s
user 0m36.879s
sys 0m0.019s
The other testcase is the stress-epoll [2], where one thread consumes
all the events and other threads produce many events:
With explicit wakeup from ep_scan_ready_list() (the state of the
code prior 339ddb53d3):
threads events/ms run-time ms
8 5427 1474
16 6163 2596
32 6824 4689
64 7060 9064
128 6991 18309
After this patch:
threads events/ms run-time ms
8 5598 1429
16 7073 2262
32 7502 4265
64 7640 8376
128 7634 16767
(number of "events/ms" represents event bandwidth, thus higher is
better; number of "run-time ms" represents overall time spent
doing the benchmark, thus lower is better)
[1] tools/testing/selftests/filesystems/epoll/epoll_wakeup_test.c
[2] https://github.com/rouming/test-tools/blob/master/stress-epoll.c
Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jason Baron <jbaron@akamai.com>
Cc: Khazhismel Kumykov <khazhy@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Heiher <r@hev.cc>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/20200430130326.1368509-2-rpenyaev@suse.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In the event that we add to ovflist, before commit 339ddb53d3
("fs/epoll: remove unnecessary wakeups of nested epoll") we would be
woken up by ep_scan_ready_list, and did no wakeup in ep_poll_callback.
With that wakeup removed, if we add to ovflist here, we may never wake
up. Rather than adding back the ep_scan_ready_list wakeup - which was
resulting in unnecessary wakeups, trigger a wake-up in ep_poll_callback.
We noticed that one of our workloads was missing wakeups starting with
339ddb53d3 and upon manual inspection, this wakeup seemed missing to me.
With this patch added, we no longer see missing wakeups. I haven't yet
tried to make a small reproducer, but the existing kselftests in
filesystem/epoll passed for me with this patch.
[khazhy@google.com: use if/elif instead of goto + cleanup suggested by Roman]
Link: http://lkml.kernel.org/r/20200424190039.192373-1-khazhy@google.com
Fixes: 339ddb53d3 ("fs/epoll: remove unnecessary wakeups of nested epoll")
Signed-off-by: Khazhismel Kumykov <khazhy@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Roman Penyaev <rpenyaev@suse.de>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Roman Penyaev <rpenyaev@suse.de>
Cc: Heiher <r@hev.cc>
Cc: Jason Baron <jbaron@akamai.com>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/20200424025057.118641-1-khazhy@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There is and has been for a very long time been a lot more going on in
flush_old_exec than just flushing the old state. After the movement
of code from setup_new_exec there is a whole lot more going on than
just flushing the old executables state.
Rename flush_old_exec to begin_new_exec to more accurately reflect
what this function does.
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Greg Ungerer <gerg@linux-m68k.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
The current idiom for the callers is:
flush_old_exec(bprm);
set_personality(...);
setup_new_exec(bprm);
In 2010 Linus split flush_old_exec into flush_old_exec and
setup_new_exec. With the intention that setup_new_exec be what is
called after the processes new personality is set.
Move the code that doesn't depend upon the personality from
setup_new_exec into flush_old_exec. This is to facilitate future
changes by having as much code together in one function as possible.
To see why it is safe to move this code please note that effectively
this change moves the personality setting in the binfmt and the following
three lines of code after everything except unlocking the mutexes:
arch_pick_mmap_layout
arch_setup_new_exec
mm->task_size = TASK_SIZE
The function arch_pick_mmap_layout at most sets:
mm->get_unmapped_area
mm->mmap_base
mm->mmap_legacy_base
mm->mmap_compat_base
mm->mmap_compat_legacy_base
which nothing in flush_old_exec or setup_new_exec depends on.
The function arch_setup_new_exec only sets architecture specific
state and the rest of the functions only deal in state that applies
to all architectures.
The last line just sets mm->task_size and again nothing in flush_old_exec
or setup_new_exec depend on task_size.
Ref: 221af7f87b ("Split 'flush_old_exec' into two functions")
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Greg Ungerer <gerg@linux-m68k.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
At least gcc 8.3 when generating code for x86_64 has a hard time
consolidating multiple calls to current aka get_current(), and winds
up unnecessarily rereading %gs:current_task several times in
setup_new_exec.
Caching the value of current in the local variable of me generates
slightly better and shorter assembly.
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Greg Ungerer <gerg@linux-m68k.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
The two functions are now always called one right after the
other so merge them together to make future maintenance easier.
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Greg Ungerer <gerg@linux-m68k.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Update the comments and make the code easier to understand by
renaming this flag.
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Greg Ungerer <gerg@linux-m68k.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
With install_exec_creds updated to follow immediately after
setup_new_exec, the failure of unshare_sighand is the only
code path where exec_update_mutex is held but not explicitly
unlocked.
Update that code path to explicitly unlock exec_update_mutex.
Remove the unlocking of exec_update_mutex from free_bprm.
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Greg Ungerer <gerg@linux-m68k.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
In 2016 Linus moved install_exec_creds immediately after
setup_new_exec, in binfmt_elf as a cleanup and as part of closing a
potential information leak.
Perform the same cleanup for the other binary formats.
Different binary formats doing the same things the same way makes exec
easier to reason about and easier to maintain.
Greg Ungerer reports:
> I tested the the whole series on non-MMU m68k and non-MMU arm
> (exercising binfmt_flat) and it all tested out with no problems,
> so for the binfmt_flat changes:
Tested-by: Greg Ungerer <gerg@linux-m68k.org>
Ref: 9f834ec18d ("binfmt_elf: switch to new creds when switching to new mm")
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Greg Ungerer <gerg@linux-m68k.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
We currently make some guesses as when to open this fd, but in reality
we have no business (or need) to do so at all. In fact, it makes certain
things fail, like O_PATH.
Remove the fd lookup from these opcodes, we're just passing the 'fd' to
generic helpers anyway. With that, we can also remove the special casing
of fd values in io_req_needs_file(), and the 'fd_non_neg' check that
we have. And we can ensure that we only read sqe->fd once.
This fixes O_PATH usage with openat/openat2, and ditto statx path side
oddities.
Cc: stable@vger.kernel.org: # v5.6
Reported-by: Max Kellermann <mk@cm4all.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
do_splice() is used by io_uring, as will be do_tee(). Move f_mode
checks from sys_{splice,tee}() to do_{splice,tee}(), so they're
enforced for io_uring as well.
Fixes: 7d67af2c01 ("io_uring: add splice(2) support")
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
iget_flags is unused in xfs_imap_to_bp(). Remove the parameter and
fix up the callers.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Allison Collins <allison.henderson@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Both types control shutdown messaging and neither is used in the
current codebase.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Allison Collins <allison.henderson@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Introduce an error tag to randomly fail async buffer writes. This is
primarily to facilitate testing of the XFS error configuration
mechanism.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Allison Collins <allison.henderson@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The stale parameter was used to control the now unused shutdown
parameter of xfs_trans_ail_remove().
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Allison Collins <allison.henderson@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Now that the functions and callers of
xfs_trans_ail_[remove|delete]() have been fixed up appropriately,
the only difference between the two is the shutdown behavior. There
are only a few callers of the _remove() variant, so make the
shutdown conditional on the parameter and combine the two functions.
Suggested-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Allison Collins <allison.henderson@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The shutdown parameter of xfs_trans_ail_remove() is no longer used.
The remaining callers use it for items that legitimately might not
be in the AIL or from contexts where AIL state has already been
checked. Remove the unnecessary parameter and fix up the callers.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Allison Collins <allison.henderson@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Various intent log items call xfs_trans_ail_remove() with a log I/O
error shutdown type, but this helper historically checks whether an
item is in the AIL before calling xfs_trans_ail_delete(). This means
the shutdown check is essentially a no-op for users of
xfs_trans_ail_remove().
It is possible that some items might not be AIL resident when the
AIL remove attempt occurs, but this should be isolated to cases
where the filesystem has already shutdown. For example, this
includes abort of the transaction committing the intent and I/O
error of the iclog buffer committing the intent to the log.
Therefore, update these callsites to use xfs_trans_ail_delete() to
provide AIL state validation for the common path of items being
released and removed when associated done items commit to the
physical log.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Allison Collins <allison.henderson@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Several callers acquire the lock just prior to the call. Callers
that require ->ail_lock for other purposes already check IN_AIL
state and thus don't require the additional shutdown check in the
helper. Push the lock down into xfs_trans_ail_delete(), open code
the instances that still acquire it, and remove the unnecessary ailp
parameter.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Allison Collins <allison.henderson@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The dquot flush handler effectively aborts the dquot flush if the
filesystem is already shut down, but doesn't actually shut down if
the flush fails. Update xfs_qm_dqflush() to consistently abort the
dquot flush and shutdown the fs if the flush fails with an
unexpected error.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Allison Collins <allison.henderson@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The pre-flush dquot verification in xfs_qm_dqflush() duplicates the
read verifier by checking the dquot in the on-disk buffer. Instead,
verify the in-core variant before it is flushed to the buffer.
Fixes: 7224fa482a ("xfs: add full xfs_dqblk verifier")
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Allison Collins <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
At unmount time, XFS emits an alert for every in-core buffer that
might have undergone a write error. In practice this behavior is
probably reasonable given that the filesystem is likely short lived
once I/O errors begin to occur consistently. Under certain test or
otherwise expected error conditions, this can spam the logs and slow
down the unmount.
Now that we have a ratelimit mechanism specifically for buffer
alerts, reuse it for the per-buffer alerts in xfs_wait_buftarg().
Also lift the final repair message out of the loop so it always
prints and assert that the metadata error handling code has shut
down the fs.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Allison Collins <allison.henderson@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
XFS has some inconsistent log message rate limiting with respect to
buffer alerts. The metadata I/O error notification uses the generic
ratelimited alert, the buffer push code uses a custom rate limit and
the similar quiesce time failure checks are not rate limited at all
(when they should be).
The custom rate limit defined in the buf item code is specifically
crafted for buffer alerts. It is more aggressive than generic rate
limiting code because it must accommodate a high frequency of I/O
error events in a relative short timeframe.
Factor out the custom rate limit state from the buf item code into a
per-buftarg rate limit so various alerts are limited based on the
target. Define a buffer alert helper function and use it for the
buffer alerts that are already ratelimited.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Allison Collins <allison.henderson@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The buffer write failure flag is intended to control the internal
write retry that XFS has historically implemented to help mitigate
the severity of transient I/O errors. The flag is set when a buffer
is resubmitted from the I/O completion path due to a previous
failure. It is checked on subsequent I/O completions to skip the
internal retry and fall through to the higher level configurable
error handling mechanism. The flag is cleared in the synchronous and
delwri submission paths and also checked in various places to log
write failure messages.
There are a couple minor problems with the current usage of this
flag. One is that we issue an internal retry after every submission
from xfsaild due to how delwri submission clears the flag. This
results in double the expected or configured number of write
attempts when under sustained failures. Another more subtle issue is
that the flag is never cleared on successful I/O completion. This
can cause xfs_wait_buftarg() to suggest that dirty buffers are being
thrown away due to the existence of the flag, when the reality is
that the flag might still be set because the write succeeded on the
retry.
Clear the write failure flag on successful I/O completion to address
both of these problems. This means that the internal retry attempt
occurs once since the last time a buffer write failed and that
various other contexts only see the flag set when the immediately
previous write attempt has failed.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Allison Collins <allison.henderson@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The shutdown check in xfs_iflush() duplicates checks down in the
buffer code. If the fs is shut down, xfs_trans_read_buf_map() always
returns an error and falls into the same error path. Remove the
unnecessary check along with the warning in xfs_imap_to_bp()
that generates excessive noise in the log if the fs is shut down.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Allison Collins <allison.henderson@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The inode flush code has several layers of error handling between
the inode and cluster flushing code. If the inode flush fails before
acquiring the backing buffer, the inode flush is aborted. If the
cluster flush fails, the current inode flush is aborted and the
cluster buffer is failed to handle the initial inode and any others
that might have been attached before the error.
Since xfs_iflush() is the only caller of xfs_iflush_cluster(), the
error handling between the two can be condensed in the top-level
function. If we update xfs_iflush_int() to always fall through to
the log item update and attach the item completion handler to the
buffer, any errors that occur after the first call to
xfs_iflush_int() can be handled with a buffer I/O failure.
Lift the error handling from xfs_iflush_cluster() into xfs_iflush()
and consolidate with the existing error handling. This also replaces
the need to release the buffer because failing the buffer with
XBF_ASYNC drops the current reference.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Allison Collins <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
We use the same buffer I/O failure code in a few different places.
It's not much code, but it's not necessarily self-explanatory.
Factor it into a helper and document it in one place.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Allison Collins <allison.henderson@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Flush locked log items whose underlying buffers fail metadata
writeback are tagged with a special flag to indicate that the flush
lock is already held. This is currently implemented in the type
specific ->iop_push() callback, but the processing required for such
items is not type specific because we're only doing basic state
management on the underlying buffer.
Factor the failed log item handling out of the inode and dquot
->iop_push() callbacks and open code the buffer resubmit helper into
a single helper called from xfsaild_push_item(). This provides a
generic mechanism for handling failed metadata buffer writeback with
a bit less code.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Allison Collins <allison.henderson@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The DIOREAD_NOLOCK flag has been cleared when doing the test_opt
that is meaningless, so remove the unnecessary test_opt for DIOREAD_NOLOCK.
Signed-off-by: Kaixu Xia <kaixuxia@tencent.com>
Link: https://lore.kernel.org/r/1586751862-19437-1-git-send-email-kaixuxia@tencent.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Simplify the bdi name to mirror what we are doing elsewhere, and
drop them name in favor of just using a number. This avoids a
potentially very long bdi name.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
eventfd is using ->read() as it's file_operations read handler, but
this prevents passing in information about whether a given IO operation
is blocking or not. We can only use the file flags for that. To support
async (-EAGAIN/poll based) retries for io_uring, we need ->read_iter()
support. Convert eventfd to using ->read_iter().
With ->read_iter(), we can support IOCB_NOWAIT. Ensure the fd setup
is done such that we set file->f_mode with FMODE_NOWAIT.
[missing include added]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Make sure we release resources properly if we cannot clean out the COW
extents in preparation for an extent swap.
Fixes: 96987eea53 ("xfs: cancel COW blocks before swapext")
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
If the client attempts BIND_CONN_TO_SESSION on an already bound
connection, it should be either a no-op or an error.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Add filename to states output for ease of debugging.
Signed-off-by: Achilles Gaikwad <agaikwad@redhat.com>
Signed-off-by: Kenneth Dsouza <kdsouza@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
When we decode the stateid we byte-swap si_generation.
But for simplicity's sake and ease of comparison with network traces,
it's better to display the whole thing in network order.
Reported-by: Scott Mayhew <smayhew@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
There's a problem with how I'm formatting stateids. Before I fix it,
I'd like to move the stateid formatting into a common helper.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
After a gfs2 file system withdraw, any attempt to read metadata is
automatically rejected by function gfs2_meta_read() except for reads
of the journal inode. This turns out to be a problem because function
signal_our_withdraw() repeatedly calls check_journal_clean() which reads
the metadata (both its dinode and indirect blocks) to see if the entire
journal is mapped. The dinode read works, but reading the indirect blocks
returns -EIO which gets sent back up and causes a consistency error.
This results in withdraw-from-withdraw, which becomes a deadlock.
This patch changes the test in gfs2_meta_read() to allow all metadata
reads for the journal. Instead of checking the journal block, it now
checks for the journal inode glock which is the same for all blocks in
the journal. This allows check_journal_clean() to properly check the
journal without trying to withdraw recursively.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
As per POSIX, the correct spelling is EACCES:
include/uapi/asm-generic/errno-base.h:#define EACCES 13 /* Permission denied */
Fixes: b8f7442bc4 ("CIFS: refactor cifs_get_inode_info()")
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: Steve French <stfrench@microsoft.com>
There is no logic in elf_fdpic_core_dump itself or in the various arch
helpers called from it which use uaccess routines on kernel pointers
except for the file writes thate are nicely encapsulated by using
__kernel_write in dump_emit.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
There is no logic in elf_core_dump itself or in the various arch helpers
called from it which use uaccess routines on kernel pointers except for
the file writes thate are nicely encapsulated by using __kernel_write in
dump_emit.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The code in binfmt_elf.c is differnt from the rest of the code that
processes siginfo, as it sends siginfo from a kernel buffer to a file
rather than from kernel memory to userspace buffers. To remove it's
use of set_fs the code needs some different siginfo helpers.
Add the helper copy_siginfo_to_external to copy from the kernel's
internal siginfo layout to a buffer in the siginfo layout that
userspace expects.
Modify fill_siginfo_note to use copy_siginfo_to_external instead of
set_fs and copy_siginfo_to_user.
Update compat_binfmt_elf.c to use the previously added
copy_siginfo_to_external32 to handle the compat case.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
If copy_to_user() in io_uring_setup() failed, we'll leak many kernel
resources, which will be recycled until process terminates. This bug
can be reproduced by using mprotect to set params to PROT_READ. To fix
this issue, refactor io_uring_create() a bit to add a new 'struct
io_uring_params __user *params' parameter and move the copy_to_user()
in io_uring_setup() to io_uring_setup(), if copy_to_user() failed,
we can free kernel resource properly.
Suggested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
- Add a SPDX header;
- Adjust document and section titles;
- Use copyright symbol;
- Some whitespace fixes and new line breaks;
- Mark literal blocks as such;
- Add it to filesystems/index.rst.
Also, as this file is alone on its own dir, and it doesn't
seem too likely that other documents will follow it, let's
move it to the filesystems/ root documentation dir.
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Link: https://lore.kernel.org/r/c2424ec2ad4d735751434ff7f52144c44aa02d5a.1588021877.git.mchehab+huawei@kernel.org
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
This document has its own style. It seems to be print output
for the old matrixial printers where backspace were used to
do double prints.
For the conversion, I used several regex expressions to get
rid of some weird stuff. The patch also does almost all possible
conversions in order to get a nice output document, while keeping
it readable/editable as is:
- Add a SPDX header;
- Add a document title;
- Adjust document title;
- Adjust section titles;
- Some whitespace fixes and new line breaks;
- Mark literal blocks as such;
- Adjust list markups;
- Mark some unumbered titles with bold font;
- Use footnoote markups;
- Add table markups;
- Use notes markups;
- Add it to filesystems/index.rst.
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Link: https://lore.kernel.org/r/25c06c40c3d7b947a131c3be124ce0e93cc00ae3.1588021877.git.mchehab+huawei@kernel.org
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
- Add a SPDX header;
- Adjust document and section titles;
- Comment out text ToC for html/pdf output;
- Some whitespace fixes and new line breaks;
- Add table markups;
- Add it to filesystems/index.rst.
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Link: https://lore.kernel.org/r/e33ec382a53cf10ffcbd802f6de3f384159cddba.1588021877.git.mchehab+huawei@kernel.org
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
- Add a SPDX header;
- Adjust document and section titles;
- Comment out text ToC for html/pdf output;
- Some whitespace fixes and new line breaks;
- Adjust the events list to make them look better for html output;
- Add it to filesystems/index.rst.
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Link: https://lore.kernel.org/r/49026a8ea7e714c2e0f003aa26b975b1025476b7.1588021877.git.mchehab+huawei@kernel.org
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Merge in user support for Branch Target Identification, which narrowly
missed the cut for 5.7 after a late ABI concern.
* for-next/bti-user:
arm64: bti: Document behaviour for dynamically linked binaries
arm64: elf: Fix allnoconfig kernel build with !ARCH_USE_GNU_PROPERTY
arm64: BTI: Add Kconfig entry for userspace BTI
mm: smaps: Report arm64 guarded pages in smaps
arm64: mm: Display guarded pages in ptdump
KVM: arm64: BTI: Reset BTYPE when skipping emulated instructions
arm64: BTI: Reset BTYPE when skipping emulated instructions
arm64: traps: Shuffle code to eliminate forward declarations
arm64: unify native/compat instruction skipping
arm64: BTI: Decode BYTPE bits when printing PSTATE
arm64: elf: Enable BTI at exec based on ELF program properties
elf: Allow arch to tweak initial mmap prot flags
arm64: Basic Branch Target Identification support
ELF: Add ELF program property parsing support
ELF: UAPI and Kconfig additions for ELF program properties
If the ceph_mdsc_open_export_target_session() return fails, it will
do a "goto retry", but the session mutex has already been unlocked.
Re-lock the mutex in that case to ensure that we don't unlock it
twice.
Signed-off-by: Wu Bo <wubo40@huawei.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
There are 3 speical error codes: -EAGAIN/-EFBIG/-ESTALE.
After calling try_get_cap_refs, ceph_try_get_caps test for the
-EAGAIN twice. Ensure that it tests for -ESTALE instead.
Signed-off-by: Wu Bo <wubo40@huawei.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Eduard reported a problem mounting cephfs on s390 arch. The feature
mask sent by the MDS is little-endian, so we need to convert it
before storing and testing against it.
Cc: stable@vger.kernel.org
Reported-and-Tested-by: Eduard Shishkin <edward6@linux.ibm.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Instead just call the CDROM layer functionality directly.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Instead just call the CDROM layer functionality directly, and turn the
hot mess in isofs_get_last_session into remotely readable code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Instead just call the CDROM layer functionality directly.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The functionality in xfs_diflags_to_linux() and xfs_diflags_to_iflags() are
nearly identical. The only difference is that *_to_linux() is called after
inode setup and disallows changing the DAX flag.
Combining them can be done with a flag which indicates if this is the initial
setup to allow the DAX flag to be properly set only at init time.
So remove xfs_diflags_to_linux() and call the modified xfs_diflags_to_iflags()
directly.
While we are here simplify xfs_diflags_to_iflags() to take struct xfs_inode and
use xfs_ip2xflags() to ensure future diflags are included correctly.
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
xfs_inode_supports_dax() should reflect if the inode can support DAX not
that it is enabled for DAX.
Change the use of xfs_inode_supports_dax() to reflect only if the inode
and underlying storage support dax.
Add a new function xfs_inode_should_enable_dax() which reflects if the
inode should be enabled for DAX.
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
As agreed upon[1]. We make the dax mount option a tri-state. '-o dax'
continues to operate the same. We add 'always', 'never', and 'inode'
(default).
[1] https://lore.kernel.org/lkml/20200405061945.GA94792@iweiny-DESK2.sc.intel.com/
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
In prep for the new tri-state mount option which then introduces
XFS_MOUNT_DAX_NEVER.
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
An earlier call of xfs_reinit_inode() from xfs_iget_cache_hit() already
handles initialization of i_rwsem.
Doing so again is unneeded.
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Given how XFS is all based around btrees it doesn't make much sense
to offer a totally generic state when we can just use the btree cursor.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
All defer op instance place their own extension of the log item into
the dfp_done field. Replace that with a xfs_log_item to improve type
safety and make the code easier to follow.
Also use the opportunity to improve the ->finish_item calling conventions
to place the done log item as the higher level structure before the
list_entry used for the individual items.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Split out a helper that operates on a single xfs_defer_pending structure
to untangle the code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
All defer op instance place their own extension of the log item into
the dfp_intent field. Replace that with a xfs_log_item to improve type
safety and make the code easier to follow.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
This avoids a per-item indirect call, and also simplifies the interface
a bit.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
These are aways called together, and my merging them we reduce the amount
of indirect calls, improve type safety and in general clean up the code
a bit.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Create a helper that encapsulates the whole logic to create a defer
intent. This reorders some of the work that was done, but none of
that has an affect on the operation as only fields that don't directly
interact are affected.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Split out a xlog_add_buffer_cancelled helper which does the low-level
manipulation of the buffer cancelation table, and in that helper call
xlog_find_buffer_cancelled instead of open coding it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Don't bother to allocate memory and convert the log item when we
only need the block number and the length. Just extract them directly
and call xlog_buf_readahead separately in each branch.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Add a little helper to readahead a buffer if it hasn't been cancelled.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
This list contains pretty much everything that is not a buffer. The
comment calls it item_list, which is a much better name than inode
list, so switch the actual variable name to that as well.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Replace the somewhat convoluted use of xlog_peek_buffer_cancelled and
xlog_check_buffer_cancelled with two obvious helpers:
xlog_is_buffer_cancelled, which returns true if there is a buffer in
the cancellation table, and
xlog_put_buffer_cancelled, which also decrements the reference count
of the buffer cancellation table.
Both share a little helper to look up the entry.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
There are a couple places where we directly call printk_once() and one
of them doesn't follow the standard xfs subsystem printk format as a
result.
#define printk_once variants to go with our existing printk_ratelimited
#defines so we can do one-shot printks in a consistent manner.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
I ran into a linker warning in XFS that originates from a mismatch
between libelf, binutils and objtool when certain files in the kernel
are built with "gcc -g":
x86_64-linux-ld: fs/xfs/xfs_trace.o: unable to initialize decompress status for section .debug_info
After some discussion, nobody could identify why xfs sets this flag
here. CONFIG_XFS_DEBUG used to enable lots of unrelated settings, but
now its main purpose is to enable extra consistency checks and assertions
that are unrelated to the debug info.
Remove the Makefile logic to set the flag here. If anyone relies
on the debug info, this can simply be enabled again with the global
CONFIG_DEBUG_INFO option.
Dave Chinner writes:
I'm pretty sure it was needed for the original kgdb integration back
in the early 2000s. That was when SGI used to patch their XFS dev
tree with kgdb and debug symbols were needed by the custom kgdb
modules that were ported across from the Irix kernel debugger.
ISTR that the early kcrash kernel dump analysis tools (again,
originated from the Irix "icrash" kernel dump tools) had custom XFS
debug scripts that needed also the debug info to work correctly...
Which is a long way of saying "we don't need it anymore" instead of
"nobody knows why it was set"... :)
Suggested-by: Christoph Hellwig <hch@infradead.org>
Link: https://lore.kernel.org/lkml/20200409074130.GD21033@infradead.org/
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Allison Collins <allison.henderson@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Since the "no-allocation" reservations has been removed, the resblks
value should be larger than zero, so remove the unnecessary check.
Signed-off-by: Kaixu Xia <kaixuxia@tencent.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Simplify the setting of the flags value, and only consider
quota enforcement stuff here.
Signed-off-by: Kaixu Xia <kaixuxia@tencent.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The check XFS_IS_QUOTA_RUNNING() has been done when enter the
xfs_qm_vop_create_dqattach() function, it will return directly
if the result is false, so the followed XFS_IS_QUOTA_RUNNING()
assertion is unnecessary. If we truly care about this, the check
also can be added to the condition of next if statements.
Signed-off-by: Kaixu Xia <kaixuxia@tencent.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The initial value of variable udqp is NULL, and we only set the
flag XFS_QMOPT_PQUOTA in xfs_qm_vop_dqalloc() function, so only
the pdqp value is initialized and the udqp value is still NULL.
Since the udqp value is NULL in the rest part of xfs_ioctl_setattr()
function, it is meaningless and do nothing. So remove it from
xfs_ioctl_setattr().
Signed-off-by: Kaixu Xia <kaixuxia@tencent.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
We share an inode between gquota and pquota with the older
superblock that doesn't have separate pquotino, and for the
need_alloc == false case we don't need to call xfs_dir_ialloc()
function, so add the check if reserved free disk blocks is
needed.
Signed-off-by: Kaixu Xia <kaixuxia@tencent.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The two if statements have same condition, and the mask value
does not change in xfs_setattr_nonsize(), so combine them.
Signed-off-by: Kaixu Xia <kaixuxia@tencent.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The trace event xfs_dquot_dqalloc does not depend on the
value uq, so remove the condition, and trace quota allocations
for all quota types.
Signed-off-by: Kaixu Xia <kaixuxia@tencent.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
When we're sorting recovered log items ahead of recovering them and
encounter a log item of unknown type, actually print the type code when
we're rejecting the whole transaction to aid in debugging.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
In order for users to determine if a file is currently operating in DAX
state (effective DAX). Define a statx attribute value and set that
attribute if the effective DAX flag is set.
To go along with this we propose the following addition to the statx man
page:
STATX_ATTR_DAX
The file is in the DAX (cpu direct access) state. DAX state
attempts to minimize software cache effects for both I/O and
memory mappings of this file. It requires a file system which
has been configured to support DAX.
DAX generally assumes all accesses are via cpu load / store
instructions which can minimize overhead for small accesses, but
may adversely affect cpu utilization for large transfers.
File I/O is done directly to/from user-space buffers and memory
mapped I/O may be performed with direct memory mappings that
bypass kernel page cache.
While the DAX property tends to result in data being transferred
synchronously, it does not give the same guarantees of O_SYNC
where data and the necessary metadata are transferred together.
A DAX file may support being mapped with the MAP_SYNC flag,
which enables a program to use CPU cache flush instructions to
persist CPU store operations without an explicit fsync(2). See
mmap(2) for more information.
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The patch which changed cachefiles from calling ->bmap() to using the
bmap() wrapper overwrote the running return value with the result of
calling bmap(). This causes an assertion failure elsewhere in the code.
Fix this by using ret2 rather than ret to hold the return value.
The oops looks like:
kernel BUG at fs/nfs/fscache.c:468!
...
RIP: 0010:__nfs_readpages_from_fscache+0x18b/0x190 [nfs]
...
Call Trace:
nfs_readpages+0xbf/0x1c0 [nfs]
? __alloc_pages_nodemask+0x16c/0x320
read_pages+0x67/0x1a0
__do_page_cache_readahead+0x1cf/0x1f0
ondemand_readahead+0x172/0x2b0
page_cache_async_readahead+0xaa/0xe0
generic_file_buffered_read+0x852/0xd50
? mem_cgroup_commit_charge+0x6e/0x140
? nfs4_have_delegation+0x19/0x30 [nfsv4]
generic_file_read_iter+0x100/0x140
? nfs_revalidate_mapping+0x176/0x2b0 [nfs]
nfs_file_read+0x6d/0xc0 [nfs]
new_sync_read+0x11a/0x1c0
__vfs_read+0x29/0x40
vfs_read+0x8e/0x140
ksys_read+0x61/0xd0
__x64_sys_read+0x1a/0x20
do_syscall_64+0x60/0x1e0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f5d148267e0
Fixes: 10d83e11a5 ("cachefiles: drop direct usage of ->bmap method.")
Reported-by: David Wysochanski <dwysocha@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: David Wysochanski <dwysocha@redhat.com>
cc: Carlos Maiolino <cmaiolino@redhat.com>
The prepare_to_wait() and finish_wait() calls in io_uring_cancel_files()
are mismatched. Currently I don't see any issues related this bug, just
find it by learning codes.
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl6u7jUACgkQxWXV+ddt
WDu6AQ/+K1vegSRJMhG1c0U3XECeYfki7NZVizzMs+G6oCU2LxBPla+qidugc0pA
5wAjP5AFaJQWv9JrVRyBfnvsH9HedL+9fNVmZlWZZ1ujXvZSyArdp5n9IyPCJ926
gA39nHSlcUOYSUfkiU8OqUOTyQjh9ZzSxbqIwsc4lKK9FrcLJ8fLXtbyKjLsxx7A
CTUYmyip6weQvMhQBWMFiN8LLle49s28BBbCfPenD+1sSF0UR6UyrFjDxBqusjkQ
mkoFwgnVLkES6ni1fJSUdDJMOaPkCCwn9EBiTwF29ki2Kbhu/erCHUZ+OLEDUOMg
JqIbAxWmx9+VNthVJWpVjNk9Eojr8LstpItG747DepE3S34bbtTSw9n0Ppp1lNrG
YFAA2ZIyhv5lZaq7f/hxfKQtz3MjsnKDoXZQbVnYh+FOiIssjDrK45UB9FP4Gy5I
nO/AejuOfaBqijz6PLLmHBA/SlsF50ejek32iiQQU+jVb9WGxCYUARXBVSh+7Iw5
PS6KkWQgXePCn3ulIc3eeQDJhP4gY1vCqIUsY5GbM/zHlBP75bDk0qP/kIu2j4yR
2Vrw3sG1tylBTWInjm7HiP9/9ZGy552AVSgqTeiv32VeBZ1hmQP04IbyzqYz4Clq
Qf7TJCDmTJSBr6TfvpsYtTyARhvh0pZ7X1b4Ymm5D/laSWXevf0=
=xn0p
-----END PGP SIGNATURE-----
Merge tag 'for-5.7-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull more btrfs fixes from David Sterba:
"A few more stability fixes, minor build warning fixes and git url
fixup:
- fix partial loss of prealloc extent past i_size after fsync
- fix potential deadlock due to wrong transaction handle passing via
journal_info
- fix gcc 4.8 struct intialization warning
- update git URL in MAINTAINERS entry"
* tag 'for-5.7-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
MAINTAINERS: btrfs: fix git repo URL
btrfs: fix gcc-4.8 build warning for struct initializer
btrfs: transaction: Avoid deadlock due to bad initialization timing of fs_info::journal_info
btrfs: fix partial loss of prealloc extent past i_size after fsync
- Move the FIBMAP range check and warning out of the backend iomap
implementation and into the frontend ioctl_fibmap so that the checking
is consistent for all implementations.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAl6q6lQACgkQ+H93GTRK
tOvt4g/+NlLRvPceod9x7goJGuBAJD3gmuP/Ma7qzFi5YZE7tbbBKikvKWIgtz8l
D4kPRepVTeOCECWzvYwbreqizk0WNr5Buc5Ia3QMPrigIUPomRygvNAcFmLIRF58
VFKIoUupM9oxPbzc5RXLx0QHYanUFZY41AzFTTQb9EGRw+WUzpih6FUxRrra0pFp
c5FN9pUaX7kAaUfryS5oK5f6T1ZmZWXQyaNOv+fXLdtd9eNMUxTOiBr+agZn0Ay3
XIdYWfI2ruyDiYYvaO52NAj9+MRwP9oW0aQLnFHwThv1M4I5qxtg0Ljhl4wT6vq5
VC2HHicETTuN0nTMQo183AU8AS9/SbSaFmgliVGrWiHp+IOyZzEYe3++damAUenH
k9o7un6i8nISVdoGs3U2yv6hJN1vmvWOK4JE26EOU/AfjHyYE8aqNRf4XR/f5bTr
nfD45eoN8V00iCIunL2UhluBeON1+KGUdMevn0ia948I9e5+DVMIsUm+vSf3c0ah
F8oQlGUucApi3KzVA72nmIwG/gP7oUrtjgBKSoRE+W3/ixcy1S5mc0oUYh4I62Ia
Sgv9pHUNwbWSVXfWIx83YmkaJpCurp5VuJy4FWsg6BNCB81lIosSKKjHpwwx3Xyi
19WWxvPFrZ2JxxWp6M5XWvYydQS590Mc5j2ywHluZsrwOVc2UBc=
=6rBo
-----END PGP SIGNATURE-----
Merge tag 'iomap-5.7-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull iomap fix from Darrick Wong:
"Hoist the check for an unrepresentable FIBMAP return value into
ioctl_fibmap.
The internal kernel function can handle 64-bit values (and is needed
to fix a regression on ext4 + jbd2). It is only the userspace ioctl
that is so old that it cannot deal"
* tag 'iomap-5.7-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
fibmap: Warn and return an error in case of block > INT_MAX
Highlights include:
Stable fixes
- fix handling of backchannel binding in BIND_CONN_TO_SESSION
Bugfixes
- Fix a credential use-after-free issue in pnfs_roc()
- Fix potential posix_acl refcnt leak in nfs3_set_acl
- defer slow parts of rpc_free_client() to a workqueue
- Fix an Oopsable race in __nfs_list_for_each_server()
- Fix trace point use-after-free race
- Regression: the RDMA client no longer responds to server disconnect requests
- Fix return values of xdr_stream_encode_item_{present, absent}
- _pnfs_return_layout() must always wait for layoutreturn completion
Cleanups
- Remove unreachable error conditions
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEESQctxSBg8JpV8KqEZwvnipYKAPIFAl6tczsACgkQZwvnipYK
APKHWg//QGx2Tolj5dh2jBHa47A5/SYnJxCZAA0/fWdwRtFkW3HyyGne1jU86do2
SMAVpBpri1WJPt5d3DH66gu4l4UxG1h84s7QP4lGfSa85EmtLh+LoZQCZRqYoDOo
JAMzWctELu1TUpaa1N5Dhg/qMtMy6ulRMWgzTLqB9a/pQa3onugTK6W7xiut2prj
PBfFq7N9XXmPboSeGV9bR4L8XKSbTCLEt3U1F2zAGU7UUINvDfpjEXq7BHYCewKL
ObPW6EWZksyna16H8i/xGWoKgE4JFVjMwQAP7UdDBi+FW9RI6UpTBoR6z9N748j0
jEocDbI21wgnwmtrVTbzsYm6ttHl4D4egoNxn7m5zjxTU4Ba/RQG2aaHUGFOYpJj
1FI1f6V1Y5v4mJajdsEH+pGW/4vK/4YMR+7YHJ/hYU/WiXjLf7onIIifdWt4SQdo
lvZbGcx6IAHYUA4lI7hkcvrK4bbqAnPLFq28nlUWEID5q5D+nA1ZR9iN0FToviDy
FYyhQzyfD1kt98SV1DjWUqvDDd6IB64iDZTXGmtWvj6c2nbezGiFffvtzUL5LFxY
QfI8lkpmUyt1EiWlZWhtOh4zsiM5yMZkJB/3RJv3RMmswizSSAHdgCKWhdLpX0bl
TG1L8yEmcTc5ANS37EhlpcBNbfYw7oIF/OXuReTSRoMQl5hxjfY=
=w0zk
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-5.7-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client bugfixes from Trond Myklebust:
"Highlights include:
Stable fixes:
- fix handling of backchannel binding in BIND_CONN_TO_SESSION
Bugfixes:
- Fix a credential use-after-free issue in pnfs_roc()
- Fix potential posix_acl refcnt leak in nfs3_set_acl
- defer slow parts of rpc_free_client() to a workqueue
- Fix an Oopsable race in __nfs_list_for_each_server()
- Fix trace point use-after-free race
- Regression: the RDMA client no longer responds to server disconnect
requests
- Fix return values of xdr_stream_encode_item_{present, absent}
- _pnfs_return_layout() must always wait for layoutreturn completion
Cleanups:
- Remove unreachable error conditions"
* tag 'nfs-for-5.7-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
NFS: Fix a race in __nfs_list_for_each_server()
NFSv4.1: fix handling of backchannel binding in BIND_CONN_TO_SESSION
SUNRPC: defer slow parts of rpc_free_client() to a workqueue.
NFSv4: Remove unreachable error condition due to rpc_run_task()
SUNRPC: Remove unreachable error condition
xprtrdma: Fix use of xdr_stream_encode_item_{present, absent}
xprtrdma: Fix trace point use-after-free race
xprtrdma: Restore wake-up-all to rpcrdma_cm_event_handler()
nfs: Fix potential posix_acl refcnt leak in nfs3_set_acl
NFS/pnfs: Fix a credential use-after-free issue in pnfs_roc()
NFS/pnfs: Ensure that _pnfs_return_layout() waits for layoutreturn completion
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl6spz8QHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpjHjEACp2V+14XpWl1F6rJpLSq0BJZ3wCReqj7it
tPImiZsx3fLiwslW8IFrDuT1tyCpODOECSA87vXebHjHvgmrbDayrAUJXlyYSk0N
+zwXTg7wH9XQ0CEQbzPIA/DK3evJ/CqRgTAa8r/ZIdm1sx8jIyq2QrwAo9IX7YyG
mQttrm37C4RrzU2dqcp0aBFhmiC6GRI34IYNK6idJ3wUFOCAg1Ur3veX9aG94gaV
cA1P12sSYnIAIAxUko/siPIvtJJ9s1tLJ6UREpqUMzgrfSEhZTPRvyv8xQLmTIl1
BcFj7Y3iorGde5PQUEPYoW7GXydU1LefJLH1C8GAbwRO1YyPD78Rff0sV8Bi0y9Z
hLnnvc7GEII/z0yxqnasEYYlWxhAcusO7HQDf1NMsxfuNXy5ofn1Kfuk5FFEcvj+
AjqvpN+sfJ9GPHrAGNT06kTMV0imCEmxuEanEc7cg1c2nfH4mJqt/vbH9tyD0aFk
JBuOeXToYywRqHHGSGcHGPkClcDoAw6dXqqKdJj6i0ya+EIsP2Ztp40Ae9yCDqew
AhrYQuEsJ7WJvxjogKn8fX8GSRnOJF1jb54pcNffw/e5q04e5YG/ACII+W/L1nPB
81BDcQjzB+f6xNxDZFGh0tQKvuVDe8b//vY+g2v6YoJYcAkLUSjy2FJDpoBjhzUu
03mYIP8kAg==
=cZOE
-----END PGP SIGNATURE-----
Merge tag 'io_uring-5.7-2020-05-01' of git://git.kernel.dk/linux-block
Pull io_uring fixes from Jens Axboe:
- Fix for statx not grabbing the file table, making AT_EMPTY_PATH fail
- Cover a few cases where async poll can handle retry, eliminating the
need for an async thread
- fallback request busy/free fix (Bijan)
- syzbot reported SQPOLL thread exit fix for non-preempt (Xiaoguang)
- Fix extra put of req for sync_file_range (Pavel)
- Always punt splice async. We'll improve this for 5.8, but wanted to
eliminate the inode mutex lock from the non-blocking path for 5.7
(Pavel)
* tag 'io_uring-5.7-2020-05-01' of git://git.kernel.dk/linux-block:
io_uring: punt splice async because of inode mutex
io_uring: check non-sync defer_list carefully
io_uring: fix extra put in sync_file_range()
io_uring: use cond_resched() in io_ring_ctx_wait_and_kill()
io_uring: use proper references for fallback_req locking
io_uring: only force async punt if poll based retry can't handle it
io_uring: enable poll retry for any file with ->read_iter / ->write_iter
io_uring: statx must grab the file table for valid fd
Nonblocking do_splice() still may wait for some time on an inode mutex.
Let's play safe and always punt it async.
Reported-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_req_defer() do double-checked locking. Use proper helpers for that,
i.e. list_empty_careful().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
While working on to make io_uring sqpoll mode support syscalls that need
struct files_struct, I got cpu soft lockup in io_ring_ctx_wait_and_kill(),
while (ctx->sqo_thread && !wq_has_sleeper(&ctx->sqo_wait))
cpu_relax();
above loop never has an chance to exit, it's because preempt isn't enabled
in the kernel, and the context calling io_ring_ctx_wait_and_kill() and
io_sq_thread() run in the same cpu, if io_sq_thread calls a cond_resched()
yield cpu and another context enters above loop, then io_sq_thread() will
always in runqueue and never exit.
Use cond_resched() can fix this issue.
Reported-by: syzbot+66243bb7126c410cefe6@syzkaller.appspotmail.com
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use ctx->fallback_req address for test_and_set_bit_lock() and
clear_bit_unlock().
Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We do blocking retry from our poll handler, if the file supports polled
notifications. Only mark the request as needing an async worker if we
can't poll for it.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We can have files like eventfd where it's perfectly fine to do poll
based retry on them, right now io_file_supports_async() doesn't take
that into account.
Pass in data direction and check the f_op instead of just always needing
an async worker.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The struct nfs_server gets put on the cl_superblocks list before
the server->super field has been initialised, in which case the
call to nfs_sb_active() will Oops. Add a check to ensure that
we skip such a list entry.
Fixes: 3c9e502b59 ("NFS: Add a helper nfs_client_for_each_server()")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
We better warn the fibmap user and not return a truncated and therefore
an incorrect block map address if the bmap() returned block address
is greater than INT_MAX (since user supplied integer pointer).
It's better to pr_warn() all user of ioctl_fibmap() and return a proper
error code rather than silently letting a FS corruption happen if the
user tries to fiddle around with the returned block map address.
We fix this by returning an error code of -ERANGE and returning 0 as the
block mapping address in case if it is > INT_MAX.
Now iomap_bmap() could be called from either of these two paths.
Either when a user is calling an ioctl_fibmap() interface to get
the block mapping address or by some filesystem via use of bmap()
internal kernel API.
bmap() kernel API is well equipped with handling of u64 addresses.
WARN condition in iomap_bmap_actor() was mainly added to warn all
the fibmap users. But now that we have directly added this warning
for all fibmap users and also made sure to return 0 as block map address
in case if addr > INT_MAX.
So we can now remove this logic from iomap_bmap_actor().
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Some older compilers like gcc-4.8 warn about mismatched curly braces in
a initializer:
fs/btrfs/backref.c: In function 'is_shared_data_backref':
fs/btrfs/backref.c:394:9: error: missing braces around
initializer [-Werror=missing-braces]
struct prelim_ref target = {0};
^
fs/btrfs/backref.c:394:9: error: (near initialization for
'target.rbnode') [-Werror=missing-braces]
Use the GNU empty initializer extension to avoid this.
Fixes: ed58f2e66e ("btrfs: backref, don't add refs from shared block when resolving normal backref")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
As of now during open(), we don't pass bunch of flags to underlying
filesystem. O_TRUNC is one of these. Normally this is not a problem as VFS
calls ->setattr() with zero size and underlying filesystem sets file size
to 0.
But when overlayfs is running on top of virtiofs, it has an optimization
where it does not send setattr request to server if dectects that
truncation is part of open(O_TRUNC). It assumes that server already zeroed
file size as part of open(O_TRUNC).
fuse_do_setattr() {
if (attr->ia_valid & ATTR_OPEN) {
/*
* No need to send request to userspace, since actual
* truncation has already been done by OPEN. But still
* need to truncate page cache.
*/
}
}
IOW, fuse expects O_TRUNC to be passed to it as part of open flags.
But currently overlayfs does not pass O_TRUNC to underlying filesystem
hence fuse/virtiofs breaks. Setup overlayfs on top of virtiofs and
following does not zero the file size of a file is either upper only or has
already been copied up.
fd = open(foo.txt, O_TRUNC | O_WRONLY);
There are two ways to fix this. Either pass O_TRUNC to underlying
filesystem or clear ATTR_OPEN from attr->ia_valid so that fuse ends up
sending a SETATTR request to server. Miklos is concerned that O_TRUNC might
have side affects so it is better to clear ATTR_OPEN for now. Hence this
patch clears ATTR_OPEN from attr->ia_valid.
I found this problem while running unionmount-testsuite. With this patch,
unionmount-testsuite passes with overlayfs on top of virtiofs.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Fixes: bccece1ead ("ovl: allow remote upper")
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
ovl_setattr() can be passed an attr which has ATTR_FILE set and
attr->ia_file is a file pointer to overlay file. This is done in
open(O_TRUNC) path.
We should either replace with attr->ia_file with underlying file object or
clear ATTR_FILE so that underlying filesystem does not end up using
overlayfs file object pointer.
There are no good use cases yet so for now clear ATTR_FILE. fuse seems to
be one user which can use this. But it can work even without this. So it
is not mandatory to pass ATTR_FILE to fuse.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Fixes: bccece1ead ("ovl: allow remote upper")
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
With the introduction of exchange_tids thread_group_leader and
has_group_leader_pid have become equivalent. Further at this point in the
code a thread group has exactly two threads, the previous thread_group_leader
that is waiting to be reaped and tsk. So we know it is impossible for tsk to
be the thread_group_leader.
This is also the last user of has_group_leader_pid so removing this check
will allow has_group_leader_pid to be removed.
So remove the "BUG_ON(has_group_leader_pid)" that will never fire.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Pull vfs fixes from Al Viro:
"Two old bugs..."
* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
propagate_one(): mnt_set_mountpoint() needs mount_lock
dlmfs_file_write(): fix the bogosity in handling non-zero *ppos
Commit 6fcf0c72e4, a fix to get_tree_bdev() put a missing blkdev_put() in
the wrong place, before a warnf() that displays the bdev under
consideration rather after it.
This results in a silent lockup in printk("%pg") called via warnf() from
get_tree_bdev() under some circumstances when there's a race with the
blockdev being frozen. This can be caused by xfstests/tests/generic/085 in
combination with Lukas Czerner's ext4 mount API conversion patchset. It
looks like it ought to occur with other users of get_tree_bdev() such as
XFS, but apparently doesn't.
Fix this by switching the order of the lines.
Fixes: 6fcf0c72e4 ("vfs: add missing blkdev_put() in get_tree_bdev()")
Reported-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Ian Kent <raven@themaw.net>
cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When the thread group leader changes during exec and the old leaders
thread is reaped proc_flush_pid will flush the dentries for the entire
process because the leader still has it's original pid.
Fix this by exchanging the pids in an rcu safe manner,
and wrapping the code to do that up in a helper exchange_tids.
When I removed switch_exec_pids and introduced this behavior
in d73d65293e ("[PATCH] pidhash: kill switch_exec_pids") there
really was nothing that cared as flushing happened with
the cached dentry and de_thread flushed both of them on exec.
This lack of fully exchanging pids became a problem a few months later
when I introduced 48e6484d49 ("[PATCH] proc: Rewrite the proc dentry
flush on exit optimization"). Which overlooked the de_thread case
was no longer swapping pids, and I was looking up proc dentries
by task->pid.
The current behavior isn't properly a bug as everything in proc will
continue to work correctly just a little bit less efficiently. Fix
this just so there are no little surprise corner cases waiting to bite
people.
-- Oleg points out this could be an issue in next_tgid in proc where
has_group_leader_pid is called, and reording some of the assignments
should fix that.
-- Oleg points out this will break the 10 year old hack in __exit_signal.c
> /*
> * This can only happen if the caller is de_thread().
> * FIXME: this is the temporary hack, we should teach
> * posix-cpu-timers to handle this case correctly.
> */
> if (unlikely(has_group_leader_pid(tsk)))
> posix_cpu_timers_exit_group(tsk);
The code in next_tgid has been changed to use PIDTYPE_TGID,
and the posix cpu timers code has been fixed so it does not
need the 10 year old hack, so this should be safe to merge
now.
Link: https://lore.kernel.org/lkml/87h7x3ajll.fsf_-_@x220.int.ebiederm.org/
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Fixes: 48e6484d49 ("[PATCH] proc: Rewrite the proc dentry flush on exit optimization").
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Currently, if the client sends BIND_CONN_TO_SESSION with
NFS4_CDFC4_FORE_OR_BOTH but only gets NFS4_CDFS4_FORE back it ignores
that it wasn't able to enable a backchannel.
To make sure, the client sends BIND_CONN_TO_SESSION as the first
operation on the connections (ie., no other session compounds haven't
been sent before), and if the client's request to bind the backchannel
is not satisfied, then reset the connection and retry.
Cc: stable@vger.kernel.org
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Pull in Christoph Hellwig's series that changes the sysctl's ->proc_handler
methods to take kernel pointers instead. It gets rid of the set_fs address
space overrides used by BPF. As per discussion, pull in the feature branch
into bpf-next as it relates to BPF sysctl progs.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200427071508.GV23230@ZenIV.linux.org.uk/T/
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl6m1fcACgkQxWXV+ddt
WDuJAw//WLUlHd/NNlmV92pR0yAqqpBlnYSf/zHKqxFmetZWANiFx7l9+ag03g7o
nVYLdmsj/Y38IuEbwmWP2/0K/gfErdUxs5Qq/eE/Ui10hk+53sUFAiKBMoVdWmta
zt5WlXUc4YBqGMqU15iz7YQfjPZDuinvWgvCEBNAZ66O3cdhcdQRRZtYGGYUJbvA
tUrIejCsTj/U9UfVwgoSC9aAsSnUPL2ef7enxT6iUA/+1bTTBd6dUX+GCzAOnvzJ
ejDWr55wgmrUhhEkDs+0yvEiO+sBXcQM1QJCHfFLp6lddKmPI4G63LLgAT8y5FS7
DA1d2PNVT17yJxVA5E4ahaSGabiL8WjleZFPVURVPuoT867HRDbH2YR6B8QdGNYt
iXu9yPU1CDTnSGiMBj3Q+X6M7w6ABoWr6JEXGX7kfGlpXTFZ2JinzvC615+Ina1u
Vufcwg8kQpF/teIDZYV1U4gXT6y1UFneJYthXCl+Y0DXIeV4pAeAPuLVjL3asSQa
ARgO6LSgeVdYc6kRyxW3wVMBPq0Peygc5iYQo3wEv2zD5vRFlRp/2uF488VaTN4e
OUNBrSJK8luZDUSVH5k9z5MVXH9Dz4HyFqQ9uuV4W7CzcjQlipI1R4Em/j6Ub8g1
l09gu10XQU07LVgrZdSNesAIv4R3/zola+9F320IFimLoPL73KI=
=Tbyq
-----END PGP SIGNATURE-----
Merge tag 'for-5.7-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- regression fixes:
- transaction leak when deleting unused block group
- log cleanup after transaction abort
- fix block group leak when removing fails
- transaction leak if relocation recovery fails
- fix SPDX header
* tag 'for-5.7-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix transaction leak in btrfs_recover_relocation
btrfs: fix block group leak when removing fails
btrfs: drop logs when we've aborted a transaction
btrfs: fix memory leak of transaction when deleting unused block group
btrfs: discard: Use the correct style for SPDX License Identifier
Clay reports that OP_STATX fails for a test case with a valid fd
and empty path:
-- Test 0: statx:fd 3: SUCCEED, file mode 100755
-- Test 1: statx:path ./uring_statx: SUCCEED, file mode 100755
-- Test 2: io_uring_statx:fd 3: FAIL, errno 9: Bad file descriptor
-- Test 3: io_uring_statx:path ./uring_statx: SUCCEED, file mode 100755
This is due to statx not grabbing the process file table, hence we can't
lookup the fd in async context. If the fd is valid, ensure that we grab
the file table so we can grab the file from async context.
Cc: stable@vger.kernel.org # v5.6
Reported-by: Clay Harris <bugs@claycon.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
[BUG]
One run of btrfs/063 triggered the following lockdep warning:
============================================
WARNING: possible recursive locking detected
5.6.0-rc7-custom+ #48 Not tainted
--------------------------------------------
kworker/u24:0/7 is trying to acquire lock:
ffff88817d3a46e0 (sb_internal#2){.+.+}, at: start_transaction+0x66c/0x890 [btrfs]
but task is already holding lock:
ffff88817d3a46e0 (sb_internal#2){.+.+}, at: start_transaction+0x66c/0x890 [btrfs]
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(sb_internal#2);
lock(sb_internal#2);
*** DEADLOCK ***
May be due to missing lock nesting notation
4 locks held by kworker/u24:0/7:
#0: ffff88817b495948 ((wq_completion)btrfs-endio-write){+.+.}, at: process_one_work+0x557/0xb80
#1: ffff888189ea7db8 ((work_completion)(&work->normal_work)){+.+.}, at: process_one_work+0x557/0xb80
#2: ffff88817d3a46e0 (sb_internal#2){.+.+}, at: start_transaction+0x66c/0x890 [btrfs]
#3: ffff888174ca4da8 (&fs_info->reloc_mutex){+.+.}, at: btrfs_record_root_in_trans+0x83/0xd0 [btrfs]
stack backtrace:
CPU: 0 PID: 7 Comm: kworker/u24:0 Not tainted 5.6.0-rc7-custom+ #48
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
Workqueue: btrfs-endio-write btrfs_work_helper [btrfs]
Call Trace:
dump_stack+0xc2/0x11a
__lock_acquire.cold+0xce/0x214
lock_acquire+0xe6/0x210
__sb_start_write+0x14e/0x290
start_transaction+0x66c/0x890 [btrfs]
btrfs_join_transaction+0x1d/0x20 [btrfs]
find_free_extent+0x1504/0x1a50 [btrfs]
btrfs_reserve_extent+0xd5/0x1f0 [btrfs]
btrfs_alloc_tree_block+0x1ac/0x570 [btrfs]
btrfs_copy_root+0x213/0x580 [btrfs]
create_reloc_root+0x3bd/0x470 [btrfs]
btrfs_init_reloc_root+0x2d2/0x310 [btrfs]
record_root_in_trans+0x191/0x1d0 [btrfs]
btrfs_record_root_in_trans+0x90/0xd0 [btrfs]
start_transaction+0x16e/0x890 [btrfs]
btrfs_join_transaction+0x1d/0x20 [btrfs]
btrfs_finish_ordered_io+0x55d/0xcd0 [btrfs]
finish_ordered_fn+0x15/0x20 [btrfs]
btrfs_work_helper+0x116/0x9a0 [btrfs]
process_one_work+0x632/0xb80
worker_thread+0x80/0x690
kthread+0x1a3/0x1f0
ret_from_fork+0x27/0x50
It's pretty hard to reproduce, only one hit so far.
[CAUSE]
This is because we're calling btrfs_join_transaction() without re-using
the current running one:
btrfs_finish_ordered_io()
|- btrfs_join_transaction() <<< Call #1
|- btrfs_record_root_in_trans()
|- btrfs_reserve_extent()
|- btrfs_join_transaction() <<< Call #2
Normally such btrfs_join_transaction() call should re-use the existing
one, without trying to re-start a transaction.
But the problem is, in btrfs_join_transaction() call #1, we call
btrfs_record_root_in_trans() before initializing current::journal_info.
And in btrfs_join_transaction() call #2, we're relying on
current::journal_info to avoid such deadlock.
[FIX]
Call btrfs_record_root_in_trans() after we have initialized
current::journal_info.
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When we have an inode with a prealloc extent that starts at an offset
lower than the i_size and there is another prealloc extent that starts at
an offset beyond i_size, we can end up losing part of the first prealloc
extent (the part that starts at i_size) and have an implicit hole if we
fsync the file and then have a power failure.
Consider the following example with comments explaining how and why it
happens.
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
# Create our test file with 2 consecutive prealloc extents, each with a
# size of 128Kb, and covering the range from 0 to 256Kb, with a file
# size of 0.
$ xfs_io -f -c "falloc -k 0 128K" /mnt/foo
$ xfs_io -c "falloc -k 128K 128K" /mnt/foo
# Fsync the file to record both extents in the log tree.
$ xfs_io -c "fsync" /mnt/foo
# Now do a redudant extent allocation for the range from 0 to 64Kb.
# This will merely increase the file size from 0 to 64Kb. Instead we
# could also do a truncate to set the file size to 64Kb.
$ xfs_io -c "falloc 0 64K" /mnt/foo
# Fsync the file, so we update the inode item in the log tree with the
# new file size (64Kb). This also ends up setting the number of bytes
# for the first prealloc extent to 64Kb. This is done by the truncation
# at btrfs_log_prealloc_extents().
# This means that if a power failure happens after this, a write into
# the file range 64Kb to 128Kb will not use the prealloc extent and
# will result in allocation of a new extent.
$ xfs_io -c "fsync" /mnt/foo
# Now set the file size to 256K with a truncate and then fsync the file.
# Since no changes happened to the extents, the fsync only updates the
# i_size in the inode item at the log tree. This results in an implicit
# hole for the file range from 64Kb to 128Kb, something which fsck will
# complain when not using the NO_HOLES feature if we replay the log
# after a power failure.
$ xfs_io -c "truncate 256K" -c "fsync" /mnt/foo
So instead of always truncating the log to the inode's current i_size at
btrfs_log_prealloc_extents(), check first if there's a prealloc extent
that starts at an offset lower than the i_size and with a length that
crosses the i_size - if there is one, just make sure we truncate to a
size that corresponds to the end offset of that prealloc extent, so
that we don't lose the part of that extent that starts at i_size if a
power failure happens.
A test case for fstests follows soon.
Fixes: 31d11b83b9 ("Btrfs: fix duplicate extents after fsync of file with prealloc extents")
CC: stable@vger.kernel.org # 4.14+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
... to protect the modification of mp->m_count done by it. Most of
the places that modify that thing also have namespace_lock held,
but not all of them can do so, so we really need mount_lock here.
Kudos to Piotr Krysiuk <piotras@gmail.com>, who'd spotted a related
bug in pivot_root(2) (fixed unnoticed in 5.3); search for other
similar turds has caught out this one.
Cc: stable@kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
configfs_rmdir() invokes configfs_get_config_item(), which returns a
reference of the specified config_item object to "parent_item" with
increased refcnt.
When configfs_rmdir() returns, local variable "parent_item" becomes
invalid, so the refcount should be decreased to keep refcount balanced.
The reference counting issue happens in one exception handling path of
configfs_rmdir(). When down_write_killable() fails, the function forgets
to decrease the refcnt increased by configfs_get_config_item(), causing
a refcnt leak.
Fix this issue by calling config_item_put() when down_write_killable()
fails.
Signed-off-by: Xiyu Yang <xiyuyang19@fudan.edu.cn>
Signed-off-by: Xin Tan <tanxin.ctf@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Instead of having all the sysctl handlers deal with user pointers, which
is rather hairy in terms of the BPF interaction, copy the input to and
from userspace in common code. This also means that the strings are
always NUL-terminated by the common code, making the API a little bit
safer.
As most handler just pass through the data to one of the common handlers
a lot of the changes are mechnical.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
There is a specific API to treat raw data as UUID, i.e. import_uuid().
Use it instead of uuid_copy() with explicit casting.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAl6lIA4ACgkQiiy9cAdy
T1FI6QwAg4mCQPvqebKd0/OaJAPne/dzS+iDpxGhCHWjyRYfXwttSHj6HTDjbb20
OMrvOpKR4plV8LQOXyzbI7rJvDcL1UFbcBxUQUEp9I7BuVbKhE/7CWcBPc2bMiKF
1yJhUHUjsSMP35H4f3w8J+eKzXcJnXljsruI61FVn4kagRzsUrTOfyhtdfcobPHA
0o0eZPPhAmoN2Vaf8jpVDEECHotbIKRr6hwN4/lPiOjVvqmHbi42RFmn06rlKqWA
FBJqYKHK9VyL6458nTego5BXoJ4DSVf28Ow367sYFekpqA2eENfKRIHZ/feBzTH+
GOn44GJqMcpMXkGgMuR7qMk8wi+nYTBrGXgpXjD3Yw/mHLiPbmscrudwZ30HQ5Rr
1tgEgFd064gCzA/sm8MmAzSo5Du9oGyabuDewoatKHztNLZA9jMCO/kvuYoCtnLW
vwlPcnedl4fUir3sdzU9JwHxhcoiAREktqQCXWVew9FGedvdfxVDuPMejayrND9k
KK6zbll3
=x+F1
-----END PGP SIGNATURE-----
Merge tag '5.7-rc2-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fixes from Steve French:
"Five cifs/smb3 fixes:two for DFS reconnect failover, one lease fix for
stable and the others to fix a missing spinlock during reconnect"
* tag '5.7-rc2-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
cifs: fix uninitialised lease_key in open_shroot()
cifs: ensure correct super block for DFS reconnect
cifs: do not share tcons with DFS
cifs: minor update to comments around the cifs_tcp_ses_lock mutex
cifs: protect updating server->dstaddr with a spinlock
Here are some small firmware/driver core/debugfs fixes for 5.7-rc3.
The debugfs change is now possible as now the last users of
debugfs_create_u32() have been fixed up in the different trees that got
merged into 5.7-rc1, and I don't want it creeping back in.
The firmware changes did cause a regression in linux-next, so the final
patch here reverts part of that, re-exporting the symbol to resolve that
issue. All of these patches, with the exception of the final one, have
been in linux-next with only that one reported issue.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCXqVliw8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+ymf6ACfS5HoPt+kWKtfKteN/mt6WUeJz6oAoMDg4Qvf
4ncqmH9jt0lj5NAwHxFi
=DP2q
-----END PGP SIGNATURE-----
Merge tag 'driver-core-5.7-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core fixes from Greg KH:
"Here are some small firmware/driver core/debugfs fixes for 5.7-rc3.
The debugfs change is now possible as now the last users of
debugfs_create_u32() have been fixed up in the different trees that
got merged into 5.7-rc1, and I don't want it creeping back in.
The firmware changes did cause a regression in linux-next, so the
final patch here reverts part of that, re-exporting the symbol to
resolve that issue. All of these patches, with the exception of the
final one, have been in linux-next with only that one reported issue"
* tag 'driver-core-5.7-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
firmware_loader: revert removal of the fw_fallback_config export
debugfs: remove return value of debugfs_create_u32()
firmware_loader: remove unused exports
firmware: imx: fix compile-testing
Pull pid leak fix from Eric Biederman:
"Oleg noticed that put_pid(thread_pid) was not getting called when proc
was not compiled in.
Let's get that fixed before 5.7 is released and causes problems for
anyone"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
proc: Put thread_pid in release_task not proc_flush_pid
nfs4_proc_layoutget() invokes rpc_run_task(), which return the value to
"task". Since rpc_run_task() is impossible to return an ERR pointer,
there is no need to add the IS_ERR() condition on "task" here. So we
need to remove it.
Signed-off-by: Xiyu Yang <xiyuyang19@fudan.edu.cn>
Signed-off-by: Xin Tan <tanxin.ctf@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Combine the pid_task and thes test has_group_leader_pid into a single
dereference by using pid_task(PIDTYPE_TGID).
This makes the code simpler and proof against needing to even think
about any shenanigans that de_thread might get up to.
Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Alexey Gladkov <gladkov.alexey@gmail.com> writes:
Procfs modernization:
---------------------
Historically procfs was always tied to pid namespaces, during pid
namespace creation we internally create a procfs mount for it. However,
this has the effect that all new procfs mounts are just a mirror of the
internal one, any change, any mount option update, any new future
introduction will propagate to all other procfs mounts that are in the
same pid namespace.
This may have solved several use cases in that time. However today we
face new requirements, and making procfs able to support new private
instances inside same pid namespace seems a major point. If we want to
to introduce new features and security mechanisms we have to make sure
first that we do not break existing usecases. Supporting private procfs
instances will allow to support new features and behaviour without
propagating it to all other procfs mounts.
Today procfs is more of a burden especially to some Embedded, IoT,
sandbox, container use cases. In user space we are over-mounting null
or inaccessible files on top to hide files and information. If we want
to hide pids we have to create PID namespaces otherwise mount options
propagate to all other proc mounts, changing a mount option value in one
mount will propagate to all other proc mounts. If we want to introduce
new features, then they will propagate to all other mounts too, resulting
either maybe new useful functionality or maybe breaking stuff. We have
also to note that userspace should not workaround procfs, the kernel
should just provide a sane simple interface.
In this regard several developers and maintainers pointed out that
there are problems with procfs and it has to be modernized:
"Here's another one: split up and modernize /proc." by Andy Lutomirski [1]
Discussion about kernel pointer leaks:
"And yes, as Kees and Daniel mentioned, it's definitely not just dmesg.
In fact, the primary things tend to be /proc and /sys, not dmesg
itself." By Linus Torvalds [2]
Lot of other areas in the kernel and filesystems have been updated to be
able to support private instances, devpts is one major example [3].
Which will be used for:
1) Embedded systems and IoT: usually we have one supervisor for
apps, we have some lightweight sandbox support, however if we create
pid namespaces we have to manage all the processes inside too,
where our goal is to be able to run a bunch of apps each one inside
its own mount namespace, maybe use network namespaces for vlans
setups, but right now we only want mount namespaces, without all the
other complexity. We want procfs to behave more like a real file system,
and block access to inodes that belong to other users. The 'hidepid=' will
not work since it is a shared mount option.
2) Containers, sandboxes and Private instances of file systems - devpts case
Historically, lot of file systems inside Linux kernel view when instantiated
were just a mirror of an already created and mounted filesystem. This was the
case of devpts filesystem, it seems at that time the requirements were to
optimize things and reuse the same memory, etc. This design used to work but not
anymore with today's containers, IoT, hostile environments and all the privacy
challenges that Linux faces.
In that regards, devpts was updated so that each new mounts is a total
independent file system by the following patches:
"devpts: Make each mount of devpts an independent filesystem" by
Eric W. Biederman [3] [4]
3) Linux Security Modules have multiple ptrace paths inside some
subsystems, however inside procfs, the implementation does not guarantee
that the ptrace() check which triggers the security_ptrace_check() hook
will always run. We have the 'hidepid' mount option that can be used to
force the ptrace_may_access() check inside has_pid_permissions() to run.
The problem is that 'hidepid' is per pid namespace and not attached to
the mount point, any remount or modification of 'hidepid' will propagate
to all other procfs mounts.
This also does not allow to support Yama LSM easily in desktop and user
sessions. Yama ptrace scope which restricts ptrace and some other
syscalls to be allowed only on inferiors, can be updated to have a
per-task context, where the context will be inherited during fork(),
clone() and preserved across execve(). If we support multiple private
procfs instances, then we may force the ptrace_may_access() on
/proc/<pids>/ to always run inside that new procfs instances. This will
allow to specifiy on user sessions if we should populate procfs with
pids that the user can ptrace or not.
By using Yama ptrace scope, some restricted users will only be able to see
inferiors inside /proc, they won't even be able to see their other
processes. Some software like Chromium, Firefox's crash handler, Wine
and others are already using Yama to restrict which processes can be
ptracable. With this change this will give the possibility to restrict
/proc/<pids>/ but more importantly this will give desktop users a
generic and usuable way to specifiy which users should see all processes
and which user can not.
Side notes:
* This covers the lack of seccomp where it is not able to parse
arguments, it is easy to install a seccomp filter on direct syscalls
that operate on pids, however /proc/<pid>/ is a Linux ABI using
filesystem syscalls. With this change all LSMs should be able to analyze
open/read/write/close... on /proc/<pid>/
4) This will allow to implement new features either in kernel or
userspace without having to worry about procfs.
In containers, sandboxes, etc we have workarounds to hide some /proc
inodes, this should be supported natively without doing extra complex
work, the kernel should be able to support sane options that work with
today and future Linux use cases.
5) Creation of new superblock with all procfs options for each procfs
mount will fix the ignoring of mount options. The problem is that the
second mount of procfs in the same pid namespace ignores the mount
options. The mount options are ignored without error until procfs is
remounted.
Before:
proc /proc proc rw,relatime,hidepid=2 0 0
mount("proc", "/tmp/proc", "proc", 0, "hidepid=1") = 0
+++ exited with 0 +++
proc /proc proc rw,relatime,hidepid=2 0 0
proc /tmp/proc proc rw,relatime,hidepid=2 0 0
proc /proc proc rw,relatime,hidepid=1 0 0
proc /tmp/proc proc rw,relatime,hidepid=1 0 0
After:
proc /proc proc rw,relatime,hidepid=ptraceable 0 0
proc /proc proc rw,relatime,hidepid=ptraceable 0 0
proc /tmp/proc proc rw,relatime,hidepid=invisible 0 0
Introduced changes:
-------------------
Each mount of procfs creates a separate procfs instance with its own
mount options.
This series adds few new mount options:
* New 'hidepid=ptraceable' or 'hidepid=4' mount option to show only ptraceable
processes in the procfs. This allows to support lightweight sandboxes in
Embedded Linux, also solves the case for LSM where now with this mount option,
we make sure that they have a ptrace path in procfs.
* 'subset=pid' that allows to hide non-pid inodes from procfs. It can be used
in containers and sandboxes, as these are already trying to hide and block
access to procfs inodes anyway.
ChangeLog:
----------
* Rebase on top of v5.7-rc1.
* Fix a resource leak if proc is not mounted or if proc is simply reconfigured.
* Add few selftests.
* After a discussion with Eric W. Biederman, the numerical values for hidepid
parameter have been removed from uapi.
* Remove proc_self and proc_thread_self from the pid_namespace struct.
* I took into account the comment of Kees Cook.
* Update Reviewed-by tags.
* 'subset=pidfs' renamed to 'subset=pid' as suggested by Alexey Dobriyan.
* Include Reviewed-by tags.
* Rebase on top of Eric W. Biederman's procfs changes.
* Add human readable values of 'hidepid' as suggested by Andy Lutomirski.
* Started using RCU lock to clean dcache entries as suggested by Linus Torvalds.
* 'pidonly=1' renamed to 'subset=pidfs' as suggested by Alexey Dobriyan.
* HIDEPID_* moved to uapi/ as they are user interface to mount().
Suggested-by Alexey Dobriyan <adobriyan@gmail.com>
* 'hidepid=' and 'gid=' mount options are moved from pid namespace to superblock.
* 'newinstance' mount option removed as suggested by Eric W. Biederman.
Mount of procfs always creates a new instance.
* 'limit_pids' renamed to 'hidepid=3'.
* I took into account the comment of Linus Torvalds [7].
* Documentation added.
* Fixed a bug that caused a problem with the Fedora boot.
* The 'pidonly' option is visible among the mount options.
* Renamed mount options to 'newinstance' and 'pids='
Suggested-by: Andy Lutomirski <luto@kernel.org>
* Fixed order of commit, Suggested-by: Andy Lutomirski <luto@kernel.org>
* Many bug fixes.
* Removed 'unshared' mount option and replaced it with 'limit_pids'
which is attached to the current procfs mount.
Suggested-by Andy Lutomirski <luto@kernel.org>
* Do not fill dcache with pid entries that we can not ptrace.
* Many bug fixes.
References:
-----------
[1] https://lists.linuxfoundation.org/pipermail/ksummit-discuss/2017-January/004215.html
[2] http://www.openwall.com/lists/kernel-hardening/2017/10/05/5
[3] https://lwn.net/Articles/689539/
[4] http://lxr.free-electrons.com/source/Documentation/filesystems/devpts.txt?v=3.14
[5] https://lkml.org/lkml/2017/5/2/407
[6] https://lkml.org/lkml/2017/5/3/357
[7] https://lkml.org/lkml/2018/5/11/505
Alexey Gladkov (7):
proc: rename struct proc_fs_info to proc_fs_opts
proc: allow to mount many instances of proc in one pid namespace
proc: instantiate only pids that we can ptrace on 'hidepid=4' mount
option
proc: add option to mount only a pids subset
docs: proc: add documentation for "hidepid=4" and "subset=pid" options
and new mount behavior
proc: use human-readable values for hidepid
proc: use named enums for better readability
Documentation/filesystems/proc.rst | 92 +++++++++---
fs/proc/base.c | 48 +++++--
fs/proc/generic.c | 9 ++
fs/proc/inode.c | 30 +++-
fs/proc/root.c | 131 +++++++++++++-----
fs/proc/self.c | 6 +-
fs/proc/thread_self.c | 6 +-
fs/proc_namespace.c | 14 +-
include/linux/pid_namespace.h | 12 --
include/linux/proc_fs.h | 30 +++-
tools/testing/selftests/proc/.gitignore | 2 +
tools/testing/selftests/proc/Makefile | 2 +
.../selftests/proc/proc-fsconfig-hidepid.c | 50 +++++++
.../selftests/proc/proc-multiple-procfs.c | 48 +++++++
14 files changed, 384 insertions(+), 96 deletions(-)
create mode 100644 tools/testing/selftests/proc/proc-fsconfig-hidepid.c
create mode 100644 tools/testing/selftests/proc/proc-multiple-procfs.c
Link: https://lore.kernel.org/lkml/20200419141057.621356-1-gladkov.alexey@gmail.com/
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Oleg pointed out that in the unlikely event the kernel is compiled
with CONFIG_PROC_FS unset that release_task will now leak the pid.
Move the put_pid out of proc_flush_pid into release_task to fix this
and to guarantee I don't make that mistake again.
When possible it makes sense to keep get and put in the same function
so it can easily been seen how they pair up.
Fixes: 7bc3e6e55a ("proc: Use a list of inodes to flush from proc")
Reported-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl6jKZkQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpkqsEACnY1xBZfO3tw0x+XqIQW1qqtls8/buMKen
Iqo2XOJZNMgjMO6T5naPblh1f3JxUVihR8NE3PSm8ZERIl6Xq9YesXATFsC1C+sH
giR0O4ae7lkYRrlNNvo+K9BmS90AwzTYb73imDFmt+/BuySY67rysN4Gv0q+ySWZ
1zDdyK8R7v/WX33h0nrP9g2zG4yrYtpWXyeR26aK/BtdVv/rJqu9EiD6Kaz3oHgh
JI2XLmuDB4d9evUfL9rW0lGd+R0uQUBVj2r9J8x9Ff176OjVhr1cPcbU2Dc/Ldnd
0Qe1mJ3LcSEvjHrJ84J4C0wRyFiArqbFw8Fy560VDtpgS/44V8j0W5Edh6zNGehY
xS0NxZfTPaqM5sGKafnaqBfOnrhlZOCcqrDAGe7djsGARGrbzsERpzv4TuBOE+gJ
hxf9MDYZdIW5QVWmKpTIqAJZfCg3h+Lv/EHhp0Dqv2lIPkWmEHDF3mggej/vcfJ1
1YEvfIM1TdeEfQPcauqggR8Yo0vUXIfobaJw99R+BwEmowNYbvE4/jH183PgjzSn
R9xojcDOxo2x1ITCp2YkF+GQ6k2ZXL5v4mEf9zY9C2QiCkhOdzOtecfvQ/wL4/r3
JZlPpNd+Tw2bXRtIu6ZNq1q1/l93byv4ps6NPvEeGna1klzzCiAZnO71Ln5bWUdu
2YJoHRfI2g==
=L+dP
-----END PGP SIGNATURE-----
Merge tag 'io_uring-5.7-2020-04-24' of git://git.kernel.dk/linux-block
Pull io_uring fix from Jens Axboe:
"Single fixup for a change that went into -rc2"
* tag 'io_uring-5.7-2020-04-24' of git://git.kernel.dk/linux-block:
io_uring: only restore req->work for req that needs do completion
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEqG5UsNXhtOCrfGQP+7dXa6fLC2sFAl6jDR0ACgkQ+7dXa6fL
C2ttIg//Zz6bEpu7BAdvrXmUCfcYbI4gbVRPEFcAz4/z8c05UJXdkps2oVj1sKmb
hLRBIxArRo7tcdziIdwwk8fckaW1i60wXfsiaAEyxPBuW+oB6fEUqoEmshUjw36u
lzseygJnyKNKNX8B6MSYz3NQv5kaVefD6UoQ84+3m7Me/AJx9s+LZEUTrvlz5Myy
BbE19Jnx5SlgqkVyuis6FQ0u+cXUdVleIm3LFzzbaP9syLlsleAJjXU3EPM3/mzK
BcV77DhMGJhKZ0DhFuUkKE1EUslR4vJiV7gDMdyJKuSTlIU+1IGYWiI6XPyk/BLH
trpSDHe8DuCCGPmQCQPM4XxfQJVlnKej+sFoUeqCShndkK9ayTuYot5eARbqGj4x
SEVQ6PWgnLcWtSuxQDIWJBBWZPJZ8/v3yDld0ij95wbGqAywnsiVBt85XPK4Ccje
ew3urAK52wlQxwy2U+Rn39hzLi6vCx0Z3ncJ/ak5TarcL8txQhCOcKukTB7Wa4Ie
MKW+IANoYvLgFmbnXLlsBBxpcewNwxQhklMkSx5G+3EnWxXIOqRPumOPxV2UfYrA
Mgv3F1PZo9Q3SU6eb8lGIYyeho0+6qV/OZzmcy6Xl8nNHeJXZ9eGsSYSlKYUQ7WI
rum/g7UPBxni7wkyJxrn90yxirFG81Dm4216ThKGSQ6Mu5pDmBA=
=tTG+
-----END PGP SIGNATURE-----
Merge tag 'afs-fixes-20200424' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
Pull misc AFS fixes from David Howells:
"Three miscellaneous fixes to the afs filesystem:
- Remove some struct members that aren't used, aren't set or aren't
read, plus a wake up that nothing ever waits for.
- Actually set the AFS_SERVER_FL_HAVE_EPOCH flag so that the code
that depends on it can work.
- Make a couple of waits uninterruptible if they're done for an
operation that isn't supposed to be interruptible"
* tag 'afs-fixes-20200424' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
afs: Make record checking use TASK_UNINTERRUPTIBLE when appropriate
afs: Fix to actually set AFS_SERVER_FL_HAVE_EPOCH
afs: Remove some unused bits
When an operation is meant to be done uninterruptibly (such as
FS.StoreData), we should not be allowing volume and server record checking
to be interrupted.
Fixes: d2ddc776a4 ("afs: Overhaul volume and server record caching and fileserver rotation")
Signed-off-by: David Howells <dhowells@redhat.com>
AFS keeps track of the epoch value from the rxrpc protocol to note (a) when
a fileserver appears to have restarted and (b) when different endpoints of
a fileserver do not appear to be associated with the same fileserver
(ie. all probes back from a fileserver from all of its interfaces should
carry the same epoch).
However, the AFS_SERVER_FL_HAVE_EPOCH flag that indicates that we've
received the server's epoch is never set, though it is used.
Fix this to set the flag when we first receive an epoch value from a probe
sent to the filesystem client from the fileserver.
Fixes: 3bf0fb6f33 ("afs: Probe multiple fileservers simultaneously")
Signed-off-by: David Howells <dhowells@redhat.com>
Remove three bits:
(1) afs_server::no_epoch is neither set nor used.
(2) afs_server::have_result is set and a wakeup is applied to it, but
nothing looks at it or waits on it.
(3) afs_vl_dump_edestaddrreq() prints afs_addr_list::probed, but nothing
sets it for VL servers.
Signed-off-by: David Howells <dhowells@redhat.com>
f2fs_quota_sync() uses f2fs_lock_op() before flushing dirty pages, but
f2fs_write_data_page() returns EAGAIN.
Likewise dentry blocks, we can just bypass getting the lock, since quota
blocks are also maintained by checkpoint.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
'count' is how much you want written, not the final position.
Moreover, it can legitimately be less than the current position...
Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
- Address several use-after-free and memory leak bugs
- Prevent a backchannel livelock
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIcBAABAgAGBQJemdwYAAoJEDNqszNvZn+XU1oQAKOm9vypO6w252kXdhFSxAlB
3tMxXALNDrFP3PXsKCa/sKKMRvkUkx+9pdnTuXDPvffd3ZgyB8DzJilryEtiqT4Y
JsuoWHg2QyNeKUFGmtZ5AsefPaR8WL/aiYPTi1PUqnq4rNPjAgOGgLUv+LME2jFU
Yx773d5CNHXDq6zv1Au0128URnQZDy/7URdfgX1FhLA8aQWjiG08fhBEGncXjV/X
mo3RMCwE2uzNRruW7OJyCehb8d+IKBDZ0LEeZDW/ve4hNtL+Ke5eCEoemYtUN07e
U3gRMB8Pt+55L+ZFP8KJYOtfRx2SkOTMcbASC2z/WECq5vumGmn4WovSSVJFGIUN
5WVf8ADM2w3RmTFh11Jl5mZnziGRNY/4hAW7PrR4ZDhJxjdKA+iLLd7571kkCE63
II6qxw/WV7Yz3T6v4BoOcDf1DOylnS1JXqmPGYia2aAhyFZgRVasOVIkB0meaaFe
zSKzKsTrir1Ru8/xt5zIgyEQwqATp2rwzkoPuTeQZLOht0fsSIGBpD1ZWXUaMAji
cfojhd4731cvoxMMGG27IMiHTG6rpKneaZ21Z/7/61P+cjHm/ITOLZzzRvhQMQU7
wuskRf3KTs+3k4x6P9E0qQU1DcJkPSYGq+JDdh389Plald4MLTAZYjIK+J3X35oL
QNnUeKzr1YhWWqgchthG
=Zoup
-----END PGP SIGNATURE-----
Merge tag 'nfsd-5.7-rc-1' of git://git.linux-nfs.org/projects/cel/cel-2.6
Pull nfsd fixes from Chuck Lever:
"The first set of 5.7-rc fixes for NFS server issues.
These were all unresolved at the time the 5.7 window opened, and
needed some additional time to ensure they were correctly addressed.
They are ready now.
At the moment I know of one more urgent issue regarding the NFS
server. A fix has been tested and is under review. I expect to send
one more pull request, containing this fix (which now consists of 3
patches).
Fixes:
- Address several use-after-free and memory leak bugs
- Prevent a backchannel livelock"
* tag 'nfsd-5.7-rc-1' of git://git.linux-nfs.org/projects/cel/cel-2.6:
svcrdma: Fix leak of svc_rdma_recv_ctxt objects
svcrdma: Fix trace point use-after-free race
SUNRPC: Fix backchannel RPC soft lockups
SUNRPC/cache: Fix unsafe traverse caused double-free in cache_purge
nfsd: memory corruption in nfsd4_lock()
btrfs_recover_relocation() invokes btrfs_join_transaction(), which joins
a btrfs_trans_handle object into transactions and returns a reference of
it with increased refcount to "trans".
When btrfs_recover_relocation() returns, "trans" becomes invalid, so the
refcount should be decreased to keep refcount balanced.
The reference counting issue happens in one exception handling path of
btrfs_recover_relocation(). When read_fs_root() failed, the refcnt
increased by btrfs_join_transaction() is not decreased, causing a refcnt
leak.
Fix this issue by calling btrfs_end_transaction() on this error path
when read_fs_root() failed.
Fixes: 79787eaab4 ("btrfs: replace many BUG_ONs with proper error handling")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Xiyu Yang <xiyuyang19@fudan.edu.cn>
Signed-off-by: Xin Tan <tanxin.ctf@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_remove_block_group() invokes btrfs_lookup_block_group(), which
returns a local reference of the block group that contains the given
bytenr to "block_group" with increased refcount.
When btrfs_remove_block_group() returns, "block_group" becomes invalid,
so the refcount should be decreased to keep refcount balanced.
The reference counting issue happens in several exception handling paths
of btrfs_remove_block_group(). When those error scenarios occur such as
btrfs_alloc_path() returns NULL, the function forgets to decrease its
refcnt increased by btrfs_lookup_block_group() and will cause a refcnt
leak.
Fix this issue by jumping to "out_put_group" label and calling
btrfs_put_block_group() when those error scenarios occur.
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Xiyu Yang <xiyuyang19@fudan.edu.cn>
Signed-off-by: Xin Tan <tanxin.ctf@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Dave reported a problem where we were panicing with generic/475 with
misc-5.7. This is because we were doing IO after we had stopped all of
the worker threads, because we do the log tree cleanup on roots at drop
time. Cleaning up the log tree will always need to do reads if we
happened to have evicted the blocks from memory.
Because of this simply add a helper to btrfs_cleanup_transaction() that
will go through and drop all of the log roots. This gets run before we
do the close_ctree() work, and thus we are allowed to do any reads that
we would need. I ran this through many iterations of generic/475 with
constrained memory and I did not see the issue.
general protection fault, probably for non-canonical address 0x6b6b6b6b6b6b6b6b: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
CPU: 2 PID: 12359 Comm: umount Tainted: G W 5.6.0-rc7-btrfs-next-58 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
RIP: 0010:btrfs_queue_work+0x33/0x1c0 [btrfs]
RSP: 0018:ffff9cfb015937d8 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff8eb5e339ed80 RCX: 0000000000000000
RDX: 0000000000000001 RSI: ffff8eb5eb33b770 RDI: ffff8eb5e37a0460
RBP: ffff8eb5eb33b770 R08: 000000000000020c R09: ffffffff9fc09ac0
R10: 0000000000000007 R11: 0000000000000000 R12: 6b6b6b6b6b6b6b6b
R13: ffff9cfb00229040 R14: 0000000000000008 R15: ffff8eb5d3868000
FS: 00007f167ea022c0(0000) GS:ffff8eb5fae00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f167e5e0cb1 CR3: 0000000138c18004 CR4: 00000000003606e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
btrfs_end_bio+0x81/0x130 [btrfs]
__split_and_process_bio+0xaf/0x4e0 [dm_mod]
? percpu_counter_add_batch+0xa3/0x120
dm_process_bio+0x98/0x290 [dm_mod]
? generic_make_request+0xfb/0x410
dm_make_request+0x4d/0x120 [dm_mod]
? generic_make_request+0xfb/0x410
generic_make_request+0x12a/0x410
? submit_bio+0x38/0x160
submit_bio+0x38/0x160
? percpu_counter_add_batch+0xa3/0x120
btrfs_map_bio+0x289/0x570 [btrfs]
? kmem_cache_alloc+0x24d/0x300
btree_submit_bio_hook+0x79/0xc0 [btrfs]
submit_one_bio+0x31/0x50 [btrfs]
read_extent_buffer_pages+0x2fe/0x450 [btrfs]
btree_read_extent_buffer_pages+0x7e/0x170 [btrfs]
walk_down_log_tree+0x343/0x690 [btrfs]
? walk_log_tree+0x3d/0x380 [btrfs]
walk_log_tree+0xf7/0x380 [btrfs]
? plist_requeue+0xf0/0xf0
? delete_node+0x4b/0x230
free_log_tree+0x4c/0x130 [btrfs]
? wait_log_commit+0x140/0x140 [btrfs]
btrfs_free_log+0x17/0x30 [btrfs]
btrfs_drop_and_free_fs_root+0xb0/0xd0 [btrfs]
btrfs_free_fs_roots+0x10c/0x190 [btrfs]
? do_raw_spin_unlock+0x49/0xc0
? _raw_spin_unlock+0x29/0x40
? release_extent_buffer+0x121/0x170 [btrfs]
close_ctree+0x289/0x2e6 [btrfs]
generic_shutdown_super+0x6c/0x110
kill_anon_super+0xe/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x3a/0x70
Reported-by: David Sterba <dsterba@suse.com>
Fixes: 8c38938c7b ("btrfs: move the root freeing stuff into btrfs_put_root")
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When cleaning pinned extents right before deleting an unused block group,
we check if there's still a previous transaction running and if so we
increment its reference count before using it for cleaning pinned ranges
in its pinned extents iotree. However we ended up never decrementing the
reference count after using the transaction, resulting in a memory leak.
Fix it by decrementing the reference count.
Fixes: fe119a6eeb ("btrfs: switch to per-transaction pinned extents")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This patch corrects the SPDX License Identifier style in
header file related to debugfs File System support.
For C header files Documentation/process/license-rules.rst
mandates C-like comments (opposed to C source files where
C++ style should be used).
Changes made by using a script provided by Joe Perches here:
https://lkml.org/lkml/2019/2/7/46.
Suggested-by: Joe Perches <joe@perches.com>
Signed-off-by: Nishad Kamdar <nishadkamdar@gmail.com>
Link: https://lore.kernel.org/r/20200419144852.GA9206@nishad
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The kernfs_node lockdep tracking is being done on kn->active, the
active reference count. The other reference count (kn->count) is not
tracked by lockdep. So change the lockdep name to reflect what it is
tracking.
Signed-off-by: Waiman Long <longman@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20200402171056.27871-1-longman@redhat.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
SMB2_open_init() expects a pre-initialised lease_key when opening a
file with a lease, so set pfid->lease_key prior to calling it in
open_shroot().
This issue was observed when performing some DFS failover tests and
the lease key was never randomly generated.
Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
CC: Stable <stable@vger.kernel.org>
This patch is basically fixing the lookup of tcons (DFS specific) during
reconnect (smb2pdu.c:__smb2_reconnect) to update their prefix paths.
Previously, we relied on the TCP_Server_Info pointer
(misc.c:tcp_super_cb) to determine which tcon to update the prefix path
We could not rely on TCP server pointer to determine which super block
to update the prefix path when reconnecting tcons since it might map
to different tcons that share same TCP connection.
Instead, walk through all cifs super blocks and compare their DFS full
paths with the tcon being updated to.
Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
This disables tcon re-use for DFS shares.
tcon->dfs_path stores the path that the tcon should connect to when
doing failing over.
If that tcon is used multiple times e.g. 2 mounts using it with
different prefixpath, each will need a different dfs_path but there is
only one tcon. The other solution would be to split the tcon in 2
tcons during failover but that is much harder.
tcons could not be shared with DFS in cifs.ko because in a
DFS namespace like:
//domain/dfsroot -> /serverA/dfsroot, /serverB/dfsroot
//serverA/dfsroot/link -> /serverA/target1/aa/bb
//serverA/dfsroot/link2 -> /serverA/target1/cc/dd
you can see that link and link2 are two DFS links that both resolve to
the same target share (/serverA/target1), so cifs.ko will only contain a
single tcon for both link and link2.
The problem with that is, if we (auto)mount "link" and "link2", cifs.ko
will only contain a single tcon for both DFS links so we couldn't
perform failover or refresh the DFS cache for both links because
tcon->dfs_path was set to either "link" or "link2", but not both --
which is wrong.
Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
The hidepid parameter values are becoming more and more and it becomes
difficult to remember what each new magic number means.
Backward compatibility is preserved since it is possible to specify
numerical value for the hidepid parameter. This does not break the
fsconfig since it is not possible to specify a numerical value through
it. All numeric values are converted to a string. The type
FSCONFIG_SET_BINARY cannot be used to indicate a numerical value.
Selftest has been added to verify this behavior.
Suggested-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Alexey Gladkov <gladkov.alexey@gmail.com>
Reviewed-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
This allows to hide all files and directories in the procfs that are not
related to tasks.
Signed-off-by: Alexey Gladkov <gladkov.alexey@gmail.com>
Reviewed-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
If "hidepid=4" mount option is set then do not instantiate pids that
we can not ptrace. "hidepid=4" means that procfs should only contain
pids that the caller can ptrace.
Signed-off-by: Djalal Harouni <tixxdz@gmail.com>
Signed-off-by: Alexey Gladkov <gladkov.alexey@gmail.com>
Reviewed-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
This patch allows to have multiple procfs instances inside the
same pid namespace. The aim here is lightweight sandboxes, and to allow
that we have to modernize procfs internals.
1) The main aim of this work is to have on embedded systems one
supervisor for apps. Right now we have some lightweight sandbox support,
however if we create pid namespacess we have to manages all the
processes inside too, where our goal is to be able to run a bunch of
apps each one inside its own mount namespace without being able to
notice each other. We only want to use mount namespaces, and we want
procfs to behave more like a real mount point.
2) Linux Security Modules have multiple ptrace paths inside some
subsystems, however inside procfs, the implementation does not guarantee
that the ptrace() check which triggers the security_ptrace_check() hook
will always run. We have the 'hidepid' mount option that can be used to
force the ptrace_may_access() check inside has_pid_permissions() to run.
The problem is that 'hidepid' is per pid namespace and not attached to
the mount point, any remount or modification of 'hidepid' will propagate
to all other procfs mounts.
This also does not allow to support Yama LSM easily in desktop and user
sessions. Yama ptrace scope which restricts ptrace and some other
syscalls to be allowed only on inferiors, can be updated to have a
per-task context, where the context will be inherited during fork(),
clone() and preserved across execve(). If we support multiple private
procfs instances, then we may force the ptrace_may_access() on
/proc/<pids>/ to always run inside that new procfs instances. This will
allow to specifiy on user sessions if we should populate procfs with
pids that the user can ptrace or not.
By using Yama ptrace scope, some restricted users will only be able to see
inferiors inside /proc, they won't even be able to see their other
processes. Some software like Chromium, Firefox's crash handler, Wine
and others are already using Yama to restrict which processes can be
ptracable. With this change this will give the possibility to restrict
/proc/<pids>/ but more importantly this will give desktop users a
generic and usuable way to specifiy which users should see all processes
and which users can not.
Side notes:
* This covers the lack of seccomp where it is not able to parse
arguments, it is easy to install a seccomp filter on direct syscalls
that operate on pids, however /proc/<pid>/ is a Linux ABI using
filesystem syscalls. With this change LSMs should be able to analyze
open/read/write/close...
In the new patch set version I removed the 'newinstance' option
as suggested by Eric W. Biederman.
Selftest has been added to verify new behavior.
Signed-off-by: Alexey Gladkov <gladkov.alexey@gmail.com>
Reviewed-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
The timestamp for access_time has double seconds granularity(There is no
10msIncrement field for access_time unlike create/modify_time).
exfat's atimes are restricted to only 2s granularity so after
we set an atime, round it down to the nearest 2s and set the
sub-second component of the timestamp to 0.
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
The s_time_gran superblock field indicates the on-disk nanosecond
granularity of timestamps, and for exfat that seems to be 10ms, so
set s_time_gran to 10000000ns. Without this, in-memory timestamps
change when they get re-read from disk.
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Unify access to boot sector via 'sbi->pbr_bh'.
This fixes vol_flags inconsistency at read failed in fs_set_vol_flags(),
and buffer_head leak in __exfat_fill_super().
Signed-off-by: Tetsuhiro Kohada <Kohada.Tetsuhiro@dc.MitsubishiElectric.co.jp>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
This adds the necessary MODULE_ALIAS_FS() to exfat so the module gets
automatically loaded when an exfat filesystem is mounted.
Signed-off-by: Thomas Backlund <tmb@mageia.org>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Discard support was always unconditionally disabled. Now it is disabled
only in the case when blk_queue_discard() returns false.
Signed-off-by: Pali Rohár <pali@kernel.org>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
If the core_pattern is set to "|" and any process segfaults then we get
a null pointer derefernce while trying to coredump. The call stack shows:
RIP: do_coredump+0x628/0x11c0
When the core_pattern has only "|" there is no use of trying the
coredump and we can check that while formating the corename and exit
with an error.
After this change I get:
format_corename failed
Aborting core
Fixes: 315c69261d ("coredump: split pipe command whitespace before expanding template")
Reported-by: Matthew Ruffell <matthew.ruffell@canonical.com>
Signed-off-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Paul Wise <pabs3@bonedaddy.net>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/20200416194612.21418-1-sudipm.mukherjee@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
remap_vmalloc_range() has had various issues with the bounds checks it
promises to perform ("This function checks that addr is a valid
vmalloc'ed area, and that it is big enough to cover the vma") over time,
e.g.:
- not detecting pgoff<<PAGE_SHIFT overflow
- not detecting (pgoff<<PAGE_SHIFT)+usize overflow
- not checking whether addr and addr+(pgoff<<PAGE_SHIFT) are the same
vmalloc allocation
- comparing a potentially wildly out-of-bounds pointer with the end of
the vmalloc region
In particular, since commit fc9702273e ("bpf: Add mmap() support for
BPF_MAP_TYPE_ARRAY"), unprivileged users can cause kernel null pointer
dereferences by calling mmap() on a BPF map with a size that is bigger
than the distance from the start of the BPF map to the end of the
address space.
This could theoretically be used as a kernel ASLR bypass, by using
whether mmap() with a given offset oopses or returns an error code to
perform a binary search over the possible address range.
To allow remap_vmalloc_range_partial() to verify that addr and
addr+(pgoff<<PAGE_SHIFT) are in the same vmalloc region, pass the offset
to remap_vmalloc_range_partial() instead of adding it to the pointer in
remap_vmalloc_range().
In remap_vmalloc_range_partial(), fix the check against
get_vm_area_size() by using size comparisons instead of pointer
comparisons, and add checks for pgoff.
Fixes: 833423143c ("[PATCH] mm: introduce remap_vmalloc_range()")
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: stable@vger.kernel.org
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Martin KaFai Lau <kafai@fb.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Yonghong Song <yhs@fb.com>
Cc: Andrii Nakryiko <andriin@fb.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: KP Singh <kpsingh@chromium.org>
Link: http://lkml.kernel.org/r/20200415222312.236431-1-jannh@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Dax related code already removed from this file.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jianpeng Ma <jianpeng.ma@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We use a spinlock while we are reading and accessing the destination address for a server.
We need to also use this spinlock to protect when we are modifying this address from
reconn_set_ipaddr().
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Use *foo makes the toolchain to think that this is an emphasis, causing
those warnings:
./fs/inode.c:1609: WARNING: Inline emphasis start-string without end-string.
./fs/inode.c:1609: WARNING: Inline emphasis start-string without end-string.
./fs/inode.c:1615: WARNING: Inline emphasis start-string without end-string.
So, use, instead, ``*foo``, in order to mark it as a literal block.
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Link: https://lore.kernel.org/r/e8da46a0e57f2af6d63a0c53665495075698e28a.1586881715.git.mchehab+huawei@kernel.org
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Several references got broken due to txt to ReST conversion.
Several of them can be automatically fixed with:
scripts/documentation-file-ref-check --fix
Reviewed-by: Mathieu Poirier <mathieu.poirier@linaro.org> # hwtracing/coresight/Kconfig
Reviewed-by: Paul E. McKenney <paulmck@kernel.org> # memory-barrier.txt
Acked-by: Alex Shi <alex.shi@linux.alibaba.com> # translations/zh_CN
Acked-by: Federico Vaga <federico.vaga@vaga.pv.it> # translations/it_IT
Acked-by: Marc Zyngier <maz@kernel.org> # kvm/arm64
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Link: https://lore.kernel.org/r/6f919ddb83a33b5f2a63b6b5f0575737bb2b36aa.1586881715.git.mchehab+huawei@kernel.org
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
With arm64 64-bit environments, there should never be a need for automatic
READ_IMPLIES_EXEC, as the architecture has always been execute-bit aware
(as in, the default memory protection should be NX unless a region
explicitly requests to be executable).
Suggested-by: Hector Marco-Gisbert <hecmargi@upv.es>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Link: https://lkml.kernel.org/r/20200327064820.12602-7-keescook@chromium.org
invalidate_partition and bdev_unhash_inode are always paired, and
invalidate_partition already does an icache lookup for the block device
inode. Piggy back on that to remove the inode from the hash.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The gendisk can be trivially deducted from the block_device.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This patch corrects the SPDX License Identifier style in header file
related to Btrfs File System support. For C header files
Documentation/process/license-rules.rst mandates C-like comments
(opposed to C source files where C++ style should be used).
Changes made by using a script provided by Joe Perches here:
https://lkml.org/lkml/2019/2/7/46.
Suggested-by: Joe Perches <joe@perches.com>
Signed-off-by: Nishad Kamdar <nishadkamdar@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
While trying to "dd" to the block device for a USB stick, I
encountered a hung task warning (blocked for > 120 seconds). I
managed to come up with an easy way to reproduce this on my system
(where /dev/sdb is the block device for my USB stick) with:
while true; do dd if=/dev/zero of=/dev/sdb bs=4M; done
With my reproduction here are the relevant bits from the hung task
detector:
INFO: task udevd:294 blocked for more than 122 seconds.
...
udevd D 0 294 1 0x00400008
Call trace:
...
mutex_lock_nested+0x40/0x50
__blkdev_get+0x7c/0x3d4
blkdev_get+0x118/0x138
blkdev_open+0x94/0xa8
do_dentry_open+0x268/0x3a0
vfs_open+0x34/0x40
path_openat+0x39c/0xdf4
do_filp_open+0x90/0x10c
do_sys_open+0x150/0x3c8
...
...
Showing all locks held in the system:
...
1 lock held by dd/2798:
#0: ffffff814ac1a3b8 (&bdev->bd_mutex){+.+.}, at: __blkdev_put+0x50/0x204
...
dd D 0 2798 2764 0x00400208
Call trace:
...
schedule+0x8c/0xbc
io_schedule+0x1c/0x40
wait_on_page_bit_common+0x238/0x338
__lock_page+0x5c/0x68
write_cache_pages+0x194/0x500
generic_writepages+0x64/0xa4
blkdev_writepages+0x24/0x30
do_writepages+0x48/0xa8
__filemap_fdatawrite_range+0xac/0xd8
filemap_write_and_wait+0x30/0x84
__blkdev_put+0x88/0x204
blkdev_put+0xc4/0xe4
blkdev_close+0x28/0x38
__fput+0xe0/0x238
____fput+0x1c/0x28
task_work_run+0xb0/0xe4
do_notify_resume+0xfc0/0x14bc
work_pending+0x8/0x14
The problem appears related to the fact that my USB disk is terribly
slow and that I have a lot of RAM in my system to cache things.
Specifically my writes seem to be happening at ~15 MB/s and I've got
~4 GB of RAM in my system that can be used for buffering. To write 4
GB of buffer to disk thus takes ~4000 MB / ~15 MB/s = ~267 seconds.
The 267 second number is a problem because in __blkdev_put() we call
sync_blockdev() while holding the bd_mutex. Any other callers who
want the bd_mutex will be blocked for the whole time.
The problem is made worse because I believe blkdev_put() specifically
tells other tasks (namely udev) to go try to access the device at right
around the same time we're going to hold the mutex for a long time.
Putting some traces around this (after disabling the hung task detector),
I could confirm:
dd: 437.608600: __blkdev_put() right before sync_blockdev() for sdb
udevd: 437.623901: blkdev_open() right before blkdev_get() for sdb
dd: 661.468451: __blkdev_put() right after sync_blockdev() for sdb
udevd: 663.820426: blkdev_open() right after blkdev_get() for sdb
A simple fix for this is to realize that sync_blockdev() works fine if
you're not holding the mutex. Also, it's not the end of the world if
you sync a little early (though it can have performance impacts).
Thus we can make a guess that we're going to need to do the sync and
then do it without holding the mutex. We still do one last sync with
the mutex but it should be much, much faster.
With this, my hung task warnings for my test case are gone.
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Guenter Roeck <groeck@chromium.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
commit 9635453572 ("fuse: reduce allocation size for splice_write")
changed size of bufs array, so BUG_ON which checks the index of the array
shold also be fixed.
[SzM: turn BUG_ON into WARN_ON]
Fixes: 9635453572 ("fuse: reduce allocation size for splice_write")
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
In virtiofs (unlike in regular fuse) processing of async replies is
serialized. This can result in a deadlock in rare corner cases when
there's a circular dependency between the completion of two or more async
replies.
Such a deadlock can be reproduced with xfstests:generic/503 if TEST_DIR ==
SCRATCH_MNT (which is a misconfiguration):
- Process A is waiting for page lock in worker thread context and blocked
(virtio_fs_requests_done_work()).
- Process B is holding page lock and waiting for pending writes to
finish (fuse_wait_on_page_writeback()).
- Write requests are waiting in virtqueue and can't complete because
worker thread is blocked on page lock (process A).
Fix this by creating a unique work_struct for each async reply that can
block (O_DIRECT read).
Fixes: a62a8ef9d9 ("virtio-fs: add virtiofs filesystem")
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
nfs3_set_acl keeps track of the acl it allocated locally to determine if an acl
needs to be released at the end. This results in a memory leak when the
function allocates an acl as well as a default acl. Fix by releasing acls
that differ from the acl originally passed into nfs3_set_acl.
Fixes: b7fa0554cf ("[PATCH] NFS: Add support for NFSv3 ACLs")
Reported-by: Xiyu Yang <xiyuyang19@fudan.edu.cn>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
If the credential returned by pnfs_prepare_layoutreturn()
does not match the credential of the RPC call, then we do
end up calling pnfs_send_layoutreturn() with that credential,
so don't free it!
Fixes: 44ea8dfce0 ("NFS/pnfs: Reference the layout cred in pnfs_prepare_layoutreturn()")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
We require that any outstanding layout return completes before we can
free up the inode so that the layout itself can be freed.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
When testing io_uring IORING_FEAT_FAST_POLL feature, I got below panic:
BUG: kernel NULL pointer dereference, address: 0000000000000030
PGD 0 P4D 0
Oops: 0000 [#1] SMP PTI
CPU: 5 PID: 2154 Comm: io_uring_echo_s Not tainted 5.6.0+ #359
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS rel-1.11.1-0-g0551a4be2c-prebuilt.qemu-project.org 04/01/2014
RIP: 0010:io_wq_submit_work+0xf/0xa0
Code: ff ff ff be 02 00 00 00 e8 ae c9 19 00 e9 58 ff ff ff 66 0f 1f
84 00 00 00 00 00 0f 1f 44 00 00 41 54 49 89 fc 55 53 48 8b 2f <8b>
45 30 48 8d 9d 48 ff ff ff 25 01 01 00 00 83 f8 01 75 07 eb 2a
RSP: 0018:ffffbef543e93d58 EFLAGS: 00010286
RAX: ffffffff84364f50 RBX: ffffa3eb50f046b8 RCX: 0000000000000000
RDX: ffffa3eb0efc1840 RSI: 0000000000000006 RDI: ffffa3eb50f046b8
RBP: 0000000000000000 R08: 00000000fffd070d R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffffa3eb50f046b8
R13: ffffa3eb0efc2088 R14: ffffffff85b69be0 R15: ffffa3eb0effa4b8
FS: 00007fe9f69cc4c0(0000) GS:ffffa3eb5ef40000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000030 CR3: 0000000020410000 CR4: 00000000000006e0
Call Trace:
task_work_run+0x6d/0xa0
do_exit+0x39a/0xb80
? get_signal+0xfe/0xbc0
do_group_exit+0x47/0xb0
get_signal+0x14b/0xbc0
? __x64_sys_io_uring_enter+0x1b7/0x450
do_signal+0x2c/0x260
? __x64_sys_io_uring_enter+0x228/0x450
exit_to_usermode_loop+0x87/0xf0
do_syscall_64+0x209/0x230
entry_SYSCALL_64_after_hwframe+0x49/0xb3
RIP: 0033:0x7fe9f64f8df9
Code: Bad RIP value.
task_work_run calls io_wq_submit_work unexpectedly, it's obvious that
struct callback_head's func member has been changed. After looking into
codes, I found this issue is still due to the union definition:
union {
/*
* Only commands that never go async can use the below fields,
* obviously. Right now only IORING_OP_POLL_ADD uses them, and
* async armed poll handlers for regular commands. The latter
* restore the work, if needed.
*/
struct {
struct callback_head task_work;
struct hlist_node hash_node;
struct async_poll *apoll;
};
struct io_wq_work work;
};
When task_work_run has multiple work to execute, the work that calls
io_poll_remove_all() will do req->work restore for non-poll request
always, but indeed if a non-poll request has been added to a new
callback_head, subsequent callback will call io_async_task_func() to
handle this request, that means we should not do the restore work
for such non-poll request. Meanwhile in io_async_task_func(), we should
drop submit ref when req has been canceled.
Fix both issues.
Fixes: b1f573bd15 ("io_uring: restore req->work when canceling poll request")
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Use io_double_put_req()
Signed-off-by: Jens Axboe <axboe@kernel.dk>
instead of clockid numbers. The usability nuisance of numbers was noticed
by Michael when polishing the man page.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl6cVQsTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoWBjEAC0dCUHKDLoG0FeyG4tb4FEBW2iTqM8
UFirH26K18s8QSePdvfJlaxtN2SdfNZG7UgYN7wz1fDFQy05zTz7Rek8UrDuu3rh
mVph/UZtUJl+6ypW2Lw9x5RWpT5yzay2iowUyBPnNxU9F/0uRKvXQFju3L83Lo/z
Z4ni7gVEw87dQi5E74tEv6iaydgPuCBpGxoMahotnHyclqMjA0QuAK6nhN5ZTcAn
senoorS/VqkSF5qEvIUwe7+F+kkMbwQryT7merJyNwh/F49xTTXRyBmiys1MF8Og
MTEvldXKy2pCh2UfRa/x84WWwOUVNivTXdIXjhalsblczL0j1z9MsQ8b3AOXOiLf
S+/Ntbb2dGo4qE22jekMwZ54Pm4x5NzChCU8+3pvd6IrPWZKi6vue74Kd0RNHQg/
0kWOlZnIP2ArVW0bFqV6jhMYkjmVdK6gm7cUpFV66L2H8zbfFuc4OlxJYEFYivye
9Yck+rFQmMwA15ZXYIpggkd7Rf/5CGF1CiMBAvP/ILubpgbJqnn6/tGByq8tDKdy
mqXX+NHF0M/7rJd5vr7wP6p3E5nQ9l/41rh9ii9EDLXf4jsWVO3EyobJ7fFHwprs
5tTWGxVJymUQLq/LQPXOVVENGK+ZsXXNGn/4n8IOVroeypxADTGyhtSh122kFFhv
jPcVHqpBUd0g4Q==
=slEk
-----END PGP SIGNATURE-----
Merge tag 'timers-urgent-2020-04-19' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull time namespace fix from Thomas Gleixner:
"An update for the proc interface of time namespaces: Use symbolic
names instead of clockid numbers. The usability nuisance of numbers
was noticed by Michael when polishing the man page"
* tag 'timers-urgent-2020-04-19' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
proc, time/namespace: Show clock symbolic names in /proc/pid/timens_offsets
generic/388 in data=journal mode, removing some BUG_ON's, and cleaning
up some compiler warnings.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAl6cj80ACgkQ8vlZVpUN
gaOx5Qf/XY7JUEp1nGgcdZyUd8uho3dKkG4TuUU5PvGsiDb4ozGsyU51q2LnOHWF
uzDJaE03z5uc1i8C9mQRLzjzaOC8B8kQZuKfkcQ/xI4CS3cG4qRdeNdHUz5QyfhK
5THDzr2z1tuWDuhlp+jCPjCz1fJowHxva/7ktf1OrMVEErYlZXT8CPLIRBCeuuCX
/07/8tJ5jJoqpI3kmy1jFotMEhIBE0vixf+sfcp2RWjdb0/1LH2JPWCytX+hhSFR
SadWDvTIvVy/rMahLHgc/VyPP47QwLWzBmLm9CdyxmDeUaM4Qwx8Zfog4+8g78wl
IvSuHRDdTYnOO35Qbzjl2wanhzCiQQ==
=qzEh
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 fixes from Ted Ts'o:
"Miscellaneous bug fixes and cleanups for ext4, including a fix for
generic/388 in data=journal mode, removing some BUG_ON's, and cleaning
up some compiler warnings"
* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: convert BUG_ON's to WARN_ON's in mballoc.c
ext4: increase wait time needed before reuse of deleted inode numbers
ext4: remove set but not used variable 'es' in ext4_jbd2.c
ext4: remove set but not used variable 'es'
ext4: do not zeroout extents beyond i_disksize
ext4: fix return-value types in several function comments
ext4: use non-movable memory for superblock readahead
ext4: use matching invalidatepage in ext4_writepage
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAl6b28kACgkQiiy9cAdy
T1EZ+wwAqHCqrIgelrLFiQwHkMg1KQMBnul3mBuCJ6qxGTyzSVLWBYsfHabLqWmC
Ann71PFygGc+5R195CcMZ/RAHGTTEbwJP5s/wGwm3wUfqImLPOpMr/jd8rv9GvE2
atsthBnFlPE+dY5BD9fr7JIWpZxE3yevCtVifyPjA879zzqIoT9lkFcjCNTqV37l
tRe4JyObxKSrPUUELC30XPFoBGT/Cgcoz+I0JFL+gz8Yt9CEBXL2DKdnZJERbIpm
t+yjKAYC9QN5eF7kew8Fide4LohH7jL2EAmllWKUTRH1pHNEKgyMbSMm3F2RzoXG
0R/70stukgXemlsCD2+BSXDZ3smPHwoKq+FftYanHd1pamOQHJMWcQ/tCk8gg9/Z
Qq0wwBBbVP6HOMwoDOOW53/lwiU/hoR2Re3jy7K0DOGJAFNkxo98oXfT7HJfmKeW
q1LQvKR7ch3iFaOUkg/Tv+8o3inUuYLUgegCPvM6RkGkG0Mqs8SEkA9AyyqFmBnG
kY1K83Ct
=G+Rl
-----END PGP SIGNATURE-----
Merge tag '5.7-rc-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fixes from Steve French:
"Three small smb3 fixes: two debug related (helping network tracing for
SMB2 mounts, and the other removing an unintended debug line on
signing failures), and one fixing a performance problem with 64K
pages"
* tag '5.7-rc-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
smb3: remove overly noisy debug line in signing errors
cifs: improve read performance for page size 64KB & cache=strict & vers=2.1+
cifs: dump the session id and keys also for SMB2 sessions
- Fix a partially uninitialized variable.
- Teach the background gc threads to apply for fsfreeze protection.
- Fix some scaling problems when multiple threads try to flush the
filesystem when we're about to hit ENOSPC.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAl6ZSVQACgkQ+H93GTRK
tOuPfg//XQ9HX0VAd4xYM3uAr50gNIUPMfOjrlUdZfnj+DOxJDb7IbN9t6+NXYU7
dfVdUeSPy7vwC/JUyVVgBbTfCX1CnQoeNWtg6EAdEF0msJIlbCH4sm+pI2Vofnqp
1VDT9fU1cmrtz/dtS6teJT49P/uCPCmKRGAcnIJn/E7FZUiDS0je2iwV8jbJtAyo
xfTHO39t5jBxBRBLRSuJUzYYvvW1ix3zheebLUQZMolnKRkKafWPja1I2N2lRt23
VnXwEjgFpqkT2OcDk5jljkJLbImHmNNVTc6J7SomtxZfWZDwvVfIHgMUC1OsyvW3
tJCp/22xAqqkBQS6Gx6qoXQubnqsfka86krq8C/juz5q5Doc7TPClpc4eyY/XZ0+
q3/67K9Z5MbudUQRmDBrNqmBBiI93qVB6DmeDLvQbBIIBDNFcWTRar0WB+/s/i3S
V4BMTyGfwU7u6ZSVzx+W619uLfgwH1mG4uzDg4xk4b4Uia3+/3zjJkh2WzrT98eq
N+jwQr5MbWyxmjbFtcsO6ZUqlh7X5RXmjFXBAZjauVwCQAaSvnHR2SdyAvUrD2bG
V2ujYVJ8dAJjXeS/9ILWW+oo/tQTlmmUE898oP6ZljuSYj/ONLqM4AMUoR4Ie1Vp
BTuRr0VkAoJH2yTK/OTXYe6mBCFSyrp2l3CEC7EDLrCRQQInbRo=
=YkcH
-----END PGP SIGNATURE-----
Merge tag 'xfs-5.7-fixes-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs fixes from Darrick Wong:
"The three commits here fix some livelocks and other clashes with
fsfreeze, a potential corruption problem, and a minor race between
processes freeing and allocating space when the filesystem is near
ENOSPC.
Summary:
- Fix a partially uninitialized variable.
- Teach the background gc threads to apply for fsfreeze protection.
- Fix some scaling problems when multiple threads try to flush the
filesystem when we're about to hit ENOSPC"
* tag 'xfs-5.7-fixes-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: move inode flush to the sync workqueue
xfs: fix partially uninitialized structure in xfs_reflink_remap_extent
xfs: acquire superblock freeze protection on eofblocks scans
free_more_memory func has been completely removed in commit bc48f001de
("buffer: eliminate the need to call free_more_memory() in __getblk_slow()")
So comment and `WB_REASON_FREE_MORE_MEM` reason about free_more_memory
are no longer needed.
Fixes: bc48f001de ("buffer: eliminate the need to call free_more_memory() in __getblk_slow()")
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Zhiqiang Liu <liuzhiqiang26@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull proc fix from Eric Biederman:
"While running syzbot happened to spot one more oversight in my rework
of proc_flush_task.
The fields proc_self and proc_thread_self were not being reinitialized
when proc was unmounted, which could cause problems if the mount of
proc fails"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
proc: Handle umounts cleanly
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl6ZxtgQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpg97EACvs/Vm50z9qAr9qJQKnWOpxUf9tYLNFhf2
olOe8No4DgDB5kAvUdexozvV/QMRXMN2SI9CpwXJ98+ZTt/VU8dcDt1hM5DooBRL
VWUADVeojRR362ijqdL1x7wt41pMLFt5UiAFE2VdAH04jcTV7VAVl15/ZvEhGSOX
o86xsR06IqjhHPGQnZvY34Qyk3AKYoA9y/doKhIrTyfgaXiHsMMJPZrQhgEuPI9C
D3i1/51FCJdKTm9c0hTz/CkhNxYvRmz91Ywjnm8wyZwXBZJJHm4ZDpDpbXijyLda
clkLdmnnD1fkm1mkId/55sCS//iR8Um9XXsejQ4W6iSaY7OLfqyVXfuct3Rbwi2D
ut85XvZFOiCP9M/5VaB9qFIDb9VF1nGC1qptYEmt8YrmgD+0n+4aPq83/2a+KYAs
7RQcH6twpDZpR/HDLcAcq9zpMz9B1O2QsokgXUgkZz0QVGQqZgGXYeZgMtUud2Rl
i3UGrmtl/Pp23A1z2NT0sPCZPopo0nVLu2OLZwL4t5PAJbV1CrIp2Q64XzOM+56U
3ExibVR7/s0BHjBtSPS//vSphGR6UT1NLzowtEf94jBdxIdvoC5eztycmxcCBtrc
TNcOjKsYDRDXiNS5NDnQbrc8xLXCd4mXhyRphWt+vNp/5TmA4tsqe9bZzUDm+4v/
CYr0nGJf7Q==
=n9hr
-----END PGP SIGNATURE-----
Merge tag 'io_uring-5.7-2020-04-17' of git://git.kernel.dk/linux-block
Pull io_uring fixes from Jens Axboe:
- wrap up the init/setup cleanup (Pavel)
- fix some issues around deferral sequences (Pavel)
- fix splice punt check using the wrong struct file member
- apply poll re-arm logic for pollable retry too
- pollable retry should honor cancelation
- fix setup time error handling syzbot reported crash
- restore work state when poll is canceled
* tag 'io_uring-5.7-2020-04-17' of git://git.kernel.dk/linux-block:
io_uring: don't count rqs failed after current one
io_uring: kill already cached timeout.seq_offset
io_uring: fix cached_sq_head in io_timeout()
io_uring: only post events in io_poll_remove_all() if we completed some
io_uring: io_async_task_func() should check and honor cancelation
io_uring: check for need to re-wait in polled async handling
io_uring: correct O_NONBLOCK check for splice punt
io_uring: restore req->work when canceling poll request
io_uring: move all request init code in one place
io_uring: keep all sqe->flags in req->flags
io_uring: early submission req fail code
io_uring: track mm through current->mm
io_uring: remove obsolete @mm_fault
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl6ZrWEACgkQxWXV+ddt
WDsQohAAhcAaSc/QoJ5g+vI/x5YQbo6KzAVyKbUbJdFlUIzh5uVBjJmpy4IQehcG
QQGoqj5mAO9DaWHH5wGoR9xBRKNDjc5Sh86IjaKrPNNyDoDWMuUKs5bqZojtY819
4zZyZaKUGQ8HD0BwKEMCMM30BWyXjj7MkngJtzO5/qj43cwSyIORDk8a4DDLwImr
FPdArpdUshRlt5aEwosTV4X/zRQ5kfQF8vOYd0TopfXAvKF3g6PZ7YmrHzfmVQGK
hdmqfsKY3gMhcNwi7nCTfaHN6qRd/9Bec+Z3ZVtZPsEoIPMZOyqgw8yU9NRjMj4O
GhmsLA9onbEYYrSAaGP/O7nEYr2M3MS0vJ0KnOobpOJaSMPZFUOfouac7u8l9ZZU
KQ5aSJo2mx9E6/VSesoP19TafKHJYx79J8M71tStVrXFCtT6yLkWzvsxj4gNacJc
2HFNEN/8zvDuWCy9s0JZnSQZ+nv01EuCjZ60IoMuS51lh9EcZORu6kKX33pp7UJS
WOANssZvunc1AaW0HxT0GME4V0RJa8yoKRFIhV2bLZFGGo2dwvom+v2/1kJy+fW/
LyfEA9973lyWuhqedB08r+dTIgEN5MEOwetsxQua2iC/P8VnpmU7rfmBk/LlTg7j
dN+O39+Ms6edsk+K3pxSedRU79XgdJ3muA6fNPmILUJBczQriKU=
=s26U
-----END PGP SIGNATURE-----
Merge tag 'for-5.7-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fix from David Sterba:
"A regression fix for a warning caused by running balance and snapshot
creation in parallel"
* tag 'for-5.7-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix setting last_trans for reloc roots
Currently, after the forward channel connection goes away,
backchannel operations are causing soft lockups on the server
because call_transmit_status's SOFTCONN logic ignores ENOTCONN.
Such backchannel Calls are aggressively retried until the client
reconnects.
Backchannel Calls should use RPC_TASK_NOCONNECT rather than
RPC_TASK_SOFTCONN. If there is no forward connection, the server is
not capable of establishing a connection back to the client, thus
that backchannel request should fail before the server attempts to
send it. Commit 58255a4e3c ("NFSD: NFSv4 callback client should
use RPC_TASK_SOFTCONN") was merged several years before
RPC_TASK_NOCONNECT was available.
Because setup_callback_client() explicitly sets NOPING, the NFSv4.0
callback connection depends on the first callback RPC to initiate
a connection to the client. Thus NFSv4.0 needs to continue to use
RPC_TASK_SOFTCONN.
Suggested-by: Trond Myklebust <trondmy@hammerspace.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Cc: <stable@vger.kernel.org> # v4.20+
When a discard_cmd needs to be split due to dpolicy->max_requests, then
for the remaining length it will be either merged into another cmd or a
new discard_cmd will be created. In this case, there is double
accounting of dcc->undiscard_blks for the remaining len, due to which
it shows incorrect value in stats.
Signed-off-by: Sahitya Tummala <stummala@codeaurora.org>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In f2fs_ra_meta_pages(), if f2fs_submit_page_bio() failed, we need to
unlock page, fix it.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In case a discard_cmd is split into several bios, the dc->error
must not be overwritten once an error is reported by a bio. Also,
move it under dc->lock.
Signed-off-by: Sahitya Tummala <stummala@codeaurora.org>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
F2FS already has a default timeout of 5 secs for discards that
can be issued during umount, but it can take more than the 5 sec
timeout if the underlying UFS device queue is already full and there
are no more available free tags to be used. Fix this by submitting a
small batch of discard requests so that it won't cause the device
queue to be full at any time and thus doesn't incur its wait time
in the umount context.
Signed-off-by: Sahitya Tummala <stummala@codeaurora.org>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Added a tracepoint to see iostat of f2fs. Default period of that
is 3 second. This tracepoint can be used to be monitoring
I/O statistics periodically.
Signed-off-by: Daeho Jeong <daehojeong@google.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
No one checks the return value of debugfs_create_u32(), as it's not
needed, so make the return value void, so that no one tries to do so in
the future.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://lore.kernel.org/r/20200416145448.GA1380878@kroah.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
I made a mistake with my previous fix, I assumed that we didn't need to
mess with the reloc roots once we were out of the part of relocation where
we are actually moving the extents.
The subtle thing that I missed is that btrfs_init_reloc_root() also
updates the last_trans for the reloc root when we do
btrfs_record_root_in_trans() for the corresponding fs_root. I've added a
comment to make sure future me doesn't make this mistake again.
This showed up as a WARN_ON() in btrfs_copy_root() because our
last_trans didn't == the current transid. This could happen if we
snapshotted a fs root with a reloc root after we set
rc->create_reloc_tree = 0, but before we actually merge the reloc root.
Worth mentioning that the regression produced the following warning
when running snapshot creation and balance in parallel:
BTRFS info (device sdc): relocating block group 30408704 flags metadata|dup
------------[ cut here ]------------
WARNING: CPU: 0 PID: 12823 at fs/btrfs/ctree.c:191 btrfs_copy_root+0x26f/0x430 [btrfs]
CPU: 0 PID: 12823 Comm: btrfs Tainted: G W 5.6.0-rc7-btrfs-next-58 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
RIP: 0010:btrfs_copy_root+0x26f/0x430 [btrfs]
RSP: 0018:ffffb96e044279b8 EFLAGS: 00010202
RAX: 0000000000000009 RBX: ffff9da70bf61000 RCX: ffffb96e04427a48
RDX: ffff9da733a770c8 RSI: ffff9da70bf61000 RDI: ffff9da694163818
RBP: ffff9da733a770c8 R08: fffffffffffffff8 R09: 0000000000000002
R10: ffffb96e044279a0 R11: 0000000000000000 R12: ffff9da694163818
R13: fffffffffffffff8 R14: ffff9da6d2512000 R15: ffff9da714cdac00
FS: 00007fdeacf328c0(0000) GS:ffff9da735e00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055a2a5b8a118 CR3: 00000001eed78002 CR4: 00000000003606f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
? create_reloc_root+0x49/0x2b0 [btrfs]
? kmem_cache_alloc_trace+0xe5/0x200
create_reloc_root+0x8b/0x2b0 [btrfs]
btrfs_reloc_post_snapshot+0x96/0x5b0 [btrfs]
create_pending_snapshot+0x610/0x1010 [btrfs]
create_pending_snapshots+0xa8/0xd0 [btrfs]
btrfs_commit_transaction+0x4c7/0xc50 [btrfs]
? btrfs_mksubvol+0x3cd/0x560 [btrfs]
btrfs_mksubvol+0x455/0x560 [btrfs]
__btrfs_ioctl_snap_create+0x15f/0x190 [btrfs]
btrfs_ioctl_snap_create_v2+0xa4/0xf0 [btrfs]
? mem_cgroup_commit_charge+0x6e/0x540
btrfs_ioctl+0x12d8/0x3760 [btrfs]
? do_raw_spin_unlock+0x49/0xc0
? _raw_spin_unlock+0x29/0x40
? __handle_mm_fault+0x11b3/0x14b0
? ksys_ioctl+0x92/0xb0
ksys_ioctl+0x92/0xb0
? trace_hardirqs_off_thunk+0x1a/0x1c
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x5c/0x280
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7fdeabd3bdd7
Fixes: 2abc726ab4 ("btrfs: do not init a reloc root if we aren't relocating")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This patch introduces a way to attach REQ_META/FUA explicitly
to all the data writes given temperature.
-> attach REQ_FUA to Hot Data writes
-> attach REQ_FUA to Hot|Warm Data writes
-> attach REQ_FUA to Hot|Warm|Cold Data writes
-> attach REQ_FUA to Hot|Warm|Cold Data writes as well as
REQ_META to Hot Data writes
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
for invalid pointer dereference and uninitialized variable use on
asynchronous create and unlink error paths.
-----BEGIN PGP SIGNATURE-----
iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAl6YkKMTHGlkcnlvbW92
QGdtYWlsLmNvbQAKCRBKf944AhHzi9mfCACM7yEZA3rYEUzoUVO2MfaZOnbPVyFe
0tRZB2Fcu5nzJLibeTMX8e0OKb0KtEpPcJXw8EMIe/IRA4ahUUCHp7cCe+jIoPuX
OB9JLOD0tgQJ1jt7hAd7SZFkN/iCJ/jpF/9kSD/8cLHUmPy2g2QzUtSeEtuRfsXD
8jOxW9heOIFVpysUC8HHsRO+b7yPL8AguG8WXNoDItL9uB1DmrgkxOhh/ijqPxVz
F9Du3WlEPzdOTheU6pxtTAMdds4mq3ltBnUElCevR4qY0og4YaqDwnGf0pJlzSuN
nVvAhSSOGbVdvkjzTaPo2BF5rEYXNm6Hln0HGHsUubnDlFZ200GbFEJk
=b1jf
-----END PGP SIGNATURE-----
Merge tag 'ceph-for-5.7-rc2' of git://github.com/ceph/ceph-client
Pull ceph fixes from Ilya Dryomov:
- a set of patches for a deadlock on "rbd map" error path
- a fix for invalid pointer dereference and uninitialized variable use
on asynchronous create and unlink error paths.
* tag 'ceph-for-5.7-rc2' of git://github.com/ceph/ceph-client:
ceph: fix potential bad pointer deref in async dirops cb's
rbd: don't mess with a page vector in rbd_notify_op_lock()
rbd: don't test rbd_dev->opts in rbd_dev_image_release()
rbd: call rbd_dev_unprobe() after unwatching and flushing notifies
rbd: avoid a deadlock on header_rwsem when flushing notifies
A dump_stack call for signature related errors can be too noisy
and not of much value in debugging such problems.
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Shyam Prasad N <nspmangalore@gmail.com>
Move the inode dirty data flushing to a workqueue so that multiple
threads can take advantage of a single thread's flushing work. The
ratelimiting technique used in bdd4ee4 was not successful, because
threads that skipped the inode flush scan due to ratelimiting would
ENOSPC early, which caused occasional (but noticeable) changes in
behavior and sporadic fstest regressions.
Therefore, make all the writer threads wait on a single inode flush,
which eliminates both the stampeding hordes of flushers and the small
window in which a write could fail with ENOSPC because it lost the
ratelimit race after even another thread freed space.
Fixes: c6425702f2 ("xfs: ratelimit inode flush on buffered write ENOSPC")
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Michael Kerrisk suggested to replace numeric clock IDs with symbolic names.
Now the content of these files looks like this:
$ cat /proc/774/timens_offsets
monotonic 864000 0
boottime 1728000 0
For setting offsets, both representations of clocks (numeric and symbolic)
can be used.
As for compatibility, it is acceptable to change things as long as
userspace doesn't care. The format of timens_offsets files is very new and
there are no userspace tools yet which rely on this format.
But three projects crun, util-linux and criu rely on the interface of
setting time offsets and this is why it's required to continue supporting
the numeric clock IDs on write.
Fixes: 04a8682a71 ("fs/proc: Introduce /proc/pid/timens_offsets")
Suggested-by: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Michael Kerrisk <mtk.manpages@gmail.com>
Acked-by: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Dmitry Safonov <0x7f454c46@gmail.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20200411154031.642557-1-avagin@gmail.com
syzbot writes:
> KASAN: use-after-free Read in dput (2)
>
> proc_fill_super: allocate dentry failed
> ==================================================================
> BUG: KASAN: use-after-free in fast_dput fs/dcache.c:727 [inline]
> BUG: KASAN: use-after-free in dput+0x53e/0xdf0 fs/dcache.c:846
> Read of size 4 at addr ffff88808a618cf0 by task syz-executor.0/8426
>
> CPU: 0 PID: 8426 Comm: syz-executor.0 Not tainted 5.6.0-next-20200412-syzkaller #0
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
> Call Trace:
> __dump_stack lib/dump_stack.c:77 [inline]
> dump_stack+0x188/0x20d lib/dump_stack.c:118
> print_address_description.constprop.0.cold+0xd3/0x315 mm/kasan/report.c:382
> __kasan_report.cold+0x35/0x4d mm/kasan/report.c:511
> kasan_report+0x33/0x50 mm/kasan/common.c:625
> fast_dput fs/dcache.c:727 [inline]
> dput+0x53e/0xdf0 fs/dcache.c:846
> proc_kill_sb+0x73/0xf0 fs/proc/root.c:195
> deactivate_locked_super+0x8c/0xf0 fs/super.c:335
> vfs_get_super+0x258/0x2d0 fs/super.c:1212
> vfs_get_tree+0x89/0x2f0 fs/super.c:1547
> do_new_mount fs/namespace.c:2813 [inline]
> do_mount+0x1306/0x1b30 fs/namespace.c:3138
> __do_sys_mount fs/namespace.c:3347 [inline]
> __se_sys_mount fs/namespace.c:3324 [inline]
> __x64_sys_mount+0x18f/0x230 fs/namespace.c:3324
> do_syscall_64+0xf6/0x7d0 arch/x86/entry/common.c:295
> entry_SYSCALL_64_after_hwframe+0x49/0xb3
> RIP: 0033:0x45c889
> Code: ad b6 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 7b b6 fb ff c3 66 2e 0f 1f 84 00 00 00 00
> RSP: 002b:00007ffc1930ec48 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
> RAX: ffffffffffffffda RBX: 0000000001324914 RCX: 000000000045c889
> RDX: 0000000020000140 RSI: 0000000020000040 RDI: 0000000000000000
> RBP: 000000000076bf00 R08: 0000000000000000 R09: 0000000000000000
> R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000003
> R13: 0000000000000749 R14: 00000000004ca15a R15: 0000000000000013
Looking at the code now that it the internal mount of proc is no
longer used it is possible to unmount proc. If proc is unmounted
the fields of the pid namespace that were used for filesystem
specific state are not reinitialized.
Which means that proc_self and proc_thread_self can be pointers to
already freed dentries.
The reported user after free appears to be from mounting and
unmounting proc followed by mounting proc again and using error
injection to cause the new root dentry allocation to fail. This in
turn results in proc_kill_sb running with proc_self and
proc_thread_self still retaining their values from the previous mount
of proc. Then calling dput on either proc_self of proc_thread_self
will result in double put. Which KASAN sees as a use after free.
Solve this by always reinitializing the filesystem state stored
in the struct pid_namespace, when proc is unmounted.
Reported-by: syzbot+72868dd424eb66c6b95f@syzkaller.appspotmail.com
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Fixes: 69879c01a0 ("proc: Remove the now unnecessary internal mount of proc")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
If the in-core buddy bitmap gets corrupted (or out of sync with the
block bitmap), issue a WARN_ON and try to recover. In most cases this
involves skipping trying to allocate out of a particular block group.
We can end up declaring the file system corrupted, which is fair,
since the file system probably should be checked before we proceed any
further.
Link: https://lore.kernel.org/r/20200414035649.293164-1-tytso@mit.edu
Google-Bug-Id: 34811296
Google-Bug-Id: 34639169
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Current wait times have proven to be too short to protect against inode
reuses that lead to metadata inconsistencies.
Now that we will retry the inode allocation if we can't find any
recently deleted inodes, it's a lot safer to increase the recently
deleted time from 5 seconds to a minute.
Link: https://lore.kernel.org/r/20200414023925.273867-1-tytso@mit.edu
Google-Bug-Id: 36602237
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Fix the following gcc warning:
fs/ext4/ext4_jbd2.c:341:30: warning: variable 'es' set but not used [-Wunused-but-set-variable]
struct ext4_super_block *es;
^~
Fixes: 2ea2fc775321 ("ext4: save all error info in save_error_info() and drop ext4_set_errno()")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Jason Yan <yanaijie@huawei.com>
Link: https://lore.kernel.org/r/20200402034759.29957-1-yanaijie@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Fix the following gcc warning:
fs/ext4/super.c:599:27: warning: variable 'es' set but not used [-Wunused-but-set-variable]
struct ext4_super_block *es;
^~
Fixes: 2ea2fc775321 ("ext4: save all error info in save_error_info() and drop ext4_set_errno()")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Jason Yan <yanaijie@huawei.com>
Link: https://lore.kernel.org/r/20200402033939.25303-1-yanaijie@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
We do not want to create initialized extents beyond end of file because
for e2fsck it is impossible to distinguish them from a case of corrupted
file size / extent tree and so it complains like:
Inode 12, i_size is 147456, should be 163840. Fix? no
Code in ext4_ext_convert_to_initialized() and
ext4_split_convert_extents() try to make sure it does not create
initialized extents beyond inode size however they check against
inode->i_size which is wrong. They should instead check against
EXT4_I(inode)->i_disksize which is the current inode size on disk.
That's what e2fsck is going to see in case of crash before all dirty
data is written. This bug manifests as generic/456 test failure (with
recent enough fstests where fsx got fixed to properly pass
FALLOC_KEEP_SIZE_FL flags to the kernel) when run with dioread_lock
mount option.
CC: stable@vger.kernel.org
Fixes: 21ca087a38 ("ext4: Do not zero out uninitialized extents beyond i_size")
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Link: https://lore.kernel.org/r/20200331105016.8674-1-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
The documentation comments for ext4_read_block_bitmap_nowait and
ext4_read_inode_bitmap describe them as returning NULL on error, but
they return an ERR_PTR on error; update the documentation to match.
The documentation comment for ext4_wait_block_bitmap describes it as
returning 1 on error, but it returns -errno on error; update the
documentation to match.
Signed-off-by: Josh Triplett <josh@joshtriplett.org>
Reviewed-by: Ritesh Harani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/60a3f4996f4932c45515aaa6b75ca42f2a78ec9b.1585512514.git.josh@joshtriplett.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Since commit a8ac900b81 ("ext4: use non-movable memory for the
superblock") buffers for ext4 superblock were allocated using
the sb_bread_unmovable() helper which allocated buffer heads
out of non-movable memory blocks. It was necessarily to not block
page migrations and do not cause cma allocation failures.
However commit 85c8f176a6 ("ext4: preload block group descriptors")
broke this by introducing pre-reading of the ext4 superblock.
The problem is that __breadahead() is using __getblk() underneath,
which allocates buffer heads out of movable memory.
It resulted in page migration failures I've seen on a machine
with an ext4 partition and a preallocated cma area.
Fix this by introducing sb_breadahead_unmovable() and
__breadahead_gfp() helpers which use non-movable memory for buffer
head allocations and use them for the ext4 superblock readahead.
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Fixes: 85c8f176a6 ("ext4: preload block group descriptors")
Signed-off-by: Roman Gushchin <guro@fb.com>
Link: https://lore.kernel.org/r/20200229001411.128010-1-guro@fb.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Run generic/388 with journal data mode sometimes may trigger the warning
in ext4_invalidatepage. Actually, we should use the matching invalidatepage
in ext4_writepage.
Signed-off-by: yangerkun <yangerkun@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20200226041002.13914-1-yangerkun@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Found a read performance issue when linux kernel page size is 64KB.
If linux kernel page size is 64KB and mount options cache=strict &
vers=2.1+, it does not support cifs_readpages(). Instead, it is using
cifs_readpage() and cifs_read() with maximum read IO size 16KB, which is
much slower than read IO size 1MB when negotiated SMB 2.1+. Since modern
SMB server supported SMB 2.1+ and Max Read Size can reach more than 64KB
(for example 1MB ~ 8MB), this patch check max_read instead of maxBuf to
determine whether server support readpages() and improve read performance
for page size 64KB & cache=strict & vers=2.1+, and for SMB1 it is more
cleaner to initialize server->max_read to server->maxBuf.
The client is a linux box with linux kernel 4.2.8,
page size 64KB (CONFIG_ARM64_64K_PAGES=y),
cpu arm 1.7GHz, and use mount.cifs as smb client.
The server is another linux box with linux kernel 4.2.8,
share a file '10G.img' with size 10GB,
and use samba-4.7.12 as smb server.
The client mount a share from the server with different
cache options: cache=strict and cache=none,
mount -tcifs //<server_ip>/Public /cache_strict -overs=3.0,cache=strict,username=<xxx>,password=<yyy>
mount -tcifs //<server_ip>/Public /cache_none -overs=3.0,cache=none,username=<xxx>,password=<yyy>
The client download a 10GbE file from the server across 1GbE network,
dd if=/cache_strict/10G.img of=/dev/null bs=1M count=10240
dd if=/cache_none/10G.img of=/dev/null bs=1M count=10240
Found that cache=strict (without patch) is slower read throughput and
smaller read IO size than cache=none.
cache=strict (without patch): read throughput 40MB/s, read IO size is 16KB
cache=strict (with patch): read throughput 113MB/s, read IO size is 1MB
cache=none: read throughput 109MB/s, read IO size is 1MB
Looks like if page size is 64KB,
cifs_set_ops() would use cifs_addr_ops_smallbuf instead of cifs_addr_ops,
/* check if server can support readpages */
if (cifs_sb_master_tcon(cifs_sb)->ses->server->maxBuf <
PAGE_SIZE + MAX_CIFS_HDR_SIZE)
inode->i_data.a_ops = &cifs_addr_ops_smallbuf;
else
inode->i_data.a_ops = &cifs_addr_ops;
maxBuf is came from 2 places, SMB2_negotiate() and CIFSSMBNegotiate(),
(SMB2_MAX_BUFFER_SIZE is 64KB)
SMB2_negotiate():
/* set it to the maximum buffer size value we can send with 1 credit */
server->maxBuf = min_t(unsigned int, le32_to_cpu(rsp->MaxTransactSize),
SMB2_MAX_BUFFER_SIZE);
CIFSSMBNegotiate():
server->maxBuf = le32_to_cpu(pSMBr->MaxBufferSize);
Page size 64KB and cache=strict lead to read_pages() use cifs_readpage()
instead of cifs_readpages(), and then cifs_read() using maximum read IO
size 16KB, which is much slower than maximum read IO size 1MB.
(CIFSMaxBufSize is 16KB by default)
/* FIXME: set up handlers for larger reads and/or convert to async */
rsize = min_t(unsigned int, cifs_sb->rsize, CIFSMaxBufSize);
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Jones Syue <jonessyue@qnap.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
We already dump these keys for SMB3, lets also dump it for SMB2
sessions so that we can use the session key in wireshark to check and validate
that the signatures are correct.
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Sparse reports warnings at fsnotify_prepare_user_wait()
and at fsnotify_finish_user_wait()
warning: context imbalance in fsnotify_finish_user_wait()
- wrong count at exit
warning: context imbalance in fsnotify_prepare_user_wait()
- unexpected unlock
The root cause is the missing annotation at fsnotify_finish_user_wait()
and at fsnotify_prepare_user_wait()
fsnotify_prepare_user_wait() has an extra annotation __release()
that only tell Sparse and not GCC to shutdown the warning
Add the missing __acquires(&fsnotify_mark_srcu) annotation
Add the missing __releases(&fsnotify_mark_srcu) annotation
Add the __release(&fsnotify_mark_srcu) annotation.
Link: https://lore.kernel.org/r/20200413214240.15245-1-jbi.octave@gmail.com
Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
When checking for draining with __req_need_defer(), it tries to match
how many requests were sent before a current one with number of already
completed. Dropped SQEs are included in req->sequence, and they won't
ever appear in CQ. To compensate for that, __req_need_defer() substracts
ctx->cached_sq_dropped.
However, what it should really use is number of SQEs dropped __before__
the current one. In other words, any submitted request shouldn't
shouldn't affect dequeueing from the drain queue of previously submitted
ones.
Instead of saving proper ctx->cached_sq_dropped in each request,
substract from req->sequence it at initialisation, so it includes number
of properly submitted requests.
note: it also changes behaviour of timeouts, but
1. it's already diverge from the description because of using SQ
2. the description is ambiguous regarding dropped SQEs
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
req->timeout.count and req->io->timeout.seq_offset store the same value,
which is sqe->off. Kill the second one
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_timeout() can be executed asynchronously by a worker and without
holding ctx->uring_lock
1. using ctx->cached_sq_head there is racy there
2. it should count events from a moment of timeout's submission, but
not execution
Use req->sequence.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl6VskEACgkQxWXV+ddt
WDurrhAAhkxlh6yrdZqr753DcpdEVAQhyHDsJ66GAKWuW8sn7ypTiZhNgKxvEGuz
UhwtXTlzZ7K9/h3TsVeih2iqEj6oc8ick+Th+Wf/7s0jhUXDcWi2OqBjTnIiH2Za
efrwGMiOEAHYqQ7tHjEbZiJGcQ2tE7+2Le4g3aFnv/kRT0jXDikzLTa/viMG73k5
9llSm+GJYl2KQNcUPmxGKrwwiiV5c5xNCGuEuY4lw+3OVn1QU4rayZDB/5GxZ/nC
72Efl9CxoDunBviys2NWxYTt/Ts3R/+yhnGX0kM6BovkN0bo1pA7HuWkADqYPnNN
r8z8X/zFYi7jZBwpPq4alcHW2IaMC7UEseEyZHlj9ce8pK8MnHFlBtfBcUzbvFl5
Wtt23AvAZ9CiQ40Sf5UBt6pliUQhr/BpBz88jatZ619ij1GLxeO++I5bIz3/YFQH
UEP7okhoqpxgKLFGRcpxkw0ggOipp7isFyfss2qaRMPebmNMKnuuUoEy5BDlHs2f
ewxbyuSUVXVBJMB4R6u77Nk5KLrTO67kfiCROaVKkzhYDESpbB4Trdl+kvzPSFb6
p3NYpJoGnkOKngG/vg5MoQGOp1oi4h3RH2Ck1Yes7jmBgYLSCQokCUXkm52PGfId
25P45yOzwS4W7sVFXsR3rygpexXlcNAIGG+2xtiw/AyFIQo5AZ4=
=pkZ2
-----END PGP SIGNATURE-----
Merge tag 'for-5.7-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"We have a few regressions and one fix for stable:
- revert fsync optimization
- fix lost i_size update
- fix a space accounting leak
- build fix, add back definition of a deprecated ioctl flag
- fix search condition for old roots in relocation"
* tag 'for-5.7-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: re-instantiate the removed BTRFS_SUBVOL_CREATE_ASYNC definition
btrfs: fix reclaim counter leak of space_info objects
btrfs: make full fsyncs always operate on the entire file again
btrfs: fix lost i_size update after cloning inline extent
btrfs: check commit root generation in should_ignore_root
We need to drop the inode spinlock while calling nfs4_select_rw_stateid(),
since nfs4_copy_delegation_stateid() could take the delegation lock.
Note that it is safe to do this, since all other calls to
pnfs_update_layout() for that inode will find themselves blocked by
the lock we hold on NFS_LAYOUT_FIRST_LAYOUTGET.
Fixes: fc51b1cf39 ("NFS: Beware when dereferencing the delegation cred")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
The new async dirops callback routines can pass ERR_PTR values to
ceph_mdsc_free_path, which could cause an oops. Make ceph_mdsc_free_path
ignore ERR_PTR values. Also, ensure that the pr_warn messages look sane
even if ceph_mdsc_build_path fails.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
If the request has been marked as canceled, don't try and issue it.
Instead just fill a canceled event and finish the request.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We added this for just the regular poll requests in commit a6ba632d2c
("io_uring: retry poll if we got woken with non-matching mask"), we
should do the same for the poll handler used pollable async requests.
Move the re-wait check and arm into a helper, and call it from
io_async_task_func() as well.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In the reflink extent remap function, it turns out that uirec (the block
mapping corresponding only to the part of the passed-in mapping that got
unmapped) was not fully initialized. Specifically, br_state was not
being copied from the passed-in struct to the uirec. This could lead to
unpredictable results such as the reflinked mapping being marked
unwritten in the destination file.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
The filesystem freeze sequence in XFS waits on any background
eofblocks or cowblocks scans to complete before the filesystem is
quiesced. At this point, the freezer has already stopped the
transaction subsystem, however, which means a truncate or cowblock
cancellation in progress is likely blocked in transaction
allocation. This results in a deadlock between freeze and the
associated scanner.
Fix this problem by holding superblock write protection across calls
into the block reapers. Since protection for background scans is
acquired from the workqueue task context, trylock to avoid a similar
deadlock between freeze and blocking on the write lock.
Fixes: d6b636ebb1 ("xfs: halt auto-reclamation activities while rebuilding rmap")
Reported-by: Paul Furtado <paulfurtado91@gmail.com>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Chandan Rajendra <chandanrlinux@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Allison Collins <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
New struct nfsd4_blocked_lock allocated in find_or_allocate_block()
does not initialized nbl_list and nbl_lru.
If conflock allocation fails rollback can call list_del_init()
access uninitialized fields and corrupt memory.
v2: just initialize nbl_list and nbl_lru right after nbl allocation.
Fixes: 76d348fadf ("nfsd: have nfsd4_lock use blocking locks for v4.1+ lock")
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
If a dentry's version is somewhere between invalid_before and the current
directory version, we should be setting it forward to the current version,
not backwards to the invalid_before version. Note that we're only doing
this at all because dentry::d_fsdata isn't large enough on a 32-bit system.
Fix this by using a separate variable for invalid_before so that we don't
accidentally clobber the current dir version.
Fixes: a4ff7401fb ("afs: Keep track of invalid-before version for dentry coherency")
Signed-off-by: David Howells <dhowells@redhat.com>
AFS directories are retained locally as a structured file, with lookup
being effected by a local search of the file contents. When a modification
(such as mkdir) happens, the dir file content is modified locally rather
than redownloading the directory.
The directory contents are accessed in a number of ways, with a number of
different locks schemes:
(1) Download of contents - dvnode->validate_lock/write in afs_read_dir().
(2) Lookup and readdir - dvnode->validate_lock/read in afs_dir_iterate(),
downgrading from (1) if necessary.
(3) d_revalidate of child dentry - dvnode->validate_lock/read in
afs_do_lookup_one() downgrading from (1) if necessary.
(4) Edit of dir after modification - page locks on individual dir pages.
Unfortunately, because (4) uses different locking scheme to (1) - (3),
nothing protects against the page being scanned whilst the edit is
underway. Even download is not safe as it doesn't lock the pages - relying
instead on the validate_lock to serialise as a whole (the theory being that
directory contents are treated as a block and always downloaded as a
block).
Fix this by write-locking dvnode->validate_lock around the edits. Care
must be taken in the rename case as there may be two different dirs - but
they need not be locked at the same time. In any case, once the lock is
taken, the directory version must be rechecked, and the edit skipped if a
later version has been downloaded by revalidation (there can't have been
any local changes because the VFS holds the inode lock, but there can have
been remote changes).
Fixes: 63a4681ff3 ("afs: Locally edit directory data for mkdir/create/unlink/...")
Signed-off-by: David Howells <dhowells@redhat.com>
Fix the length of the dump of a bad YFSFetchStatus record. The function
was copied from the AFS version, but the YFS variant contains bigger fields
and extra information, so expand the dump to match.
Signed-off-by: David Howells <dhowells@redhat.com>
The afs_deliver_fs_rename() and yfs_deliver_fs_rename() functions both only
decode the second file status returned unless the parent directories are
different - unfortunately, this means that the xdr pointer isn't advanced
and the volsync record will be read incorrectly in such an instance.
Fix this by always decoding the second status into the second
status/callback block which wasn't being used if the dirs were the same.
The afs_update_dentry_version() calls that update the directory data
version numbers on the dentries can then unconditionally use the second
status record as this will always reflect the state of the destination dir
(the two records will be identical if the destination dir is the same as
the source dir)
Fixes: 260a980317 ("[AFS]: Add "directory write" support.")
Fixes: 30062bd13e ("afs: Implement YFS support in the fs client")
Signed-off-by: David Howells <dhowells@redhat.com>
If we're decoding an AFSFetchStatus record and we see that the version is 1
and the abort code is set and we're expecting inline errors, then we store
the abort code and ignore the remaining status record (which is correct),
but we don't set the flag to say we got a valid abort code.
This can affect operation of YFS.RemoveFile2 when removing a file and the
operation of {,Y}FS.InlineBulkStatus when prospectively constructing or
updating of a set of inodes during a lookup.
Fix this to indicate the reception of a valid abort code.
Fixes: a38a75581e ("afs: Fix unlink to handle YFS.RemoveFile2 better")
Signed-off-by: David Howells <dhowells@redhat.com>
If we receive a status record that has VNOVNODE set in the abort field,
xdr_decode_AFSFetchStatus() and xdr_decode_YFSFetchStatus() don't advance
the XDR pointer, thereby corrupting anything subsequent decodes from the
same block of data.
This has the potential to affect AFS.InlineBulkStatus and
YFS.InlineBulkStatus operation, but probably doesn't since the status
records are extracted as individual blocks of data and the buffer pointer
is reset between blocks.
It does affect YFS.RemoveFile2 operation, corrupting the volsync record -
though that is not currently used.
Other operations abort the entire operation rather than returning an error
inline, in which case there is no decoding to be done.
Fix this by unconditionally advancing the xdr pointer.
Fixes: 684b0f68cf ("afs: Fix AFSFetchStatus decoder to provide OpenAFS compatibility")
Signed-off-by: David Howells <dhowells@redhat.com>
The splice file punt check uses file->f_mode to check for O_NONBLOCK,
but it should be checking file->f_flags. This leads to punting even
for files that have O_NONBLOCK set, which isn't necessary. This equates
to checking for FMODE_PATH, which will never be set on the fd in
question.
Fixes: 7d67af2c01 ("io_uring: add splice(2) support")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAl6SeJgACgkQiiy9cAdy
T1GqsQwAiZ/fNCrCFNHT3il6KfYVk4/meU1AYovjXjQ0//xIvn3FqG/sJCOEtWDL
bH1D9xsvVz7sDT9l3GNeMtR17O3Lkc4VDVy1LtNEu4a2oA1MxksPLuJVI9S+Wcu6
Du7Fb4hzbTrGRj7C+6GL9mx1uOldoaGsbUV21D9dcQeAA0kNodrm6FAUMG/55wZO
qB7H/ODaT2CF6s8H9o4GFV5vcasZSBJwaJeFsgoY/BOeQzMQgNnzyvleJTRZnSHk
BAnqiocK2jg0Q/EjJs402/UKTEjZRy4q3mS/+Rpa2BRax0rPqdfRIXAJx2SAShNU
pvinNge5PdWANZ1fti7hcL+e6n5TyinnkS9AOdzuS38yrpFVfJnrRD1igfUAuuME
L64aODnLiAonnAlGcEk341ESt/FMxWnnXIxEPgEGtlyFiyoH13BI763IQYc4HLXQ
8Xo6GkyirDmnflVpFsfmuydhacLjwIxYLnHCAsDspDPEAjRhiKAXTV9d+VpkFgw+
hBNYP/dm
=tstz
-----END PGP SIGNATURE-----
Merge tag '5.7-rc-smb3-fixes-part2' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fixes from Steve French:
"Ten cifs/smb fixes:
- five RDMA (smbdirect) related fixes
- add experimental support for swap over SMB3 mounts
- also a fix which improves performance of signed connections"
* tag '5.7-rc-smb3-fixes-part2' of git://git.samba.org/sfrench/cifs-2.6:
smb3: enable swap on SMB3 mounts
smb3: change noisy error message to FYI
smb3: smbdirect support can be configured by default
cifs: smbd: Do not schedule work to send immediate packet on every receive
cifs: smbd: Properly process errors on ib_post_send
cifs: Allocate crypto structures on the fly for calculating signatures of incoming packets
cifs: smbd: Update receive credits before sending and deal with credits roll back on failure before sending
cifs: smbd: Check send queue size before posting a send
cifs: smbd: Merge code to track pending packets
cifs: ignore cached share root handle closing errors
Requests initialisation is scattered across several functions, namely
io_init_req(), io_submit_sqes(), io_submit_sqe(). Put it
in io_init_req() for better data locality and code clarity.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
It's a good idea to not read sqe->flags twice, as it's prone to security
bugs. Instead of passing it around, embeed them in req->flags. It's
already so except for IOSQE_IO_LINK.
1. rename former REQ_F_LINK -> REQ_F_LINK_HEAD
2. introduce and copy REQ_F_LINK, which mimics IO_IOSQE_LINK
And leave req_set_fail_links() using new REQ_F_LINK, because it's more
sensible.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Having only one place for cleaning up a request after a link assembly/
submission failure will play handy in the future. At least it allows
to remove duplicated cleanup sequence.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
As a preparation for extracting request init bits, remove self-coded mm
tracking from io_submit_sqes(), but rely on current->mm. It's more
convenient, than passing this piece of state in other functions.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If io_submit_sqes() can't grab an mm, it fails and exits right away.
There is no need to track the fact of the failure. Remove @mm_fault.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Another brown paper bag moment. pnfs_alloc_ds_commits_list() is leaking
the RCU lock.
Fixes: a9901899b6 ("pNFS: Add infrastructure for cleaning up per-layout commit structures")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Merge yet more updates from Andrew Morton:
- Almost all of the rest of MM (memcg, slab-generic, slab, pagealloc,
gup, hugetlb, pagemap, memremap)
- Various other things (hfs, ocfs2, kmod, misc, seqfile)
* akpm: (34 commits)
ipc/util.c: sysvipc_find_ipc() should increase position index
kernel/gcov/fs.c: gcov_seq_next() should increase position index
fs/seq_file.c: seq_read(): add info message about buggy .next functions
drivers/dma/tegra20-apb-dma.c: fix platform_get_irq.cocci warnings
change email address for Pali Rohár
selftests: kmod: test disabling module autoloading
selftests: kmod: fix handling test numbers above 9
docs: admin-guide: document the kernel.modprobe sysctl
fs/filesystems.c: downgrade user-reachable WARN_ONCE() to pr_warn_once()
kmod: make request_module() return an error when autoloading is disabled
mm/memremap: set caching mode for PCI P2PDMA memory to WC
mm/memory_hotplug: add pgprot_t to mhp_params
powerpc/mm: thread pgprot_t through create_section_mapping()
x86/mm: introduce __set_memory_prot()
x86/mm: thread pgprot_t through init_memory_mapping()
mm/memory_hotplug: rename mhp_restrictions to mhp_params
mm/memory_hotplug: drop the flags field from struct mhp_restrictions
mm/special: create generic fallbacks for pte_special() and pte_mkspecial()
mm/vma: introduce VM_ACCESS_FLAGS
mm/vma: define a default value for VM_DATA_DEFAULT_FLAGS
...
Fix: Christoph Hellwig noticed that some logic I added to
orangefs_file_read_iter introduced a race condition, so he
sent a reversion patch. I had to modify his patch since
reverting at this point broke Orangefs.
Cleanup 1: Christoph Hellwig noticed that we were doing some unnecessary
work in orangefs_flush, so he sent in a patch that removed
the un-needed code.
Cleanup 2: Al Viro told me he had trouble building Orangefs. Orangefs
should be easy to build, even for Al :-). I looked back
at the test server build notes in orangefs.txt, just in case
that's where the trouble really is, and found a couple of
typos and made a couple of clarifications.
Merge Conflict: Stephen Rothwell reported that my modifications to
orangefs.txt caused a merge conflict with orangefs.rst
in Linux Next. I wasn't sure what to do, so I asked,
and Jonathan Corbet said not to worry about it and
just to report it to Linus.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEIGSFVdO6eop9nER2z0QOqevODb4FAl6Qft8ACgkQz0QOqevO
Db6v1A/7BnEalIB30OPy2PUAcobY3Bqtio2Mb7ULquvZAarVvNgTZIyI4UGt0VBN
bc8KOpKF/ZJIJLjgyorjnA7KGYg+FLdWatTu5GkdEz6DxKbbvXsokJI/uZkXVIGF
BS/Ic0KmOBL+wUjqc/xfJQrm5dSLp2woHWfhyWoKuKJtC0Ut9a0ov0Y8T2X/F4Jw
xV4rMhuEy9zBU6nLHOHEumt9RDUAL2TlK4RJ9on0OmV1W/iizin6GU2q20E6fJVb
C0ShqigQEI+Uo6QXKf9w8eqWH4Vq431L0zlwASR63DgnmMXM4osx3iVETTcOoeg7
xcz7mjMOqutAu1R89Y0BIBglVvZaP+Auds+awFRvcGFoM5s19DlolRYofLqiIdzy
OsQd2oT2DBEHBkYnGCRXnqPLj+Q9/sMs4vNAaHUcEz1pn3x57lgxFFPHQEF6+bLy
WX/cEVRgqjHhieLUCXLI9jVA54mhvsGZuyXputLOA7r3Dqjk5LIWPIGIW1ksjA1T
Kl7Ge/C4i3OOAJDtjcr+kl5LoKu/ppPmg+fOaTQ9qmGWRiwXdfoR4GMT3pEF0Uam
bXwR2CZlsUBIlqO9FmWwrdDddt55gMRbjNcqzgqpCyDKQ5zFM+nWgsl5oHFYkTty
QYMZtslM8icZPxGYucrhqLxaqRJ4kJoCcoxmOyPrfN/0cDWuIDc=
=dNWv
-----END PGP SIGNATURE-----
Merge tag 'for-linus-5.7-ofs1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux
Pull orangefs updates from Mike Marshall:
"A fix and two cleanups.
Fix:
- Christoph Hellwig noticed that some logic I added to
orangefs_file_read_iter introduced a race condition, so he sent a
reversion patch. I had to modify his patch since reverting at this
point broke Orangefs.
Cleanups:
- Christoph Hellwig noticed that we were doing some unnecessary work
in orangefs_flush, so he sent in a patch that removed the un-needed
code.
- Al Viro told me he had trouble building Orangefs. Orangefs should
be easy to build, even for Al :-).
I looked back at the test server build notes in orangefs.txt, just
in case that's where the trouble really is, and found a couple of
typos and made a couple of clarifications"
* tag 'for-linus-5.7-ofs1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux:
orangefs: clarify build steps for test server in orangefs.txt
orangefs: don't mess with I_DIRTY_TIMES in orangefs_flush
orangefs: get rid of knob code...
Patch series "seq_file .next functions should increase position index".
In Aug 2018 NeilBrown noticed commit 1f4aace60b ("fs/seq_file.c:
simplify seq_file iteration code and interface")
"Some ->next functions do not increment *pos when they return NULL...
Note that such ->next functions are buggy and should be fixed. A simple
demonstration is dd if=/proc/swaps bs=1000 skip=1 Choose any block size
larger than the size of /proc/swaps. This will always show the whole
last line of /proc/swaps"
Described problem is still actual. If you make lseek into middle of
last output line following read will output end of last line and whole
last line once again.
$ dd if=/proc/swaps bs=1 # usual output
Filename Type Size Used Priority
/dev/dm-0 partition 4194812 97536 -2
104+0 records in
104+0 records out
104 bytes copied
$ dd if=/proc/swaps bs=40 skip=1 # last line was generated twice
dd: /proc/swaps: cannot skip to specified offset
v/dm-0 partition 4194812 97536 -2
/dev/dm-0 partition 4194812 97536 -2
3+1 records in
3+1 records out
131 bytes copied
There are lot of other affected files, I've found 30+ including
/proc/net/ip_tables_matches and /proc/sysvipc/*
I've sent patches into maillists of affected subsystems already, this
patch-set fixes the problem in files related to pstore, tracing, gcov,
sysvipc and other subsystems processed via linux-kernel@ mailing list
directly
https://bugzilla.kernel.org/show_bug.cgi?id=206283
This patch (of 4):
Add debug code to seq_read() to detect missed or out-of-tree incorrect
.next seq_file functions.
[akpm@linux-foundation.org: s/pr_info/pr_info_ratelimited/, per Qian Cai]
https://bugzilla.kernel.org/show_bug.cgi?id=206283
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: NeilBrown <neilb@suse.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Peter Oberparleiter <oberpar@linux.ibm.com>
Cc: Waiman Long <longman@redhat.com>
Link: http://lkml.kernel.org/r/244674e5-760c-86bd-d08a-047042881748@virtuozzo.com
Link: http://lkml.kernel.org/r/7c24087c-e280-e580-5b0c-0cdaeb14cd18@virtuozzo.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
For security reasons I stopped using gmail account and kernel address is
now up-to-date alias to my personal address.
People periodically send me emails to address which they found in source
code of drivers, so this change reflects state where people can contact
me.
[ Added .mailmap entry as per Joe Perches - Linus ]
Signed-off-by: Pali Rohár <pali@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Joe Perches <joe@perches.com>
Link: http://lkml.kernel.org/r/20200307104237.8199-1-pali@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
After request_module(), nothing is stopping the module from being
unloaded until someone takes a reference to it via try_get_module().
The WARN_ONCE() in get_fs_type() is thus user-reachable, via userspace
running 'rmmod' concurrently.
Since WARN_ONCE() is for kernel bugs only, not for user-reachable
situations, downgrade this warning to pr_warn_once().
Keep it printed once only, since the intent of this warning is to detect
a bug in modprobe at boot time. Printing the warning more than once
wouldn't really provide any useful extra information.
Fixes: 41124db869 ("fs: warn in case userspace lied about modprobe return")
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jessica Yu <jeyu@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Jeff Vander Stoep <jeffv@google.com>
Cc: Jessica Yu <jeyu@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: NeilBrown <neilb@suse.com>
Cc: <stable@vger.kernel.org> [4.13+]
Link: http://lkml.kernel.org/r/20200312202552.241885-3-ebiggers@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linux fallocate(2) with FALLOC_FL_PUNCH_HOLE mode set, its offset can
exceed the inode size. Ocfs2 now doesn't allow that offset beyond inode
size. This restriction is not necessary and violates fallocate(2)
semantics.
If fallocate(2) offset is beyond inode size, just return success and do
nothing further.
Otherwise, ocfs2 will crash the kernel.
kernel BUG at fs/ocfs2//alloc.c:7264!
ocfs2_truncate_inline+0x20f/0x360 [ocfs2]
ocfs2_remove_inode_range+0x23c/0xcb0 [ocfs2]
__ocfs2_change_file_space+0x4a5/0x650 [ocfs2]
ocfs2_fallocate+0x83/0xa0 [ocfs2]
vfs_fallocate+0x148/0x230
SyS_fallocate+0x48/0x80
do_syscall_64+0x79/0x170
Signed-off-by: Changwei Ge <chge@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/20200407082754.17565-1-chge@linux.alibaba.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When removing files containing extended attributes, the hfsplus driver may
remove the wrong entries from the attributes b-tree, causing major
filesystem damage and in some cases even kernel crashes.
To remove a file, all its extended attributes have to be removed as well.
The driver does this by looking up all keys in the attributes b-tree with
the cnid of the file. Each of these entries then gets deleted using the
key used for searching, which doesn't contain the attribute's name when it
should. Since the key doesn't contain the name, the deletion routine will
not find the correct entry and instead remove the one in front of it. If
parent nodes have to be modified, these become corrupt as well. This
causes invalid links and unsorted entries that not even macOS's fsck_hfs
is able to fix.
To fix this, modify the search key before an entry is deleted from the
attributes b-tree by copying the found entry's key into the search key,
therefore ensuring that the correct entry gets removed from the tree.
Signed-off-by: Simon Gander <simon@tuxera.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Anton Altaparmakov <anton@tuxera.com>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/20200327155541.1521-1-simon@tuxera.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull proc fix from Eric Biederman:
"A brown paper bag slipped through my proc changes, and syzcaller
caught it when the code ended up in your tree.
I have opted to fix it the simplest cleanest way I know how, so there
is no reasonable chance for the bug to repeat"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
proc: Use a dedicated lock in struct pid
Add experimental support for allowing a swap file to be on an SMB3
mount. There are use cases where swapping over a secure network
filesystem is preferable. In some cases there are no local
block devices large enough, and network block devices can be
hard to setup and secure. And in some cases there are no
local block devices at all (e.g. with the recent addition of
remote boot over SMB3 mounts).
There are various enhancements that can be added later e.g.:
- doing a mandatory byte range lock over the swapfile (until
the Linux VFS is modified to notify the file system that an open
is for a swapfile, when the file can be opened "DENY_ALL" to prevent
others from opening it).
- pinning more buffers in the underlying transport to minimize memory
allocations in the TCP stack under the fs
- documenting how to create ACLs (on the server) to secure the
swapfile (or adding additional tools to cifs-utils to make it easier)
Signed-off-by: Steve French <stfrench@microsoft.com>
Acked-by: Pavel Shilovsky <pshilov@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl6Pxm0QHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpunoD/9Q5g5G/N14UWQpMMh5SxZ98dZL/ogT2RQt
eCOROX1Iz3MbHAwQbLd8AaeWkzSG9UjRsF8ED7Td1LRkgaDxrBdnlyx19ys+eiPe
ePhM9yoBTpjef3my46/Q7dKUoI1KxWW95QDY0QgRRhP7UQEKRJcppXdNOIs3Bsn4
RstRJi8XulUeS/FUGChydmbJCX4euy/uIB5C8FoWCN70SeF9LiXkhF+YeyUaGYJE
VY9UOTFd1UDcoaZfwaQBxya1hj0mOthax9/1Sw9k21FXwdMDObfpgGtisODRTubQ
z0NyVl17eoPae6oG6YaGht6YRo7t8D98PmjnEN4D38XMl1dzSBFRN0e9e39PvYS+
NN3ICADAJO6xLkSt97GqpE7KPlfYBzap/dQ3Zlcz1LM3nPYEYtCgPp+90D3Ah8+4
rW+/wMb+SOOithY89aE14e50PABDDnTpL1sftbpfxiqculUQrD2HqmwT5Q2V91DQ
61xspJ8H/TYbJ01v1ChYoCaYYNdbJmsJpGcourVErIoRq5rvEH1nCKFS+8UbBboR
CP2K26BxFnQ2ryxdqae99c+E38DJzlZk/51KAeZO3TfIEVKFozaQr/J93FENyKGl
ZUXim5pU7dzGC8v8GECjYzGtY5FA4YjwlsjPRqYzj+hHhuuE/qn3wovH/d7vGvSt
xFZojkj7aA==
=2R2G
-----END PGP SIGNATURE-----
Merge tag 'io_uring-5.7-2020-04-09' of git://git.kernel.dk/linux-block
Pull io_uring fixes from Jens Axboe:
"Here's a set of fixes that either weren't quite ready for the first,
or came about from some intensive testing on memcached with 350K+
sockets.
Summary:
- Fixes for races or deadlocks around poll handling
- Don't double account fixed files against RLIMIT_NOFILE
- IORING_OP_OPENAT LFS fix
- Poll retry handling (Bijan)
- Missing finish_wait() for SQPOLL (Hillf)
- Cleanup/split of io_kiocb alloc vs ctx references (Pavel)
- Fixed file unregistration and init fixes (Xiaoguang)
- Various little fixes (Xiaoguang, Pavel, Colin)"
* tag 'io_uring-5.7-2020-04-09' of git://git.kernel.dk/linux-block:
io_uring: punt final io_ring_ctx wait-and-free to workqueue
io_uring: fix fs cleanup on cqe overflow
io_uring: don't read user-shared sqe flags twice
io_uring: remove req init from io_get_req()
io_uring: alloc req only after getting sqe
io_uring: simplify io_get_sqring
io_uring: do not always copy iovec in io_req_map_rw()
io_uring: ensure openat sets O_LARGEFILE if needed
io_uring: initialize fixed_file_data lock
io_uring: remove redundant variable pointer nxt and io_wq_assign_next call
io_uring: fix ctx refcounting in io_submit_sqes()
io_uring: process requests completed with -EAGAIN on poll list
io_uring: remove bogus RLIMIT_NOFILE check in file registration
io_uring: use io-wq manager as backup task if task is exiting
io_uring: grab task reference for poll requests
io_uring: retry poll if we got woken with non-matching mask
io_uring: add missing finish_wait() in io_sq_thread()
io_uring: refactor file register/unregister/update handling
- Validate the realtime geometry in the superblock when mounting
- Refactor a bunch of tricky flag handling in the log code
- Flush the CIL more judiciously so that we don't wait until there are
millions of log items consuming a lot of memory.
- Throttle transaction commits to prevent the xfs frontend from flooding
the CIL with too many log items.
- Account metadata buffers correctly for memory reclaim.
- Mark slabs properly for memory reclaim. These should help reclaim run
more effectively when XFS is using a lot of memory.
- Don't write a garbage log record at unmount time if we're trying to
trigger summary counter recalculation at next mount.
- Don't block the AIL on locked dquot/inode buffers; instead trigger its
backoff mechanism to give the lock holder a chance to finish up.
- Ratelimit writeback flushing when buffered writes encounter ENOSPC.
- Other minor cleanups.
- Make reflink a synchronous operation when the fs is mounted with wsync
or sync, which means that now we force the log to disk to record the
changes.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAl6LVwIACgkQ+H93GTRK
tOuSaQ/6AmhmSTUU6V5ZP8LptYcmRDzu7ZC8QAINWhVjXhDzbi26bFbEwf/sE0hq
m9yRANKSMBJvcSPB2KdUZLJvgClUaPmrlo1yE3RV8K+FCCj/60VYV8IWuLmeuG/a
xBwAQVBGa/go/6I1gxj+ofiltMCElAYET2nShTyFBX5kVGUsJfW5Cvm94KTtBITi
VP4GmVwyl0Q5rM6+/4wd7IQKfzieBk4DsCPYW26cs+STdVXoHzWgyxeM8AkeM5GW
6SI/UKP3PDHa+3tJ/NucDnCa9ykQsaOykHl6VwVZGDTDb2KFNGqiAqT9fPtx1yUF
30EBPPrhSAxh9wHSmxuU6PpPcMqBomanOe7as66ZLcV4Uij+y1JURWumTDw8Yo3C
edCh/wpWgmKSj95U0D8KPk/ETgMIaBFRVJhQZ2JEgtarPHNxrudsoOcRykftJeTK
l3tPyS5pK8OjVhK927G5rPJd7pqsO0BjjWobno87E3M7D238qyB+A1DAizMIhIid
J9oyu3GGaxg4n228AmOAn7DqV4eGBjTGSLJGh4zUYy0Zk7zK52fKEYh+6z248rxB
Ce0PGB2LXsj9SwbeRMfJAQ49uzXusXv5Ok9ULhk78J+eM8sQi+0KRc2snk++/MSW
c+pVvW77GVihuh1mlBnPIC+OvI+4q/DYPTNttV9yXUoPKchMBHo=
=+M2M
-----END PGP SIGNATURE-----
Merge tag 'xfs-5.7-merge-12' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull more xfs updates from Darrick Wong:
"As promised last week, this batch changes how xfs interacts with
memory reclaim; how the log batches and throttles log items; how hard
writes near ENOSPC will try to squeeze more space out of the
filesystem; and hopefully fix the last of the umount hangs after a
catastrophic failure.
Summary:
- Validate the realtime geometry in the superblock when mounting
- Refactor a bunch of tricky flag handling in the log code
- Flush the CIL more judiciously so that we don't wait until there
are millions of log items consuming a lot of memory.
- Throttle transaction commits to prevent the xfs frontend from
flooding the CIL with too many log items.
- Account metadata buffers correctly for memory reclaim.
- Mark slabs properly for memory reclaim. These should help reclaim
run more effectively when XFS is using a lot of memory.
- Don't write a garbage log record at unmount time if we're trying to
trigger summary counter recalculation at next mount.
- Don't block the AIL on locked dquot/inode buffers; instead trigger
its backoff mechanism to give the lock holder a chance to finish
up.
- Ratelimit writeback flushing when buffered writes encounter ENOSPC.
- Other minor cleanups.
- Make reflink a synchronous operation when the fs is mounted with
wsync or sync, which means that now we force the log to disk to
record the changes"
* tag 'xfs-5.7-merge-12' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (26 commits)
xfs: reflink should force the log out if mounted with wsync
xfs: factor out a new xfs_log_force_inode helper
xfs: fix inode number overflow in ifree cluster helper
xfs: remove redundant variable assignment in xfs_symlink()
xfs: ratelimit inode flush on buffered write ENOSPC
xfs: return locked status of inode buffer on xfsaild push
xfs: trylock underlying buffer on dquot flush
xfs: remove unnecessary ternary from xfs_create
xfs: don't write a corrupt unmount record to force summary counter recalc
xfs: factor inode lookup from xfs_ifree_cluster
xfs: tail updates only need to occur when LSN changes
xfs: factor common AIL item deletion code
xfs: correctly acount for reclaimable slabs
xfs: Improve metadata buffer reclaim accountability
xfs: don't allow log IO to be throttled
xfs: Throttle commits on delayed background CIL push
xfs: Lower CIL flush limit for large logs
xfs: remove some stale comments from the log code
xfs: refactor unmount record writing
xfs: merge xlog_commit_record with xlog_write_done
...
We can't reliably wait in io_ring_ctx_wait_and_kill(), since the
task_works list isn't ordered (in fact it's LIFO ordered). We could
either fix this with a separate task_works list for io_uring work, or
just punt the wait-and-free to async context. This ensures that
task_work that comes in while we're shutting down is processed
correctly. If we don't go async, we could have work past the fput()
work for the ring that depends on work that won't be executed until
after we're done with the wait-and-free. But as this operation is
blocking, it'll never get a chance to run.
This was reproduced with hundreds of thousands of sockets running
memcached, haven't been able to reproduce this synthetically.
Reported-by: Dan Melnic <dmm@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When "ubifs: introduce UBIFS_ATIME_SUPPORT to ubifs" introduced atime
support to ubifs, it also added lazytime support. As far as I can tell
the lazytime support is terminally broken, as it causes
mark_inode_dirty_sync to be called from __writeback_single_inode, which
will then trigger the locking assert in ubifs_dirty_inode. Just remove
the broken lazytime support for now, it can be added back later,
especially as some infrastructure changes should make that easier soon.
Fixes: 8c1c5f2638 ("ubifs: introduce UBIFS_ATIME_SUPPORT to ubifs")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Richard Weinberger <richard@nod.at>
`removed_sized` isn't correctly initialized (as the doc comment
suggests) on memory allocation failures. Fix by moving initialization up
a bit.
Fixes: 0c47383ba3 ("kernfs: Add option to enable user xattrs")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Signed-off-by: Tejun Heo <tj@kernel.org>
The noisy posix error message in readdir was supposed
to be an FYI (not enabled by default)
CIFS VFS: XXX dev 66306, reparse 0, mode 755
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
- A fix for a crash in machine check handling on pseries (ie. guests)
- A small series to make it possible to disable CONFIG_COMPAT, and turn it off
by default for ppc64le where it's not used.
- A few other miscellaneous fixes and small improvements.
Thanks to:
Alexey Kardashevskiy, Anju T Sudhakar, Arnd Bergmann, Christophe Leroy, Dan
Carpenter, Ganesh Goudar, Geert Uytterhoeven, Geoff Levand, Mahesh Salgaonkar,
Markus Elfring, Michal Suchanek, Nicholas Piggin, Stephen Boyd, Wen Xiong.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCAAxFiEEJFGtCPCthwEv2Y/bUevqPMjhpYAFAl6O8LgTHG1wZUBlbGxl
cm1hbi5pZC5hdQAKCRBR6+o8yOGlgMcYEACbGf+Z9brLSasYajoqU6QdqGPacHEN
1a9TEmUnN+HWgtfkkoEBFbyuYHnhyhYuf7hvNccDjDA91ESVylO+Wq7Q+v/xoz29
LVyb3V6uuVMLHnoqwP5jpr0lS0aOpuu3Nc2SpfBuolDtJqeMxpVEEK3Ln3uATlq6
SnEAxQEKb2x4Y4Tfuq5A3txupj0s/UYrmeR6GAdkN3Oapbb9uQl8Ql2smqrZo0cq
6TLxtoFzbfXkV6NmY2V0OKMTeXt0fhrvdDhFEBckCUpRZLv4Fd7CwPWNER2bUVs6
04kg87BAO8qRyfr3G93oP0mWgi65kpXI8yN6Vt5Lig+5PnxNh7swvEqDpQR1s9se
uVoeN7RBHOKQppZRkpzf6yZbelCcMuwTeIBQ9XOx/jYFAHJ2nUkJm9TYgKckVvNY
E4shiM1eoOVMrEmODFBCUmUkJLyn5jU1+r5mn708v2Nb5E1XgoTejitB6bHyL+Aa
zo/0DdsZO86iNE7th94oHQRgVbx1vtP9kV6vK6BLB5M95RSGEdVAEMo5CTL70wr+
hz7suOaijPi+TeCW9YPrcOHcHOQE9pzTFQoVZHuv8egbvd9Rni3aBAMMqHx/lQdE
Hk4zN7IytOLqQTfINNYgoo0kUKI1ipJQOBfR5jyeLhgg6yp36CnO/m+0QGvrILbi
18tlf2qkD75Z3g==
=72sb
-----END PGP SIGNATURE-----
Merge tag 'powerpc-5.7-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull more powerpc updates from Michael Ellerman:
"The bulk of this is the series to make CONFIG_COMPAT user-selectable,
it's been around for a long time but was blocked behind the
syscall-in-C series.
Plus there's also a few fixes and other minor things.
Summary:
- A fix for a crash in machine check handling on pseries (ie. guests)
- A small series to make it possible to disable CONFIG_COMPAT, and
turn it off by default for ppc64le where it's not used.
- A few other miscellaneous fixes and small improvements.
Thanks to: Alexey Kardashevskiy, Anju T Sudhakar, Arnd Bergmann,
Christophe Leroy, Dan Carpenter, Ganesh Goudar, Geert Uytterhoeven,
Geoff Levand, Mahesh Salgaonkar, Markus Elfring, Michal Suchanek,
Nicholas Piggin, Stephen Boyd, Wen Xiong"
* tag 'powerpc-5.7-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
selftests/powerpc: Always build the tm-poison test 64-bit
powerpc: Improve ppc_save_regs()
Revert "powerpc/64: irq_work avoid interrupt when called with hardware irqs enabled"
powerpc/time: Replace <linux/clk-provider.h> by <linux/of_clk.h>
powerpc/pseries/ddw: Extend upper limit for huge DMA window for persistent memory
powerpc/perf: split callchain.c by bitness
powerpc/64: Make COMPAT user-selectable disabled on littleendian by default.
powerpc/64: make buildable without CONFIG_COMPAT
powerpc/perf: consolidate valid_user_sp -> invalid_user_sp
powerpc/perf: consolidate read_user_stack_32
powerpc: move common register copy functions from signal_32.c to signal.c
powerpc: Add back __ARCH_WANT_SYS_LLSEEK macro
powerpc/ps3: Set CONFIG_UEVENT_HELPER=y in ps3_defconfig
powerpc/ps3: Remove an unneeded NULL check
powerpc/ps3: Remove duplicate error message
powerpc/powernv: Re-enable imc trace-mode in kernel
powerpc/perf: Implement a global lock to avoid races between trace, core and thread imc events.
powerpc/pseries: Fix MCE handling on pseries
selftests/eeh: Skip ahci adapters
powerpc/64s: Fix doorbell wakeup msgclr optimisation
syzbot wrote:
> ========================================================
> WARNING: possible irq lock inversion dependency detected
> 5.6.0-syzkaller #0 Not tainted
> --------------------------------------------------------
> swapper/1/0 just changed the state of lock:
> ffffffff898090d8 (tasklist_lock){.+.?}-{2:2}, at: send_sigurg+0x9f/0x320 fs/fcntl.c:840
> but this lock took another, SOFTIRQ-unsafe lock in the past:
> (&pid->wait_pidfd){+.+.}-{2:2}
>
>
> and interrupts could create inverse lock ordering between them.
>
>
> other info that might help us debug this:
> Possible interrupt unsafe locking scenario:
>
> CPU0 CPU1
> ---- ----
> lock(&pid->wait_pidfd);
> local_irq_disable();
> lock(tasklist_lock);
> lock(&pid->wait_pidfd);
> <Interrupt>
> lock(tasklist_lock);
>
> *** DEADLOCK ***
>
> 4 locks held by swapper/1/0:
The problem is that because wait_pidfd.lock is taken under the tasklist
lock. It must always be taken with irqs disabled as tasklist_lock can be
taken from interrupt context and if wait_pidfd.lock was already taken this
would create a lock order inversion.
Oleg suggested just disabling irqs where I have added extra calls to
wait_pidfd.lock. That should be safe and I think the code will eventually
do that. It was rightly pointed out by Christian that sharing the
wait_pidfd.lock was a premature optimization.
It is also true that my pre-merge window testing was insufficient. So
remove the premature optimization and give struct pid a dedicated lock of
it's own for struct pid things. I have verified that lockdep sees all 3
paths where we take the new pid->lock and lockdep does not complain.
It is my current day dream that one day pid->lock can be used to guard the
task lists as well and then the tasklist_lock won't need to be held to
deliver signals. That will require taking pid->lock with irqs disabled.
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Link: https://lore.kernel.org/lkml/00000000000011d66805a25cd73f@google.com/
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Reported-by: syzbot+343f75cdeea091340956@syzkaller.appspotmail.com
Reported-by: syzbot+832aabf700bc3ec920b9@syzkaller.appspotmail.com
Reported-by: syzbot+f675f964019f884dbd0f@syzkaller.appspotmail.com
Reported-by: syzbot+a9fb1457d720a55d6dc5@syzkaller.appspotmail.com
Fixes: 7bc3e6e55a ("proc: Use a list of inodes to flush from proc")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
If completion queue overflow occurs, __io_cqring_fill_event() will
update req->cflags, which is in a union with req->work and happens to
be aliased to req->work.fs. Following io_free_req() ->
io_req_work_drop_env() may get a bunch of different problems (miscount
fs->users, segfault, etc) on cleaning @fs.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
- support for asynchronous create and unlink (Jeff Layton). Creates
and unlinks are satisfied locally, without waiting for a reply from
the MDS, provided the client has been granted appropriate caps (new
in v15.y.z ("Octopus") release). This can be a big help for metadata
heavy workloads such as tar and rsync. Opt-in with the new nowsync
mount option.
- multiple blk-mq queues for rbd (Hannes Reinecke and myself). When
the driver was converted to blk-mq, we settled on a single blk-mq
queue because of a global lock in libceph and some other technical
debt. These have since been addressed, so allocate a queue per CPU
to enhance parallelism.
- don't hold onto caps that aren't actually needed (Zheng Yan). This
has been our long-standing behavior, but it causes issues with some
active/standby applications (synchronous I/O, stalls if the standby
goes down, etc).
- .snap directory timestamps consistent with ceph-fuse (Luis Henriques)
-----BEGIN PGP SIGNATURE-----
iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAl6OEO4THGlkcnlvbW92
QGdtYWlsLmNvbQAKCRBKf944AhHzi0XNB/wItYipkjlL5fIUBqiRWzYai72DWdPp
CnOZo8LB+O0MQDomPT6DdpU1OMlWZi5HF7zklrZ35LTm21UkRNC9zvccjs9l66PJ
qo9cKJbxxju+hgzIvgvK9PjlDlaiFAc/pkF8lZ/NaOnSsM1vvsFL9IuY2LXS38MY
A/uUTZNUnFy5udam8TPuN+gWwZcUIH48lRWQLWe2I/hNJSweX1l8OHvecOBg+cYH
G+8vb7mLU2V9ky0YT5JJmVxUV3CWA5wH6ZrWWy1ofVDdeSFLPrhgWX6IMjaNq+Gd
xPfxmly47uBviSqON9dMkiThgy0Qj7yi0Pvx+1sAZbD7aj/6A4qg3LX5
=GIX0
-----END PGP SIGNATURE-----
Merge tag 'ceph-for-5.7-rc1' of git://github.com/ceph/ceph-client
Pull ceph updates from Ilya Dryomov:
"The main items are:
- support for asynchronous create and unlink (Jeff Layton).
Creates and unlinks are satisfied locally, without waiting for a
reply from the MDS, provided the client has been granted
appropriate caps (new in v15.y.z ("Octopus") release). This can be
a big help for metadata heavy workloads such as tar and rsync.
Opt-in with the new nowsync mount option.
- multiple blk-mq queues for rbd (Hannes Reinecke and myself).
When the driver was converted to blk-mq, we settled on a single
blk-mq queue because of a global lock in libceph and some other
technical debt. These have since been addressed, so allocate a
queue per CPU to enhance parallelism.
- don't hold onto caps that aren't actually needed (Zheng Yan).
This has been our long-standing behavior, but it causes issues with
some active/standby applications (synchronous I/O, stalls if the
standby goes down, etc).
- .snap directory timestamps consistent with ceph-fuse (Luis
Henriques)"
* tag 'ceph-for-5.7-rc1' of git://github.com/ceph/ceph-client: (49 commits)
ceph: fix snapshot directory timestamps
ceph: wait for async creating inode before requesting new max size
ceph: don't skip updating wanted caps when cap is stale
ceph: request new max size only when there is auth cap
ceph: cleanup return error of try_get_cap_refs()
ceph: return ceph_mdsc_do_request() errors from __get_parent()
ceph: check all mds' caps after page writeback
ceph: update i_requested_max_size only when sending cap msg to auth mds
ceph: simplify calling of ceph_get_fmode()
ceph: remove delay check logic from ceph_check_caps()
ceph: consider inode's last read/write when calculating wanted caps
ceph: always renew caps if mds_wanted is insufficient
ceph: update dentry lease for async create
ceph: attempt to do async create when possible
ceph: cache layout in parent dir on first sync create
ceph: add new MDS req field to hold delegated inode number
ceph: decode interval_sets for delegated inos
ceph: make ceph_fill_inode non-static
ceph: perform asynchronous unlink if we have sufficient caps
ceph: don't take refs to want mask unless we have all bits
...
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCXosx9gAKCRDh3BK/laaZ
PPW3AQDYShgXR4gqVL25lxJ1IKaIikIug8EC7S9/a3DMNMpPQAEA+kCrdkpnaJVc
DESmHyMDw7EtgeZXu+xkRlFfEU8BOgc=
=0WQh
-----END PGP SIGNATURE-----
Merge tag 'ovl-update-5.7' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs
Pull overlayfs update from Miklos Szeredi:
- Fix failure to copy-up files from certain NFSv4 mounts
- Sort out inconsistencies between st_ino and i_ino (used in /proc/locks)
- Allow consistent (POSIX-y) inode numbering in more cases
- Allow virtiofs to be used as upper layer
- Miscellaneous cleanups and fixes
* tag 'ovl-update-5.7' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
ovl: document xino expected behavior
ovl: enable xino automatically in more cases
ovl: avoid possible inode number collisions with xino=on
ovl: use a private non-persistent ino pool
ovl: fix WARN_ON nlink drop to zero
ovl: fix a typo in comment
ovl: replace zero-length array with flexible-array member
ovl: ovl_obtain_alias(): don't call d_instantiate_anon() for old
ovl: strict upper fs requirements for remote upper fs
ovl: check if upper fs supports RENAME_WHITEOUT
ovl: allow remote upper
ovl: decide if revalidate needed on a per-dentry basis
ovl: separate detection of remote upper layer from stacked overlay
ovl: restructure dentry revalidation
ovl: ignore failure to copy up unknown xattrs
ovl: document permission model
ovl: simplify i_ino initialization
ovl: factor out helper ovl_get_root()
ovl: fix out of date comment and unreachable code
ovl: fix value of i_ino for lower hardlink corner case
- Fix a problem in readahead where we can crash if we can't allocate a
full bio due to GFP_NORETRY.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAl6GDloACgkQ+H93GTRK
tOuH4g/+J95SMTPKVoChOwJ/vMS7VsPCrbj7GTgDe0LKtaRq5/dQ1fTd3QuSqAh9
0KqCCvAb34y7oP58SccBlr11lRBsJPuWKVJAbiH02Hs+r+mWC+pbWtQz6PCez9wl
hLN44jfsuLHPwivf3oHrmw8B5SHdFjkm659ocLseOJCI2tyQO0kZ8AUSEt22gx8r
w/Z9pX2NCy7Bhzxpd5jvzh7xY27TTcI2M+i/FW9R4hgznM22Z2pZ8MoKTvDj5gly
20UljmP0sTRgMOMavXs7ME2hHwXDitTIkmKD2JZjNkoeKxaRl4iz+9T0E4xZvdoZ
lH+bbUikDocWlOn9zy9+VNBGqtLLKfBp8kOO/I1q1fkW3G/abEPqGvke2ehSmqFx
65MWReBOS0L+g7A0KakHdJVKRuuK3lvXVls+a/2lzTjvjojXCBfy1YCTvhOP2dhT
qO3d2ZPBEr+l0VKwLK4QQHz6/1zfJdUy2rFZZnagRUigKc9xW7oXXwkXRHd0sMRv
KPOLwxeQu5TSf8hVSH9Z6M53TyEQXX2N0Kifr2GUFXkwoxY91I+hFnuGMSafoKTz
CqjOJGNiBFM8ypT/FYRKSakZXEju6zHfyD0tQ0hnEAb2A2HCOJdGqejqbJRZywW2
OM+loTn80spDyuoNpat3/bJBxQLAWX7Dq7m6zL1rkcQLtKxoxik=
=5vQD
-----END PGP SIGNATURE-----
Merge tag 'iomap-5.7-merge-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull iomap fix from Darrick Wong:
"Fix a problem in readahead where we can crash if we can't allocate a
full bio due to GFP_NORETRY"
* tag 'iomap-5.7-merge-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
iomap: Handle memory allocation failure in readahead
- Add support for region alignment configuration and enforcement to
fix compatibility across architectures and PowerPC page size
configurations.
- Introduce 'zero_page_range' as a dax operation. This facilitates
filesystem-dax operation without a block-device.
- Introduce phys_to_target_node() to facilitate drivers that want to
know resulting numa node if a given reserved address range was
onlined.
- Advertise a persistence-domain for of_pmem and papr_scm. The
persistence domain indicates where cpu-store cycles need to reach in
the platform-memory subsystem before the platform will consider them
power-fail protected.
- Promote numa_map_to_online_node() to a cross-kernel generic facility.
- Save x86 numa information to allow for node-id lookups for reserved
memory ranges, deploy that capability for the e820-pmem driver.
- Pick up some miscellaneous minor fixes, that missed v5.6-final,
including a some smatch reports in the ioctl path and some unit test
compilation fixups.
- Fixup some flexible-array declarations.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEf41QbsdZzFdA8EfZHtKRamZ9iAIFAl6LtIAACgkQHtKRamZ9
iAIwRA/8CLVVuQpgHQ1tqK4h8CZPrISFXh7wy7uhocEU2xrDh6iGVnLztmoLRr2k
5f8T9lRzreSAwIVL5DbGqP1pFncqIt9VMnKsFlaPMBGCBNR+hURY0iBCNjIT+jiq
BOzLd52MR2rqJxeXGTMUbWrBrbmuj4mZPdmGVuFFe7GFRpoaVpCgOo+296eWa/ot
gIOFUTonZY7STYjNvDok0TXCmiCFuJb+P+y5ldfCPShHvZhTiaF53jircja8vAjO
G5dt8ixBKUK0rXRc4SEQsQhAZNcAFHb6Gy5lg4C2QzhTF374xTc9usJZNWbIE9iM
5mipBYvjVuoY+XaCNZDkaRcJIy/jqB15O6l3QIWbZLGaK9m95YPp9LmkPFwd3JpO
e3rO24ML471DxqB9iWIiJCNcBBocLOlnd6qAQTpppWDpGNbudwXvfsmKHmKIScSE
x+IDCdscLmmm+WG2dLmLraWOVPu42xZFccoQCi4M3TTqfeB9pZ9XckFQ37zX62zG
5t+7Ek+t1W4QVt/JQYVKH03XT15sqUpVknvx0Hl4Y5TtbDOkFLkO8RN0/HyExDef
7iegS35kqTsM4EfZQ+9juKbI2JBAjHANcbj0V4dogqaRj6vr3akumBzUtuYqAofv
qU3s9skmLsEemOJC+ns2PT8vl5dyIoeDfH0r2XvGWxYqolMqJpA=
=sY4N
-----END PGP SIGNATURE-----
Merge tag 'libnvdimm-for-5.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm and dax updates from Dan Williams:
"There were multiple touches outside of drivers/nvdimm/ this round to
add cross arch compatibility to the devm_memremap_pages() interface,
enhance numa information for persistent memory ranges, and add a
zero_page_range() dax operation.
This cycle I switched from the patchwork api to Konstantin's b4 script
for collecting tags (from x86, PowerPC, filesystem, and device-mapper
folks), and everything looks to have gone ok there. This has all
appeared in -next with no reported issues.
Summary:
- Add support for region alignment configuration and enforcement to
fix compatibility across architectures and PowerPC page size
configurations.
- Introduce 'zero_page_range' as a dax operation. This facilitates
filesystem-dax operation without a block-device.
- Introduce phys_to_target_node() to facilitate drivers that want to
know resulting numa node if a given reserved address range was
onlined.
- Advertise a persistence-domain for of_pmem and papr_scm. The
persistence domain indicates where cpu-store cycles need to reach
in the platform-memory subsystem before the platform will consider
them power-fail protected.
- Promote numa_map_to_online_node() to a cross-kernel generic
facility.
- Save x86 numa information to allow for node-id lookups for reserved
memory ranges, deploy that capability for the e820-pmem driver.
- Pick up some miscellaneous minor fixes, that missed v5.6-final,
including a some smatch reports in the ioctl path and some unit
test compilation fixups.
- Fixup some flexible-array declarations"
* tag 'libnvdimm-for-5.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (29 commits)
dax: Move mandatory ->zero_page_range() check in alloc_dax()
dax,iomap: Add helper dax_iomap_zero() to zero a range
dax: Use new dax zero page method for zeroing a page
dm,dax: Add dax zero_page_range operation
s390,dcssblk,dax: Add dax zero_page_range operation to dcssblk driver
dax, pmem: Add a dax operation zero_page_range
pmem: Add functions for reading/writing page to/from pmem
libnvdimm: Update persistence domain value for of_pmem and papr_scm device
tools/test/nvdimm: Fix out of tree build
libnvdimm/region: Fix build error
libnvdimm/region: Replace zero-length array with flexible-array member
libnvdimm/label: Replace zero-length array with flexible-array member
ACPI: NFIT: Replace zero-length array with flexible-array member
libnvdimm/region: Introduce an 'align' attribute
libnvdimm/region: Introduce NDD_LABELING
libnvdimm/namespace: Enforce memremap_compat_align()
libnvdimm/pfn: Prevent raw mode fallback if pfn-infoblock valid
libnvdimm: Out of bounds read in __nd_ioctl()
acpi/nfit: improve bounds checking for 'func'
mm/memremap_pages: Introduce memremap_compat_align()
...
Whenever we add a ticket to a space_info object we increment the object's
reclaim_size counter witht the ticket's bytes, and we decrement it with
the corresponding amount only when we are able to grant the requested
space to the ticket. When we are not able to grant the space to a ticket,
or when the ticket is removed due to a signal (e.g. an application has
received sigterm from the terminal) we never decrement the counter with
the corresponding bytes from the ticket. This leak can result in the
space reclaim code to later do much more work than necessary. So fix it
by decrementing the counter when those two cases happen as well.
Fixes: db161806dc ("btrfs: account ticket size at add/delete time")
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is a revert of commit 0a8068a3dd ("btrfs: make ranged full
fsyncs more efficient"), with updated comment in btrfs_sync_file.
Commit 0a8068a3dd ("btrfs: make ranged full fsyncs more efficient")
made full fsyncs operate on the given range only as it assumed it was safe
when using the NO_HOLES feature, since the hole detection was simplified
some time ago and no longer was a source for races with ordered extent
completion of adjacent file ranges.
However it's still not safe to have a full fsync only operate on the given
range, because extent maps for new extents might not be present in memory
due to inode eviction or extent cloning. Consider the following example:
1) We are currently at transaction N;
2) We write to the file range [0, 1MiB);
3) Writeback finishes for the whole range and ordered extents complete,
while we are still at transaction N;
4) The inode is evicted;
5) We open the file for writing, causing the inode to be loaded to
memory again, which sets the 'full sync' bit on its flags. At this
point the inode's list of modified extent maps is empty (figuring
out which extents were created in the current transaction and were
not yet logged by an fsync is expensive, that's why we set the
'full sync' bit when loading an inode);
6) We write to the file range [512KiB, 768KiB);
7) We do a ranged fsync (such as msync()) for file range [512KiB, 768KiB).
This correctly flushes this range and logs its extent into the log
tree. When the writeback started an extent map for range [512KiB, 768KiB)
was added to the inode's list of modified extents, and when the fsync()
finishes logging it removes that extent map from the list of modified
extent maps. This fsync also clears the 'full sync' bit;
8) We do a regular fsync() (full ranged). This fsync() ends up doing
nothing because the inode's list of modified extents is empty and
no other changes happened since the previous ranged fsync(), so
it just returns success (0) and we end up never logging extents for
the file ranges [0, 512KiB) and [768KiB, 1MiB).
Another scenario where this can happen is if we replace steps 2 to 4 with
cloning from another file into our test file, as that sets the 'full sync'
bit in our inode's flags and does not populate its list of modified extent
maps.
This was causing test case generic/457 to fail sporadically when using the
NO_HOLES feature, as it exercised this later case where the inode has the
'full sync' bit set and has no extent maps in memory to represent the new
extents due to extent cloning.
Fix this by reverting commit 0a8068a3dd ("btrfs: make ranged full fsyncs
more efficient") since there is no easy way to work around it.
Fixes: 0a8068a3dd ("btrfs: make ranged full fsyncs more efficient")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When not using the NO_HOLES feature we were not marking the destination's
file range as written after cloning an inline extent into it. This can
lead to a data loss if the current destination file size is smaller than
the source file's size.
Example:
$ mkfs.btrfs -f -O ^no-holes /dev/sdc
$ mount /mnt/sdc /mnt
$ echo "hello world" > /mnt/foo
$ cp --reflink=always /mnt/foo /mnt/bar
$ rm -f /mnt/foo
$ umount /mnt
$ mount /mnt/sdc /mnt
$ cat /mnt/bar
$
$ stat -c %s /mnt/bar
0
# -> the file is empty, since we deleted foo, the data lost is forever
Fix that by calling btrfs_inode_set_file_extent_range() after cloning an
inline extent.
A test case for fstests will follow soon.
Link: https://lore.kernel.org/linux-btrfs/20200404193846.GA432065@latitude/
Reported-by: Johannes Hirte <johannes.hirte@datenkhaos.de>
Fixes: 9ddc959e80 ("btrfs: use the file extent tree infrastructure")
Tested-by: Johannes Hirte <johannes.hirte@datenkhaos.de>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Previously we would set the reloc root's last snapshot to transid - 1.
However there was a problem with doing this, and we changed it to
setting the last snapshot to the generation of the commit node of the fs
root.
This however broke should_ignore_root(). The assumption is that if we
are in a generation newer than when the reloc root was created, then we
would find the reloc root through normal backref lookups, and thus can
ignore any fs roots we find with an old enough reloc root.
Now that the last snapshot could be considerably further in the past
than before, we'd end up incorrectly ignoring an fs root. Thus we'd
find no nodes for the bytenr we were searching for, and we'd fail to
relocate anything. We'd loop through the relocate code again and see
that there were still used space in that block group, attempt to
relocate those bytenr's again, fail in the same way, and just loop like
this forever. This is tricky in that we have to not modify the fs root
at all during this time, so we need to have a block group that has data
in this fs root that is not shared by any other root, which is why this
has been difficult to reproduce.
Fixes: 054570a1dc ("Btrfs: fix relocation incorrectly dropping data references")
CC: stable@vger.kernel.org # 4.9+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Don't re-read userspace-shared sqe->flags, it can be exploited.
sqe->flags are copied into req->flags in io_submit_sqe(), check them
there instead.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_get_req() do two different things: io_kiocb allocation and
initialisation. Move init part out of it and rename into
io_alloc_req(). It's simpler this way and also have better data
locality.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
As io_get_sqe() split into 2 stage get/consume, get an sqe before
allocating io_kiocb, so no free_req*() for a failure case is needed,
and inline back __io_req_do_free(), which has only 1 user.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Make io_get_sqring() care only about sqes themselves, not initialising
the io_kiocb. Also, split it into get + consume, that will be helpful in
the future.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In io_read_prep() or io_write_prep(), io_req_map_rw() takes
struct io_async_rw's fast_iov as argument to call io_import_iovec(),
and if io_import_iovec() uses struct io_async_rw's fast_iov as
valid iovec array, later indeed io_req_map_rw() does not need
to do the memcpy operation, because they are same pointers.
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
OPENAT2 correctly sets O_LARGEFILE if it has to, but that escaped the
OPENAT opcode. Dmitry reports that his test case that compares openat()
and IORING_OP_OPENAT sees failures on large files:
*** sync openat
openat succeeded
sync write at offset 0
write succeeded
sync write at offset 4294967296
write succeeded
*** sync openat
openat succeeded
io_uring write at offset 0
write succeeded
io_uring write at offset 4294967296
write succeeded
*** io_uring openat
openat succeeded
sync write at offset 0
write succeeded
sync write at offset 4294967296
write failed: File too large
*** io_uring openat
openat succeeded
io_uring write at offset 0
write succeeded
io_uring write at offset 4294967296
write failed: File too large
Ensure we set O_LARGEFILE, if force_o_largefile() is true.
Cc: stable@vger.kernel.org # v5.6
Fixes: 15b71abe7b ("io_uring: add support for IORING_OP_OPENAT")
Reported-by: Dmitry Kadashev <dkadashev@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig noticed that we were doing some unnecessary
work in orangefs_flush:
orangefs_flush just writes out data on every close(2) call. There is
no need to change anything about the dirty state, especially as
orangefs doesn't treat I_DIRTY_TIMES special in any way. The code
seems to come from partially open coding vfs_fsync.
He sent in a patch with the above commit message and also a
patch that was a reversion of another Orangefs patch I had
sent upstream a while ago. I had to fix his reversion patch
so that it would compile which caused his "don't mess with
I_DIRTY_TIMES" patch to fail to apply. So here I have just
remade his patch and applied it after the fixed reversion patch.
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
Christoph Hellwig sent in a reversion of "orangefs: remember count
when reading." because:
->read_iter calls can race with each other and one or
more ->flush calls. Remove the the scheme to store the read
count in the file private data as is is completely racy and
can cause use after free or double free conditions
Christoph's reversion caused Orangefs not to work or to compile. I
added a patch that fixed that, but intel's kbuild test robot pointed
out that sending Christoph's patch followed by my patch upstream, it
would break bisection because of the failure to compile. So I have
combined the reversion plus my patch... here's the commit message
that was in my patch:
Logically, optimal Orangefs "pages" are 4 megabytes. Reading
large Orangefs files 4096 bytes at a time is like trying to
kick a dead whale down the beach. Before Christoph's "Revert
orangefs: remember count when reading." I tried to give users
a knob whereby they could, for example, use "count" in
read(2) or bs with dd(1) to get whatever they considered an
appropriate amount of bytes at a time from Orangefs and fill
as many page cache pages as they could at once.
Without the racy code that Christoph reverted Orangefs won't
even compile, much less work. So this replaces the logic that
used the private file data that Christoph reverted with
a static number of bytes to read from Orangefs.
I ran tests like the following to determine what a
reasonable static number of bytes might be:
dd if=/pvfsmnt/asdf of=/dev/null count=128 bs=4194304
dd if=/pvfsmnt/asdf of=/dev/null count=256 bs=2097152
dd if=/pvfsmnt/asdf of=/dev/null count=512 bs=1048576
.
.
.
dd if=/pvfsmnt/asdf of=/dev/null count=4194304 bs=128
Reads seem faster using the static number, so my "knob code"
wasn't just racy, it wasn't even a good idea...
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
Reported-by: kbuild test robot <lkp@intel.com>
Merge more updates from Andrew Morton:
- a lot more of MM, quite a bit more yet to come: (memcg, pagemap,
vmalloc, pagealloc, migration, thp, ksm, madvise, virtio,
userfaultfd, memory-hotplug, shmem, rmap, zswap, zsmalloc, cleanups)
- various other subsystems (procfs, misc, MAINTAINERS, bitops, lib,
checkpatch, epoll, binfmt, kallsyms, reiserfs, kmod, gcov, kconfig,
ubsan, fault-injection, ipc)
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (158 commits)
ipc/shm.c: make compat_ksys_shmctl() static
ipc/mqueue.c: fix a brace coding style issue
lib/Kconfig.debug: fix a typo "capabilitiy" -> "capability"
ubsan: include bug type in report header
kasan: unset panic_on_warn before calling panic()
ubsan: check panic_on_warn
drivers/misc/lkdtm/bugs.c: add arithmetic overflow and array bounds checks
ubsan: split "bounds" checker from other options
ubsan: add trap instrumentation option
init/Kconfig: clean up ANON_INODES and old IO schedulers options
kernel/gcov/fs.c: replace zero-length array with flexible-array member
gcov: gcc_3_4: replace zero-length array with flexible-array member
gcov: gcc_4_7: replace zero-length array with flexible-array member
kernel/kmod.c: fix a typo "assuems" -> "assumes"
reiserfs: clean up several indentation issues
kallsyms: unexport kallsyms_lookup_name() and kallsyms_on_each_symbol()
samples/hw_breakpoint: drop use of kallsyms_lookup_name()
samples/hw_breakpoint: drop HW_BREAKPOINT_R when reporting writes
fs/binfmt_elf.c: don't free interpreter's ELF pheaders on common path
fs/binfmt_elf.c: allocate less for static executable
...
Highlights include:
Stable fixes:
- Fix a page leak in nfs_destroy_unlinked_subrequests()
- Fix use-after-free issues in nfs_pageio_add_request()
- Fix new mount code constant_table array definitions
- finish_automount() requires us to hold 2 refs to the mount record
Features:
- Improve the accuracy of telldir/seekdir by using 64-bit cookies when
possible.
- Allow one RDMA active connection and several zombie connections to
prevent blocking if the remote server is unresponsive.
- Limit the size of the NFS access cache by default
- Reduce the number of references to credentials that are taken by NFS
- pNFS files and flexfiles drivers now support per-layout segment
COMMIT lists.
- Enable partial-file layout segments in the pNFS/flexfiles driver.
- Add support for CB_RECALL_ANY to the pNFS flexfiles layout type
- pNFS/flexfiles Report NFS4ERR_DELAY and NFS4ERR_GRACE errors from
the DS using the layouterror mechanism.
Bugfixes and cleanups:
- SUNRPC: Fix krb5p regressions
- Don't specify NFS version in "UDP not supported" error
- nfsroot: set tcp as the default transport protocol
- pnfs: Return valid stateids in nfs_layout_find_inode_by_stateid()
- alloc_nfs_open_context() must use the file cred when available
- Fix locking when dereferencing the delegation cred
- Fix memory leaks in O_DIRECT when nfs_get_lock_context() fails
- Various clean ups of the NFS O_DIRECT commit code
- Clean up RDMA connect/disconnect
- Replace zero-length arrays with C99-style flexible arrays
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEESQctxSBg8JpV8KqEZwvnipYKAPIFAl6LhhsACgkQZwvnipYK
APIOJxAAiQOgmIg1CV4mrlcVhkwy09N5JAia6AENtoTmwm08nAYg5Y8REb9uX46a
/MJsM2WG8hBCgI6eYmRY8LTr4Ft9rTQEJM9DRMuwQREXwMWwBhUv/QakCeqY1lHE
lyB1z4hj5XKeUoN/OcfALC/GXFFf56A0UyN05nMzeCkBTdd3+qu+hW8Ge1wkAXcr
f0pyLbzdFZlJuTmI4tr8F93g9p3ezuFBuEroT7XPIVJylAdZVumHqnOnz/Mvb99x
rNTsX2dc44GhSAfRnTzPumU3MT6BOLvUzNH1xzdiqKzJrbOnG8WjFodrGr3JWpfp
HkeyYQxJ+Hnfb2LiZBjvMQE8M7kVMZ1jVbrGJEbCxfSqgTly8lOHboqAeKsFaReK
LStnusizdA1LHQVZxPdvn+oL49RDxnzm9dY+DkrXK1qT0GE+icN1CyTyLLfkSCp8
tYvZSJ/qPk5BNZegqH1nBqXkMDkOJ4eEA7+luXDmajRkdRrZ3IWY2M1DpMEoueJ2
j/zoj/NFr1oErU4o7PV9oolA1Euhn1L3wIDuzsbVtjySmbXJNQTtaVVRFpGw3SsZ
7rbqi4BB0SzOooNhQ4q8mLNi4qT7bl/3D04eL8UVzEM73plexhQ8XiOEz/VrIRX7
L9viXH49g4DHQ0rZIaWefxFueqpgbNvQwnlLZl2uQotG9hwhTts=
=YUcP
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-5.7-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client updates from Trond Myklebust:
"Highlights include:
Stable fixes:
- Fix a page leak in nfs_destroy_unlinked_subrequests()
- Fix use-after-free issues in nfs_pageio_add_request()
- Fix new mount code constant_table array definitions
- finish_automount() requires us to hold 2 refs to the mount record
Features:
- Improve the accuracy of telldir/seekdir by using 64-bit cookies
when possible.
- Allow one RDMA active connection and several zombie connections to
prevent blocking if the remote server is unresponsive.
- Limit the size of the NFS access cache by default
- Reduce the number of references to credentials that are taken by
NFS
- pNFS files and flexfiles drivers now support per-layout segment
COMMIT lists.
- Enable partial-file layout segments in the pNFS/flexfiles driver.
- Add support for CB_RECALL_ANY to the pNFS flexfiles layout type
- pNFS/flexfiles Report NFS4ERR_DELAY and NFS4ERR_GRACE errors from
the DS using the layouterror mechanism.
Bugfixes and cleanups:
- SUNRPC: Fix krb5p regressions
- Don't specify NFS version in "UDP not supported" error
- nfsroot: set tcp as the default transport protocol
- pnfs: Return valid stateids in nfs_layout_find_inode_by_stateid()
- alloc_nfs_open_context() must use the file cred when available
- Fix locking when dereferencing the delegation cred
- Fix memory leaks in O_DIRECT when nfs_get_lock_context() fails
- Various clean ups of the NFS O_DIRECT commit code
- Clean up RDMA connect/disconnect
- Replace zero-length arrays with C99-style flexible arrays"
* tag 'nfs-for-5.7-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (86 commits)
NFS: Clean up process of marking inode stale.
SUNRPC: Don't start a timer on an already queued rpc task
NFS/pnfs: Reference the layout cred in pnfs_prepare_layoutreturn()
NFS/pnfs: Fix dereference of layout cred in pnfs_layoutcommit_inode()
NFS: Beware when dereferencing the delegation cred
NFS: Add a module parameter to set nfs_mountpoint_expiry_timeout
NFS: finish_automount() requires us to hold 2 refs to the mount record
NFS: Fix a few constant_table array definitions
NFS: Try to join page groups before an O_DIRECT retransmission
NFS: Refactor nfs_lock_and_join_requests()
NFS: Reverse the submission order of requests in __nfs_pageio_add_request()
NFS: Clean up nfs_lock_and_join_requests()
NFS: Remove the redundant function nfs_pgio_has_mirroring()
NFS: Fix memory leaks in nfs_pageio_stop_mirroring()
NFS: Fix a request reference leak in nfs_direct_write_clear_reqs()
NFS: Fix use-after-free issues in nfs_pageio_add_request()
NFS: Fix races nfs_page_group_destroy() vs nfs_destroy_unlinked_subrequests()
NFS: Fix a page leak in nfs_destroy_unlinked_subrequests()
NFS: Remove unused FLUSH_SYNC support in nfs_initiate_pgio()
pNFS/flexfiles: Specify the layout segment range in LAYOUTGET
...
In this round, we've mainly focused on fixing bugs and addressing issues in
recently introduced compression support.
Enhancement:
- add zstd support, and set LZ4 by default
- add ioctl() to show # of compressed blocks
- show mount time in debugfs
- replace rwsem with spinlock
- avoid lock contention in DIO reads
Some major bug fixes wrt compression:
- compressed block count
- memory access and leak
- remove obsolete fields
- flag controls
Other bug fixes and clean ups:
- fix overflow when handling .flags in inode_info
- fix SPO issue during resize FS flow
- fix compression with fsverity enabled
- potential deadlock when writing compressed pages
- show missing mount options
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE00UqedjCtOrGVvQiQBSofoJIUNIFAl6L5f0ACgkQQBSofoJI
UNImoQ/7BHKpwgpgH/DuydO9ess0XuUgQPQxyj+LE79l0jdBo8FxQZJVNAekx2+h
ANTDHjsgqry6xuczOJXzFhECoOrCqZuffrkQM1p3owfzOH9Wrx6aiOomFBJyk/WB
kXAY7LDPUwGF5uDB8tvVhM082qLXOlP0coO57f9Ip/OpaG8YOkti+KbcOKJrJ9o3
63IAzu89D/9XJw8834rDSersiEenSJm3jLC12uOxgXMzjDb1Ul1JH0P51gEu6g6O
3df5tiGJcFfaKVVWPHG+UEfav6mvi28+zU9f+dSmL0Wvb8dIqSgUf6ty8KCuRBYw
kQYi+E0G1dcGi99AitCOrFxzY+oEo/A5wq2HDI6RkNlu6krD8qYNyWcMbP7dNHnU
/5BVN+5d78iR1vKZpTo4X6dojf6B21Tn/OmABSZ5d7B6pI2fY5bjy2WgSLSY7YvP
A6sepb9RAAGvpKvvkHI7gYwDKFMel+vfD6em1SKH5iKaDC0rJTUDUy8PTz/qMPBS
vMn396dLx+TzTa0dZUuSF8NNk6sPZEReC3AuMNAIPSKiuD7tatRxvutHeEg5ktrr
ggOQB67MfKjPMBKmgMIm6XMuILcCGIB1MqbPRlyKtC6rjdMPIKKOfeHJlLFmYwfF
gqvCIFlW4DlxHpHH+LbUFKoUA3zofltL91SHUVATJjmiZIT2pqQ=
=FVxq
-----END PGP SIGNATURE-----
Merge tag 'f2fs-for-5.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
Pull f2fs updates from Jaegeuk Kim:
"In this round, we've mainly focused on fixing bugs and addressing
issues in recently introduced compression support.
Enhancement:
- add zstd support, and set LZ4 by default
- add ioctl() to show # of compressed blocks
- show mount time in debugfs
- replace rwsem with spinlock
- avoid lock contention in DIO reads
Some major bug fixes wrt compression:
- compressed block count
- memory access and leak
- remove obsolete fields
- flag controls
Other bug fixes and clean ups:
- fix overflow when handling .flags in inode_info
- fix SPO issue during resize FS flow
- fix compression with fsverity enabled
- potential deadlock when writing compressed pages
- show missing mount options"
* tag 'f2fs-for-5.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (66 commits)
f2fs: keep inline_data when compression conversion
f2fs: fix to disable compression on directory
f2fs: add missing CONFIG_F2FS_FS_COMPRESSION
f2fs: switch discard_policy.timeout to bool type
f2fs: fix to verify tpage before releasing in f2fs_free_dic()
f2fs: show compression in statx
f2fs: clean up dic->tpages assignment
f2fs: compress: support zstd compress algorithm
f2fs: compress: add .{init,destroy}_decompress_ctx callback
f2fs: compress: fix to call missing destroy_compress_ctx()
f2fs: change default compression algorithm
f2fs: clean up {cic,dic}.ref handling
f2fs: fix to use f2fs_readpage_limit() in f2fs_read_multi_pages()
f2fs: xattr.h: Make stub helpers inline
f2fs: fix to avoid double unlock
f2fs: fix potential .flags overflow on 32bit architecture
f2fs: fix NULL pointer dereference in f2fs_verity_work()
f2fs: fix to clear PG_error if fsverity failed
f2fs: don't call fscrypt_get_encryption_info() explicitly in f2fs_tmpfile()
f2fs: don't trigger data flush in foreground operation
...
- Fix for memory leaks around UBIFS orphan handling
- Fix for memory leaks around UBI fastmap
- Remove zero-length array from ubi-media.h
- Fix for TNC lookup in UBIFS orphan code
-----BEGIN PGP SIGNATURE-----
iQJKBAABCAA0FiEEdgfidid8lnn52cLTZvlZhesYu8EFAl6McBEWHHJpY2hhcmRA
c2lnbWEtc3Rhci5hdAAKCRBm+VmF6xi7wbxzD/0RXX1RmnLUtyWvurC+Ge+wJR9k
T/LvMrT5TAUpbXsDTFmbhQJiWlnu70Pg2Gcqr+yXz/YboAr1+G0G3SonC1TYWNEe
8aGsWMGeuiB1GTfrGJgyj3M/YxRUYHgRz75mUXZ0kQkraWYbIuKDpudsSAVcLV5Q
jO0JQeZmaCB0w3pvaP1CpXEl9+CXnCAnEoheUgAjC9THUPYeAvI3mRbrDJL50Ine
9PDAkP++LtYKRKS0tMhlDtbrjoFWMySmvOG6t7nkWhdj7IdLxCZuYgCV5CvUUiuv
oBglozfzBEPp2sA4NlSghEg3zedrGSauDL/Ns3djPmpSWNJIuE5Q8bHZGO9msqR9
MDoPTuSirY3L4ARGpG6rXc0holGq9NniQpa1ArnWnSUtkIjz22dT1V41dEumquxm
PganYebBWe3tjSmSc8Jm1xNKscwTPYr6kBApLcDWaxdASD56Q95raAVNJLkP/kKz
USb6ZTHunyZQALgqKFgFMc8Nm8Eyu+sXikQscYgR0vJsYy0gaQy52/F3UG2gBpjP
9ULoHTilWU6cx07JdawgJNLfgj37ov3wlM5rW1G50TP7GpzAYlzPl7atozbyogii
WeRA0AwM8LHcTggseqOtA3B8606giVlk5Iw2YJ7DkK1ArSAncThbb2rtm/PPyPWv
JdG2bHXj3PnzSQfMDw==
=RpJX
-----END PGP SIGNATURE-----
Merge tag 'for-linus-5.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs
Pull UBI and UBIFS updates from Richard Weinberger:
- Fix for memory leaks around UBIFS orphan handling
- Fix for memory leaks around UBI fastmap
- Remove zero-length array from ubi-media.h
- Fix for TNC lookup in UBIFS orphan code
* tag 'for-linus-5.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs:
ubi: ubi-media.h: Replace zero-length array with flexible-array member
ubifs: Fix out-of-bounds memory access caused by abnormal value of node_len
ubi: fastmap: Only produce the initial anchor PEB when fastmap is used
ubi: fastmap: Free unused fastmap anchor peb during detach
ubifs: ubifs_add_orphan: Fix a memory leak bug
ubifs: ubifs_jnl_write_inode: Fix a memory leak bug
ubifs: Fix ubifs_tnc_lookup() usage in do_kill_orphans()
- New mode for time travel, external via virtio
- Fixes for ubd to make sure no requests can get lost
- Fixes for vector networking
- Allow CONFIG_STATIC_LINK only when possible
- Minor cleanups and fixes
-----BEGIN PGP SIGNATURE-----
iQJKBAABCAA0FiEEdgfidid8lnn52cLTZvlZhesYu8EFAl6MbGYWHHJpY2hhcmRA
c2lnbWEtc3Rhci5hdAAKCRBm+VmF6xi7wSY2D/4k1kb3A5pZ6OEXCkKmRU63j0RC
na0bsa4lztMuABgOWKXP09cqL2ZhJ1rVVRUMV7jgVFKj7rKkJHHGHgdBeEkXOcb8
skOVxln1X/i3T9q9QQ4ofkSk0U8gHCZA3pqrn7TFI9ZmrosOUYwhQKkqcNHvSfPc
XEjKUx1GCS+wA0mw5yLyDZqDGkZgMNSmNezR7Oq3EB9wi8K2n6Racn6//S/uqiS6
I8HHE7R2ci0YfflP+xE8i1qg8/TY2wj2oCP33b9o/XefyyNSndVj7KQUI3KRBmSh
M0k2sbOqegVzSH/l5YFIZ7zbDcqkYeGWopPIuYWo3en7ZmfJfP2KD31c8gPOuElC
HuUvQyS1VDpLn6JBa8Y456e8IrKl/QquXfZDc2qG5HYTR6g9nv9y8VNtx4dSQ+sB
AfgErKofx7x2JQNRfg+0BYKgw/MawGAjiSZm5qVNfvFM3YDWZSUZ9gEAcX6qto/z
P+66Zrhatdt9TaQdy9vbQKDWSJk9ood2mQYU0JJSfzgsotWslyvCsc6ANtwfkc7R
sLxnsa6EA7CYogbMJ7wRxD5spCNZrRZvepHhe5uft/nWG/qGM1jy7Vk16Or03sVH
sScIp6m+yDyhhEjJOT8Mq6WbM3mIfILMb42FyDJQIpJ9JcXSxzbiZu7RSK38yoEG
+WYGOYdTGgzxIWsRmQ==
=WVcL
-----END PGP SIGNATURE-----
Merge tag 'for-linus-5.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml
Pull UML updates from Richard Weinberger:
- New mode for time travel, external via virtio
- Fixes for ubd to make sure no requests can get lost
- Fixes for vector networking
- Allow CONFIG_STATIC_LINK only when possible
- Minor cleanups and fixes
* tag 'for-linus-5.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml:
um: Remove some unnecessary NULL checks in vector_user.c
um: vector: Avoid NULL ptr deference if transport is unset
um: Make CONFIG_STATIC_LINK actually static
um: Implement cpu_relax() as ndelay(1) for time-travel
um: Implement ndelay/udelay in time-travel mode
um: Implement time-travel=ext
um: virtio: Implement VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS
um: time-travel: Rewrite as an event scheduler
um: Move timer-internal.h to non-shared
hostfs: Use kasprintf() instead of fixed buffer formatting
um: falloc.h needs to be directly included for older libc
um: ubd: Retry buffer read on any kind of error
um: ubd: Prevent buffer overrun on command completion
um: Fix overlapping ELF segments when statically linked
um: Delete never executed timer
um: Don't overwrite ethtool driver version
um: Fix len of file in create_pid_file
um: Don't use console_drivers directly
um: Cleanup CONFIG_IOSCHED_CFQ
smbdirect support (SMB3 over RDMA) should be enabled by
default in many configurations.
It is not experimental and is stable enough and has enough
performance benefits to recommend that it be configured by
default. Change the "If unsure N" to "If unsure Y" in
the description of the configuration parameter.
Acked-by: Aurelien Aptel <aaptel@suse.com>
Reviewed-by: Long Li <longli@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
There are several places where code is indented incorrectly. Fix these.
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20200325135018.113431-1-colin.king@canonical.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Static executables don't need to free NULL pointer.
It doesn't matter really because static executable is not common scenario
but do it anyway out of pedantry.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20200219185330.GA4933@avx2
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
PT_INTERP ELF header can be spared if executable is static.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20200219185012.GB4871@avx2
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
"loc" variable became just a wrapper for PT_INTERP ELF header after main
ELF header was moved to "bprm->buf". Delete it.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20200219184847.GA4871@avx2
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Davidlohr Bueso pointed out that when CONFIG_DEBUG_LOCK_ALLOC is set
ep_poll_safewake() can take several non-raw spinlocks after disabling
interrupts. Since a spinlock can block in the -rt kernel, we can't take a
spinlock after disabling interrupts. So let's re-work how we determine
the nesting level such that it plays nicely with the -rt kernel.
Let's introduce a 'nests' field in struct eventpoll that records the
current nesting level during ep_poll_callback(). Then, if we nest again
we can find the previous struct eventpoll that we were called from and
increase our count by 1. The 'nests' field is protected by
ep->poll_wait.lock.
I've also moved the visited field to reduce the size of struct eventpoll
from 184 bytes to 176 bytes on x86_64 for !CONFIG_DEBUG_LOCK_ALLOC, which
is typical for a production config.
Reported-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Jason Baron <jbaron@akamai.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Roman Penyaev <rpenyaev@suse.de>
Cc: Eric Wong <normalperson@yhbt.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Link: http://lkml.kernel.org/r/1582739816-13167-1-git-send-email-jbaron@akamai.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It's clearer to just put this inline.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20200317193201.9924-5-adobriyan@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The process maps file was the only user of version (introduced back in
2005). Now that it uses ppos instead, we can remove it.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20200317193201.9924-4-adobriyan@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The ppos is a private cursor, just like m->version. Use the canonical
cursor, not a special one.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20200317193201.9924-3-adobriyan@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Instead of setting m->version in the show method, set it in m_next(),
where it should be. Also remove the fallback code for failing to find a
vma, or version being zero.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20200317193201.9924-2-adobriyan@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Instead of calling vma_stop() from m_start() and m_next(), do its work
in m_stop().
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20200317193201.9924-1-adobriyan@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
top(1) reads all /proc/*/statm files but kernel threads will always have
zeros. Print those zeroes directly without going through
seq_put_decimal_ull().
Speed up reading /proc/2/statm (which is kthreadd) is like 3%.
My system has more kernel threads than normal processes after booting KDE.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20200307154435.GA2788@avx2
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now that "struct proc_ops" exist we can start putting there stuff which
could not fly with VFS "struct file_operations"...
Most of fs/proc/inode.c file is dedicated to make open/read/.../close
reliable in the event of disappearing /proc entries which usually happens
if module is getting removed. Files like /proc/cpuinfo which never
disappear simply do not need such protection.
Save 2 atomic ops, 1 allocation, 1 free per open/read/close sequence for such
"permanent" files.
Enable "permanent" flag for
/proc/cpuinfo
/proc/kmsg
/proc/modules
/proc/slabinfo
/proc/stat
/proc/sysvipc/*
/proc/swaps
More will come once I figure out foolproof way to prevent out module
authors from marking their stuff "permanent" for performance reasons
when it is not.
This should help with scalability: benchmark is "read /proc/cpuinfo R times
by N threads scattered over the system".
N R t, s (before) t, s (after)
-----------------------------------------------------
64 4096 1.582458 1.530502 -3.2%
256 4096 6.371926 6.125168 -3.9%
1024 4096 25.64888 24.47528 -4.6%
Benchmark source:
#include <chrono>
#include <iostream>
#include <thread>
#include <vector>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
const int NR_CPUS = sysconf(_SC_NPROCESSORS_ONLN);
int N;
const char *filename;
int R;
int xxx = 0;
int glue(int n)
{
cpu_set_t m;
CPU_ZERO(&m);
CPU_SET(n, &m);
return sched_setaffinity(0, sizeof(cpu_set_t), &m);
}
void f(int n)
{
glue(n % NR_CPUS);
while (*(volatile int *)&xxx == 0) {
}
for (int i = 0; i < R; i++) {
int fd = open(filename, O_RDONLY);
char buf[4096];
ssize_t rv = read(fd, buf, sizeof(buf));
asm volatile ("" :: "g" (rv));
close(fd);
}
}
int main(int argc, char *argv[])
{
if (argc < 4) {
std::cerr << "usage: " << argv[0] << ' ' << "N /proc/filename R
";
return 1;
}
N = atoi(argv[1]);
filename = argv[2];
R = atoi(argv[3]);
for (int i = 0; i < NR_CPUS; i++) {
if (glue(i) == 0)
break;
}
std::vector<std::thread> T;
T.reserve(N);
for (int i = 0; i < N; i++) {
T.emplace_back(f, i);
}
auto t0 = std::chrono::system_clock::now();
{
*(volatile int *)&xxx = 1;
for (auto& t: T) {
t.join();
}
}
auto t1 = std::chrono::system_clock::now();
std::chrono::duration<double> dt = t1 - t0;
std::cout << dt.count() << '
';
return 0;
}
P.S.:
Explicit randomization marker is added because adding non-function pointer
will silently disable structure layout randomization.
[akpm@linux-foundation.org: coding style fixes]
Reported-by: kbuild test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Joe Perches <joe@perches.com>
Link: http://lkml.kernel.org/r/20200222201539.GA22576@avx2
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Only declare _UFFDIO_WRITEPROTECT if the user specified
UFFDIO_REGISTER_MODE_WP and if all the checks passed. Then when the user
registers regions with shmem/hugetlbfs we won't expose the new ioctl to
them. Even with complete anonymous memory range, we'll only expose the
new WP ioctl bit if the register mode has MODE_WP.
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Bobby Powers <bobbypowers@gmail.com>
Cc: Brian Geffon <bgeffon@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: Martin Cracauer <cracauer@cons.org>
Cc: Marty McFadden <mcfadden8@llnl.gov>
Cc: Maya Gokhale <gokhale2@llnl.gov>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Shaohua Li <shli@fb.com>
Link: http://lkml.kernel.org/r/20200220163112.11409-18-peterx@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It does not make sense to try to wake up any waiting thread when we're
write-protecting a memory region. Only wake up when resolving a write
protected page fault.
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Bobby Powers <bobbypowers@gmail.com>
Cc: Brian Geffon <bgeffon@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: Martin Cracauer <cracauer@cons.org>
Cc: Marty McFadden <mcfadden8@llnl.gov>
Cc: Maya Gokhale <gokhale2@llnl.gov>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Shaohua Li <shli@fb.com>
Link: http://lkml.kernel.org/r/20200220163112.11409-16-peterx@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Introduce the new uffd-wp APIs for userspace.
Firstly, we'll allow to do UFFDIO_REGISTER with write protection tracking
using the new UFFDIO_REGISTER_MODE_WP flag. Note that this flag can
co-exist with the existing UFFDIO_REGISTER_MODE_MISSING, in which case the
userspace program can not only resolve missing page faults, and at the
same time tracking page data changes along the way.
Secondly, we introduced the new UFFDIO_WRITEPROTECT API to do page level
write protection tracking. Note that we will need to register the memory
region with UFFDIO_REGISTER_MODE_WP before that.
[peterx@redhat.com: write up the commit message]
[peterx@redhat.com: remove useless block, write commit message, check against
VM_MAYWRITE rather than VM_WRITE when register]
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jerome Glisse <jglisse@redhat.com>
Cc: Bobby Powers <bobbypowers@gmail.com>
Cc: Brian Geffon <bgeffon@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: Martin Cracauer <cracauer@cons.org>
Cc: Marty McFadden <mcfadden8@llnl.gov>
Cc: Maya Gokhale <gokhale2@llnl.gov>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Shaohua Li <shli@fb.com>
Link: http://lkml.kernel.org/r/20200220163112.11409-14-peterx@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Immediate packets should only be sent to peer when there are new
receive credits made available. New credits show up on freeing
receive buffer, not on receiving data.
Fix this by avoid unnenecessary work schedules.
Signed-off-by: Long Li <longli@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
When processing errors from ib_post_send(), the transport state needs to be
rolled back to the condition before the error.
Refactor the old code to make it easy to roll back on IB errors, and fix this.
Signed-off-by: Long Li <longli@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
CIFS uses pre-allocated crypto structures to calculate signatures for both
incoming and outgoing packets. In this way it doesn't need to allocate crypto
structures for every packet, but it requires a lock to prevent concurrent
access to crypto structures.
Remove the lock by allocating crypto structures on the fly for
incoming packets. At the same time, we can still use pre-allocated crypto
structures for outgoing packets, as they are already protected by transport
lock srv_mutex.
Signed-off-by: Long Li <longli@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Recevie credits should be updated before sending the packet, not
before a work is scheduled. Also, the value needs roll back if
something fails and cannot send.
Signed-off-by: Long Li <longli@microsoft.com>
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Sometimes the remote peer may return more send credits than the send queue
depth. If all the send credits are used to post senasd, we may overflow the
send queue.
Fix this by checking the send queue size before posting a send.
Signed-off-by: Long Li <longli@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
As an optimization, SMBD tries to track two types of packets: packets with
payload and without payload. There is no obvious benefit or performance gain
to separately track two types of packets.
Just treat them as pending packets and merge the tracking code.
Signed-off-by: Long Li <longli@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Fix tcon use-after-free and NULL ptr deref.
Customer system crashes with the following kernel log:
[462233.169868] CIFS VFS: Cancelling wait for mid 4894753 cmd: 14 => a QUERY DIR
[462233.228045] CIFS VFS: cifs_put_smb_ses: Session Logoff failure rc=-4
[462233.305922] CIFS VFS: cifs_put_smb_ses: Session Logoff failure rc=-4
[462233.306205] CIFS VFS: cifs_put_smb_ses: Session Logoff failure rc=-4
[462233.347060] CIFS VFS: cifs_put_smb_ses: Session Logoff failure rc=-4
[462233.347107] CIFS VFS: Close unmatched open
[462233.347113] BUG: unable to handle kernel NULL pointer dereference at 0000000000000038
...
[exception RIP: cifs_put_tcon+0xa0] (this is doing tcon->ses->server)
#6 [...] smb2_cancelled_close_fid at ... [cifs]
#7 [...] process_one_work at ...
#8 [...] worker_thread at ...
#9 [...] kthread at ...
The most likely explanation we have is:
* When we put the last reference of a tcon (refcount=0), we close the
cached share root handle.
* If closing a handle is interrupted, SMB2_close() will
queue a SMB2_close() in a work thread.
* The queued object keeps a tcon ref so we bump the tcon
refcount, jumping from 0 to 1.
* We reach the end of cifs_put_tcon(), we free the tcon object despite
it now having a refcount of 1.
* The queued work now runs, but the tcon, ses & server was freed in
the meantime resulting in a crash.
THREAD 1
========
cifs_put_tcon => tcon refcount reach 0
SMB2_tdis
close_shroot_lease
close_shroot_lease_locked => if cached root has lease && refcount = 0
smb2_close_cached_fid => if cached root valid
SMB2_close => retry close in a thread if interrupted
smb2_handle_cancelled_close
__smb2_handle_cancelled_close => !! tcon refcount bump 0 => 1 !!
INIT_WORK(&cancelled->work, smb2_cancelled_close_fid);
queue_work(cifsiod_wq, &cancelled->work) => queue work
tconInfoFree(tcon); ==> freed!
cifs_put_smb_ses(ses); ==> freed!
THREAD 2 (workqueue)
========
smb2_cancelled_close_fid
SMB2_close(0, cancelled->tcon, ...); => use-after-free of tcon
cifs_put_tcon(cancelled->tcon); => tcon refcount reach 0 second time
*CRASH*
Fixes: d919131935 ("CIFS: Close cached root handle only if it has a lease")
Signed-off-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
An earlier commit "io_uring: remove @nxt from handlers" removed the
setting of pointer nxt and now it is always null, hence the non-null
check and call to io_wq_assign_next is redundant and can be removed.
Addresses-Coverity: ("'Constant' variable guard")
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Instead of the various open coded calls to set the NFS_INO_STALE bit
and call nfs_zap_caches(), consolidate them into a single function
nfs_set_inode_stale().
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAl6LFJ0ACgkQnJ2qBz9k
QNkSUQgAzwaescnHeVTF7/Zg9Uj2xrfrTJZ1E+Mn9qnd/0/z/asVV+RKfY7Gnu7h
g19inDI4ZESFz2gWz4jwJD1c2/yMZb8vnae4ye3dtCv2yjG/0JxCeue6vjwsWqmO
4jbSgk8YNQqzwEFVMzNp43ZJr3CFooLCIsJcL8q4yYk8Kt4pDUPmQ1vBvAc6k9vK
BKMBvp926tbomP27nq0n0CjvHy7ipDGMl4H6i4vBxHRfbDPih2x9VEklK3JatC1n
4AKS6IYJrkZVdOjli+DrResbcWxyT4db5tPio5MU0RDnVhNZT2cHyNVXf5EpRJqP
72pa7gfPu1Rx1+tU8bDR/daSveou2A==
=fkCV
-----END PGP SIGNATURE-----
Merge tag 'fsnotify_for_v5.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull fsnotify updates from Jan Kara:
"This implements the fanotify FAN_DIR_MODIFY event.
This event reports the name in a directory under which a change
happened and together with the directory filehandle and fstatat()
allows reliable and efficient implementation of directory
synchronization"
* tag 'fsnotify_for_v5.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
fanotify: Fix the checks in fanotify_fsid_equal
fanotify: report name info for FAN_DIR_MODIFY event
fanotify: record name info for FAN_DIR_MODIFY event
fanotify: Drop fanotify_event_has_fid()
fanotify: prepare to report both parent and child fid's
fanotify: send FAN_DIR_MODIFY event flavor with dir inode and name
fanotify: divorce fanotify_path_event and fanotify_fid_event
fanotify: Store fanotify handles differently
fanotify: Simplify create_fd()
fanotify: fix merging marks masks with FAN_ONDIR
fanotify: merge duplicate events on parent and child
fsnotify: replace inode pointer with an object id
fsnotify: simplify arguments passing to fsnotify_parent()
fsnotify: use helpers to access data by data_type
fsnotify: funnel all dirent events through fsnotify_name()
fsnotify: factor helpers fsnotify_dentry() and fsnotify_file()
fsnotify: tidy up FS_ and FAN_ constants
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAl6LE+QACgkQnJ2qBz9k
QNn0HAf/fz+W/WqZFMEMLoAOJB5ALQa4v1DQGqsSIyKb9GIF8OBhzqNCP2wryx/r
Qa37jTOLEQxsdMIQ3LaDH8RVZOj6GUAhk5RZIgTX2XYcWRdqqhws0a27BAJSPnt9
+uujPccf7rBQJ1aN8+kt8QMDkxo4AX8keHtfT4nbl2c1JNsd2nbjsEsWUBx7R1kb
aoblIklUMfMr5NvKcWakhJoMtST0fyNX4vbV3tePJTTOS1uzatQLJ5ox2WmY6M+q
tTEaKpR/zHMGXegVAlBMFBTJ3NVANcS4MoIYcHlkg5DMOO00XwHH/ZtsJlVsAOuj
FURu2KJ3d4RYMxaKlX+wBscliL2GqQ==
=weEs
-----END PGP SIGNATURE-----
Merge tag 'for_v5.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull ext2/udf updates from Jan Kara:
"Cleanups and fixes for ext2 and one cleanup for udf"
* tag 'for_v5.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
ext2: fix empty body warnings when -Wextra is used
ext2: fix debug reference to ext2_xattr_cache
udf: udf_sb.h: Replace zero-length array with flexible-array member
ext2: xattr.h: Replace zero-length array with flexible-array member
ext2: Silence lockdep warning about reclaim under xattr_sem
- Fix read with O_NONBLOCK to allow incomplete read and return immediately
- Rest is just cleanup (indent, unused field in struct, extra semicolon)
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE/IPbcYBuWt0zoYhOq06b7GqY5nAFAl6LCyEACgkQq06b7GqY
5nAAQg//ThE/rojrB74S8yaD3sJlfsT+BoMzGDeos9FGdsHpO8t1ISUiheAcUn/Y
a6igfJxTefxEO5SWqvsGEousRl0sRTyLo8RuQ2CC7+zKN8Ou+eG+4s2QXfDxBD+p
xJGJrsWD2SkQ98/hkxDY6d31BM9iQGvpvVp/c9qLwtvTlnvEMXll0JiACcnmiPRt
z92bx8QkaPlt399qv7LltfHftxCHOMcieQlYAF3wDjLauiEdDWysiMn0NIC2izwY
hPlsVCiV504yhvQFcXVnmbWVJe88THlBVVjuVAi9IwpU8D2+1UK1kM/yiyVkYml+
U4JC7ZBQoqw2obSLPk5mZ6E7nGWfWnIBpt/B9dO7akP8pS9bv+uCjwg65tFEop5/
z9nuyjoQT/dEoThveeoEoxpIbyd690n1vbhcTdGDFDX5nfglrxj0S6bNVW0k6e41
9aPwWzAfU9Ip9GHwE6Ewf/y88mMI7rCP+o/pC8XkUUGiV+foFt3oO5wCcvGbxwgJ
nnZ5oY+Ren/QXOmq/cN3RUWJvTRbc8TDBm6Fa0iGkk7d1OUBuRTWuBUXvxHMpO3A
xHapDRddPcrZ9QQSJZGvMSYwG0ZUZ6MRL2qTU4X/3sIi0giApFkRMpmmo0HjvMAf
PIFl2MG2Ok4A8yWQN98AMXy8vgs6ql6+Wb3W0b5dNFsAGEk7f9U=
=wAe5
-----END PGP SIGNATURE-----
Merge tag '9p-for-5.7' of git://github.com/martinetd/linux
Pull 9p updates from Dominique Martinet:
"Not much new, but a few patches for this cycle:
- Fix read with O_NONBLOCK to allow incomplete read and return
immediately
- Rest is just cleanup (indent, unused field in struct, extra
semicolon)"
* tag '9p-for-5.7' of git://github.com/martinetd/linux:
net/9p: remove unused p9_req_t aux field
9p: read only once on O_NONBLOCK
9pnet: allow making incomplete read requests
9p: Remove unneeded semicolon
9p: Fix Kconfig indentation
Reflink should force the log out to disk if the filesystem was mounted
with wsync, the same as most other operations in xfs.
[Note: XFS_MOUNT_WSYNC is set when the admin mounts the filesystem
with either the 'wsync' or 'sync' mount options, which effectively means
that we're classifying reflink/dedupe as IO operations and making them
synchronous when required.]
Fixes: 3fc9f5e409 ("xfs: remove xfs_reflink_remap_range")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
[darrick: add more to the changelog]
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Create a new helper to force the log up to the last LSN touching an
inode.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Pull vfs pathwalk fix from Al Viro:
"Dumb braino in legitimize_path()..."
* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fix a braino in legitimize_path()
brown paperbag time... wrong order of arguments ended up confusing
the values to check dentry and mount_lock seqcounts against.
Reported-by: kernel test robot <rong.a.chen@intel.com>
Fixes: 2aa3847085 ("non-RCU analogue of the previous commit")
Tested-by: kernel test robot <rong.a.chen@intel.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
If io_get_req() fails, it drops a ref. Then, awhile keeping @submitted
unmodified, io_submit_sqes() breaks the loop and puts @nr - @submitted
refs. For each submitted req a ref is dropped in io_put_req() and
friends. So, for @nr taken refs there will be
(@nr - @submitted + @submitted + 1) dropped.
Remove ctx refcounting from io_get_req(), that at the same time makes
it clearer.
Fixes: 2b85edfc0c ("io_uring: batch getting pcpu references")
Cc: stable@vger.kernel.org # v5.6
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Commit 9255782f70 ("sysfs: Wrap __compat_only_sysfs_link_entry_to_kobj
function to change the symlink name") made this function a wrapper
around a new non-underscored function, which is a bit odd. The normal
naming convention is the other way around: the underscored function is
the wrappee, and the non-underscored function is the wrapper.
There's only one single user (well, two call-sites in that user) of the
more limited double underscore version of this function, so just remove
the oddly named wrapper entirely and just add the extra NULL argument to
the user.
I considered just doing that in the merge, but that tends to make
history really hard to read.
Link: https://lore.kernel.org/lkml/CAHk-=wgkkmNV5tMzQDmPAQuNJBuMcry--Jb+h8H1o4RA3kF7QQ@mail.gmail.com/
Cc: Sourabh Jain <sourabhjain@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- A large series from Nick for 64-bit to further rework our exception vectors,
and rewrite portions of the syscall entry/exit and interrupt return in C. The
result is much easier to follow code that is also faster in general.
- Cleanup of our ptrace code to split various parts out that had become badly
intertwined with #ifdefs over the years.
- Changes to our NUMA setup under the PowerVM hypervisor which should
hopefully avoid non-sensical topologies which can lead to warnings from the
workqueue code and other problems.
- MAINTAINERS updates to remove some of our old orphan entries and update the
status of others.
- Quite a few other small changes and fixes all over the map.
Thanks to:
Abdul Haleem, afzal mohammed, Alexey Kardashevskiy, Andrew Donnellan, Aneesh
Kumar K.V, Balamuruhan S, Cédric Le Goater, Chen Zhou, Christophe JAILLET,
Christophe Leroy, Christoph Hellwig, Clement Courbet, Daniel Axtens, David
Gibson, Douglas Miller, Fabiano Rosas, Fangrui Song, Ganesh Goudar, Gautham R.
Shenoy, Greg Kroah-Hartman, Greg Kurz, Gustavo Luiz Duarte, Hari Bathini, Ilie
Halip, Jan Kara, Joe Lawrence, Joe Perches, Kajol Jain, Larry Finger,
Laurentiu Tudor, Leonardo Bras, Libor Pechacek, Madhavan Srinivasan, Mahesh
Salgaonkar, Masahiro Yamada, Masami Hiramatsu, Mauricio Faria de Oliveira,
Michael Neuling, Michal Suchanek, Mike Rapoport, Nageswara R Sastry, Nathan
Chancellor, Nathan Lynch, Naveen N. Rao, Nicholas Piggin, Nick Desaulniers,
Oliver O'Halloran, Po-Hsu Lin, Pratik Rajesh Sampat, Rasmus Villemoes, Ravi
Bangoria, Roman Bolshakov, Sam Bobroff, Sandipan Das, Santosh S, Sedat Dilek,
Segher Boessenkool, Shilpasri G Bhat, Sourabh Jain, Srikar Dronamraju, Stephen
Rothwell, Tyrel Datwyler, Vaibhav Jain, YueHaibing.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCAAxFiEEJFGtCPCthwEv2Y/bUevqPMjhpYAFAl6JypATHG1wZUBlbGxl
cm1hbi5pZC5hdQAKCRBR6+o8yOGlgOTyD/0U90tXb3VXlQcc4OFIb8vWIj76k4Zn
ZSZ7RyOuvb5pCISBZjSK79XkR9eMHT77qagX4V41q64k4yQl8nbgLeVnwL76hLLc
IJCs23f4nsO0uqX/MhSCc5dfOOOS2i8V+OQYtsYWsH5QaG95v0cHIqVaHHMlfQxu
507GO/W5W6KTd4x008b5unQOuE51zMKlKvqEJXkT59obQFpaa2S5Wn7OzhsnarCH
YSRNxaC7vtgBKLA9wUnFh8UUbh0FbOwXBCaq4OhHMhgRihdteVBCzlcR/6c+IRbt
EoZxKzfQ0hI1z5f++kJNaRXMtUbSpM8D1HdKKHgiWjpdBSD0eu2X106KQT2R2ZOF
qhX8xPLWNzdBglA6L43AaZUu+4ayd3QrrJIkjDv/K1rCHZjfGOzSQfoZgTEBNLFA
tC0crhEfw8m98e4EwhCtekGQxdczRdLS9YvtC/h6mU2xkpA35yNSwB1/iuVQdkYD
XyrEqImAQ1PJla7NL0hxSy5ZxrBtMeKT4WZZ0BNgKXryemldg8Tuv3AEyach3BHz
eU0pIwpbnPm1JAPyrpDQ1yEf7QsD77gTPfEvilEci60R9DhvIMGAY+pt0qfME3yX
wOLp2yVBEXlRmvHk/y/+r+m4aCsmwSrikbWwmLLwAAA6JehtzFOWxTEfNpACP23V
mZyyZznsHIIE3Q==
=ARdm
-----END PGP SIGNATURE-----
Merge tag 'powerpc-5.7-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc updates from Michael Ellerman:
"Slightly late as I had to rebase mid-week to insert a bug fix:
- A large series from Nick for 64-bit to further rework our exception
vectors, and rewrite portions of the syscall entry/exit and
interrupt return in C. The result is much easier to follow code
that is also faster in general.
- Cleanup of our ptrace code to split various parts out that had
become badly intertwined with #ifdefs over the years.
- Changes to our NUMA setup under the PowerVM hypervisor which should
hopefully avoid non-sensical topologies which can lead to warnings
from the workqueue code and other problems.
- MAINTAINERS updates to remove some of our old orphan entries and
update the status of others.
- Quite a few other small changes and fixes all over the map.
Thanks to: Abdul Haleem, afzal mohammed, Alexey Kardashevskiy, Andrew
Donnellan, Aneesh Kumar K.V, Balamuruhan S, Cédric Le Goater, Chen
Zhou, Christophe JAILLET, Christophe Leroy, Christoph Hellwig, Clement
Courbet, Daniel Axtens, David Gibson, Douglas Miller, Fabiano Rosas,
Fangrui Song, Ganesh Goudar, Gautham R. Shenoy, Greg Kroah-Hartman,
Greg Kurz, Gustavo Luiz Duarte, Hari Bathini, Ilie Halip, Jan Kara,
Joe Lawrence, Joe Perches, Kajol Jain, Larry Finger, Laurentiu Tudor,
Leonardo Bras, Libor Pechacek, Madhavan Srinivasan, Mahesh Salgaonkar,
Masahiro Yamada, Masami Hiramatsu, Mauricio Faria de Oliveira, Michael
Neuling, Michal Suchanek, Mike Rapoport, Nageswara R Sastry, Nathan
Chancellor, Nathan Lynch, Naveen N. Rao, Nicholas Piggin, Nick
Desaulniers, Oliver O'Halloran, Po-Hsu Lin, Pratik Rajesh Sampat,
Rasmus Villemoes, Ravi Bangoria, Roman Bolshakov, Sam Bobroff,
Sandipan Das, Santosh S, Sedat Dilek, Segher Boessenkool, Shilpasri G
Bhat, Sourabh Jain, Srikar Dronamraju, Stephen Rothwell, Tyrel
Datwyler, Vaibhav Jain, YueHaibing"
* tag 'powerpc-5.7-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (158 commits)
powerpc: Make setjmp/longjmp signature standard
powerpc/cputable: Remove unnecessary copy of cpu_spec->oprofile_type
powerpc: Suppress .eh_frame generation
powerpc: Drop -fno-dwarf2-cfi-asm
powerpc/32: drop unused ISA_DMA_THRESHOLD
powerpc/powernv: Add documentation for the opal sensor_groups sysfs interfaces
selftests/powerpc: Fix try-run when source tree is not writable
powerpc/vmlinux.lds: Explicitly retain .gnu.hash
powerpc/ptrace: move ptrace_triggered() into hw_breakpoint.c
powerpc/ptrace: create ppc_gethwdinfo()
powerpc/ptrace: create ptrace_get_debugreg()
powerpc/ptrace: split out ADV_DEBUG_REGS related functions.
powerpc/ptrace: move register viewing functions out of ptrace.c
powerpc/ptrace: split out TRANSACTIONAL_MEM related functions.
powerpc/ptrace: split out SPE related functions.
powerpc/ptrace: split out ALTIVEC related functions.
powerpc/ptrace: split out VSX related functions.
powerpc/ptrace: drop PARAMETER_SAVE_AREA_OFFSET
powerpc/ptrace: drop unnecessary #ifdefs CONFIG_PPC64
powerpc/ptrace: remove unused header includes
...
2) Clean up extent tree handling.
3) Other cleanups and miscellaneous bug fixes
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAl6JWT0ACgkQ8vlZVpUN
gaMTOQgAo7bkkyc5+FCbpIUejle1DvM9Hg32fG1p/I5XH771RJRlGURVcx4FgsAZ
TOiP4R4csADWcfUWnky/MK1pubVPLq2x7c0OyYnyOENM5PsLufeIa1EGjKBrGYeE
6et4zyqlB7bIOUkVarluSp0j3YklxGrg2k45cnVFp572W5FaWiwpIHL/qNcwvbv0
NIDyB+tW16tlmKINT/YLnDMRAOagoJoQwgNYwQPk+qVAvKWOUT3YQH0KXhXIRaMp
M2HU9lUqSXT5F2uNNypeRaqwoFayYV7tVnaSoht30fFk04Eg+RLTyhCUWSVR2EHW
zBjKNy9l3KQFTI8eriguGcLNy8cxig==
=nLff
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 updates from Ted Ts'o:
- Replace ext4's bmap and iopoll implementations to use iomap.
- Clean up extent tree handling.
- Other cleanups and miscellaneous bug fixes
* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (31 commits)
ext4: save all error info in save_error_info() and drop ext4_set_errno()
ext4: fix incorrect group count in ext4_fill_super error message
ext4: fix incorrect inodes per group in error message
ext4: don't set dioread_nolock by default for blocksize < pagesize
ext4: disable dioread_nolock whenever delayed allocation is disabled
ext4: do not commit super on read-only bdev
ext4: avoid ENOSPC when avoiding to reuse recently deleted inodes
ext4: unregister sysfs path before destroying jbd2 journal
ext4: check for non-zero journal inum in ext4_calculate_overhead
ext4: remove map_from_cluster from ext4_ext_map_blocks
ext4: clean up ext4_ext_insert_extent() call in ext4_ext_map_blocks()
ext4: mark block bitmap corrupted when found instead of BUGON
ext4: use flexible-array member for xattr structs
ext4: use flexible-array member in struct fname
Documentation: correct the description of FIEMAP_EXTENT_LAST
ext4: move ext4_fiemap to use iomap framework
ext4: make ext4_ind_map_blocks work with fiemap
ext4: move ext4 bmap to use iomap infrastructure
ext4: optimize ext4_ext_precache for 0 depth
ext4: add IOMAP_F_MERGED for non-extent based mapping
...
- Fix EXCHANGE_ID response when NFSD runs in a container
- A battery of new static trace points
- Socket transports now use bio_vec to send Replies
- NFS/RDMA now supports filesystems with no .splice_read method
- Favor memcpy() over DMA mapping for small RPC/RDMA Replies
- Add pre-requisites for supporting multiple Write chunks
- Numerous minor fixes and clean-ups
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIcBAABAgAGBQJegj9pAAoJEDNqszNvZn+XNGgP/RsRul/UGe70YoPS6AwxI+c1
2JVni5LV83aVGSN1df/xRNdugWh4j8e8stBIJPCnWFzUERFvrzVeVyW0/dlIy37l
SRL1L62EzFUejAL45O+CkF5+KI2kAWMgDCv+rPnFnIuXVa/sThx63F1AJikVMPjB
7We3vd5Kh/CrMeMflebJYuY12xE6di2b3ifkZRO0/yuMaAuqJrDreYf4L6xpA4rC
QnKQcNl7LGlOwGSI2WvDrCLE056PJFhTuzTawI80NKnkXMMFNc6/7NoXJqasVlHG
fiki2mHbJrbYd8isIm3Vl/QkFsM8QjijtpVxC9gd151w0P7DfpMYmSzlZL7nvq/R
Nt6IIqbaxWSS1VULsuS7rDtBwwZpW/LRWaUhEvMwimR2jeOxcwtlDVTX/dRH2mxq
Ume64Hn8xMEhhx9tHCPQ+Rgjqv5m+ZEAvmV6B7RM9nT2z2MSzQQESeMB14VZZmF/
2oH1dDCVdCmb4ZOcD5yxL6Y1hijn45s+YHdts9uIsCudKYPI906vPhogFC+PVJv+
MrOiUf8d40H0ra8VAUFCjAceOulkv90aLhBjoHbPsP4SQOTsRuUXnsKESZpSHY72
nT/uPM23ULv4kQ6tHB8yQ3ordjCBRgb4zIKtotc3Wpi7dhO8u6ptPj4soiflRShO
8/3N5dYfqdt9FRyr7Z8/
=o5G0
-----END PGP SIGNATURE-----
Merge tag 'nfsd-5.7' of git://git.linux-nfs.org/projects/cel/cel-2.6
Pull nfsd updates from Chuck Lever:
- Fix EXCHANGE_ID response when NFSD runs in a container
- A battery of new static trace points
- Socket transports now use bio_vec to send Replies
- NFS/RDMA now supports filesystems with no .splice_read method
- Favor memcpy() over DMA mapping for small RPC/RDMA Replies
- Add pre-requisites for supporting multiple Write chunks
- Numerous minor fixes and clean-ups
[ Chuck is filling in for Bruce this time while he and his family settle
into a new house ]
* tag 'nfsd-5.7' of git://git.linux-nfs.org/projects/cel/cel-2.6: (39 commits)
svcrdma: Fix leak of transport addresses
SUNRPC: Fix a potential buffer overflow in 'svc_print_xprts()'
SUNRPC/cache: don't allow invalid entries to be flushed
nfsd: fsnotify on rmdir under nfsd/clients/
nfsd4: kill warnings on testing stateids with mismatched clientids
nfsd: remove read permission bit for ctl sysctl
NFSD: Fix NFS server build errors
sunrpc: Add tracing for cache events
SUNRPC/cache: Allow garbage collection of invalid cache entries
nfsd: export upcalls must not return ESTALE when mountd is down
nfsd: Add tracepoints for update of the expkey and export cache entries
nfsd: Add tracepoints for exp_find_key() and exp_get_by_name()
nfsd: Add tracing to nfsd_set_fh_dentry()
nfsd: Don't add locks to closed or closing open stateids
SUNRPC: Teach server to use xprt_sock_sendmsg for socket sends
SUNRPC: Refactor xs_sendpages()
svcrdma: Avoid DMA mapping small RPC Replies
svcrdma: Fix double sync of transport header buffer
svcrdma: Refactor chunk list encoders
SUNRPC: Add encoders for list item discriminators
...
When we're sending a layoutreturn, ensure that we reference the
layout cred atomically with the copy of the stateid.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
When we look up the delegation cred, we are usually doing so in
conjunction with a read of the stateid, and we want to ensure
that the look up is atomic with that read.
Fixes: 57f188e047 ("NFSv4: nfs_update_inplace_delegation() should update delegation cred")
[sfr@canb.auug.org.au: Fixed up borken Fixes: line from Trond :-)]
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
A request that completes with an -EAGAIN result after it has been added
to the poll list, will not be removed from that list in io_do_iopoll()
because the f_op->iopoll() will not succeed for that request.
Maintain a retryable local list similar to the done list, and explicity
reissue requests with an -EAGAIN result.
Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Here are 3 SPDX patches for 5.7-rc1.
One fixes up the SPDX tag for a single driver, while the other two go
through the tree and add SPDX tags for all of the .gitignore files as
needed.
Nothing too complex, but you will get a merge conflict with your current
tree, that should be trivial to handle (one file modified by two things,
one file deleted.)
All 3 of these have been in linux-next for a while, with no reported
issues other than the merge conflict.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCXodg5A8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+ykySQCgy9YDrkz7nWq6v3Gohl6+lW/L+rMAnRM4uTZm
m5AuCzO3Azt9KBi7NL+L
=2Lm5
-----END PGP SIGNATURE-----
Merge tag 'spdx-5.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/spdx
Pull SPDX updates from Greg KH:
"Here are three SPDX patches for 5.7-rc1.
One fixes up the SPDX tag for a single driver, while the other two go
through the tree and add SPDX tags for all of the .gitignore files as
needed.
Nothing too complex, but you will get a merge conflict with your
current tree, that should be trivial to handle (one file modified by
two things, one file deleted.)
All three of these have been in linux-next for a while, with no
reported issues other than the merge conflict"
* tag 'spdx-5.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/spdx:
ASoC: MT6660: make spdxcheck.py happy
.gitignore: add SPDX License Identifier
.gitignore: remove too obvious comments
We already checked this limit when the file was opened, and we keep it
open in the file table. Hence when we added unit_inflight to the count
we want to register, we're doubly accounting these files. This results
in -EMFILE for file registration, if we're at half the limit.
Cc: stable@vger.kernel.org # v5.1+
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull cgroup updates from Tejun Heo:
- Christian extended clone3 so that processes can be spawned into
cgroups directly.
This is not only neat in terms of semantics but also avoids grabbing
the global cgroup_threadgroup_rwsem for migration.
- Daniel added !root xattr support to cgroupfs.
Userland already uses xattrs on cgroupfs for bookkeeping. This will
allow delegated cgroups to support such usages.
- Prateek tried to make cpuset hotplug handling synchronous but that
led to possible deadlock scenarios. Reverted.
- Other minor changes including release_agent_path handling cleanup.
* 'for-5.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
docs: cgroup-v1: Document the cpuset_v2_mode mount option
Revert "cpuset: Make cpuset hotplug synchronous"
cgroupfs: Support user xattrs
kernfs: Add option to enable user xattrs
kernfs: Add removed_size out param for simple_xattr_set
kernfs: kvmalloc xattr value instead of kmalloc
cgroup: Restructure release_agent_path handling
selftests/cgroup: add tests for cloning into cgroups
clone3: allow spawning processes into cgroups
cgroup: add cgroup_may_write() helper
cgroup: refactor fork helpers
cgroup: add cgroup_get_from_file() helper
cgroup: unify attach permission checking
cpuset: Make cpuset hotplug synchronous
cgroup.c: Use built-in RCU list checking
kselftest/cgroup: add cgroup destruction test
cgroup: Clean up css_set task traversal
If the original task is (or has) exited, then the task work will not get
queued properly. Allow for using the io-wq manager task to queue this
work for execution, and ensure that the io-wq manager notices and runs
this work if woken up (or exiting).
Reported-by: Dan Melnic <dmm@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We can have a task exit if it's not the owner of the ring. Be safe and
grab an actual reference to it, to avoid a potential use-after-free.
Reported-by: Dan Melnic <dmm@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If we get woken and the poll doesn't match our mask, re-add the task
to the poll waitqueue and try again instead of completing the request
with a mask of 0.
Reported-by: Dan Melnic <dmm@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We can keep compressed inode's data inline before inline conversion.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
It needs to call f2fs_disable_compressed_file() to disable
compression on directory.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Compression sysfs node should not be shown if f2fs module disables
compression feature.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
While checking discard timeout, we use specified type
UMOUNT_DISCARD_TIMEOUT, so just replace doplicy.timeout with
it, and switch doplicy.timeout to bool type.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In below error path, tpages[i] could be NULL, fix to check it before
releasing it.
- f2fs_read_multi_pages
- f2fs_alloc_dic
- f2fs_free_dic
Fixes: 61fbae2b2b ("f2fs: fix to avoid NULL pointer dereference")
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
fstest reports below message when compression is on:
generic/424 1s ... - output mismatch
--- tests/generic/424.out
+++ results/generic/424.out.bad
@@ -1,2 +1,26 @@
QA output created by 424
+[!] Attribute compressed should be set
+Failed
+stat_test failed
+[!] Attribute compressed should be set
+Failed
+stat_test failed
We missed to set STATX_ATTR_COMPRESSED on compressed inode in getattr(),
fix it.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Add zstd compress algorithm support, use "compress_algorithm=zstd"
mountoption to enable it.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Add a helper dax_ioamp_zero() to zero a range. This patch basically
merges __dax_zero_page_range() and iomap_dax_zero().
Suggested-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20200228163456.1587-7-vgoyal@redhat.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Use new dax native zero page method for zeroing page if I/O is page
aligned. Otherwise fall back to direct_access() + memcpy().
This gets rid of one of the depenendency on block device in dax path.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Link: https://lore.kernel.org/r/20200228163456.1587-6-vgoyal@redhat.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Setting nfs_mountpoint_expiry_timeout() to a negative value stops
mountpoint expiration, while setting it to a positive value restarts
the scheduler.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
We must not return from nfs_d_automount() without holding 2 references
to the mount record. Doing so, will trigger the BUG() in finish_automount().
Also ensure that we don't try to reschedule the automount timer with
a negative or zero timeout value.
Fixes: 22a1ae9a93 ("NFS: If nfs_mountpoint_expiry_timeout < 0, do not expire submounts")
Cc: stable@vger.kernel.org # v5.5+
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
nfs_vers_tokens, nfs_xprt_protocol_tokens, and nfs_secflavor_tokens were
all missing an empty item at the end of the array, allowing
lookup_constant() to potentially walk off the end and trigger and oops.
Reported-by: Olga Kornievskaia <aglo@umich.edu>
Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Fixes: e38bb238ed ("NFS: Convert mount option parsing to use functionality from fs_parser.h")
Cc: stable@vger.kernel.org # v5.6
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Merge updates from Andrew Morton:
"A large amount of MM, plenty more to come.
Subsystems affected by this patch series:
- tools
- kthread
- kbuild
- scripts
- ocfs2
- vfs
- mm: slub, kmemleak, pagecache, gup, swap, memcg, pagemap, mremap,
sparsemem, kasan, pagealloc, vmscan, compaction, mempolicy,
hugetlbfs, hugetlb"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (155 commits)
include/linux/huge_mm.h: check PageTail in hpage_nr_pages even when !THP
mm/hugetlb: fix build failure with HUGETLB_PAGE but not HUGEBTLBFS
selftests/vm: fix map_hugetlb length used for testing read and write
mm/hugetlb: remove unnecessary memory fetch in PageHeadHuge()
mm/hugetlb.c: clean code by removing unnecessary initialization
hugetlb_cgroup: add hugetlb_cgroup reservation docs
hugetlb_cgroup: add hugetlb_cgroup reservation tests
hugetlb: support file_region coalescing again
hugetlb_cgroup: support noreserve mappings
hugetlb_cgroup: add accounting for shared mappings
hugetlb: disable region_add file_region coalescing
hugetlb_cgroup: add reservation accounting for private mappings
mm/hugetlb_cgroup: fix hugetlb_cgroup migration
hugetlb_cgroup: add interface for charge/uncharge hugetlb reservations
hugetlb_cgroup: add hugetlb_cgroup reservation counter
hugetlbfs: Use i_mmap_rwsem to address page fault/truncate race
hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization
mm/memblock.c: remove redundant assignment to variable max_addr
mm: mempolicy: require at least one nodeid for MPOL_PREFERRED
mm: mempolicy: use VM_BUG_ON_VMA in queue_pages_test_walk()
...
- Fix a hard to trigger race between iclog error checking and log shutdown.
- Strengthen the AGF verifier.
- Ratelimit some of the more spammy error messages.
- Remove the icdinode uid/gid members and just use the ones in the vfs inode.
- Hold ILOCK across insert/collapse range.
- Clean up the extended attribute interfaces.
- Clean up the attr flags mess.
- Restore PF_MEMALLOC after exiting xfsaild thread to avoid triggering
warnings in the process accounting code.
- Remove the flexibly-sized array from struct xfs_agfl to eliminate
compiler warnings about unaligned pointers and packed structures.
- Various macro and typedef removals.
- Stale metadata buffers if we decide they're corrupt outside of a
verifier.
- Check directory data/block/free block owners.
- Fix a UAF when aborting inactivation of a corrupt xattr fork.
- Teach online scrub to report failed directory and attr name lookups
as a metadata corruption instead of a runtime error.
- Avoid potential buffer overflows in sysfs files by using scnprintf.
- Fix a regression in getdents lookups due to a mistake in pointer
arithmetic.
- Refactor btree cursor private data structures to use anonymous
unions.
- Cleanups in the log unmounting code.
- Fix a potential mishandling of ENOMEM errors on multi-block directory
buffer lookups.
- Fix an incorrect test in the block allocation code.
- Cleanups and name prefix shortening in the scrub code.
- Introduce btree bulk loading code for online repair and scrub.
- Fix a quotaoff log item leak (and hang) when the fs goes down midway
through a quotaoff operation.
- Remove di_version from the incore inode.
- Refactor some of the log shutdown checking code.
- Record the forcing of the log unmount records in the log force
counters.
- Fix a longstanding bug where quotacheck would purge the
administrator's default quota grace interval and warning limits.
- Reduce memory usage when scrubbing directory and xattr trees.
- Don't let fsfreeze race with GETFSMAP or online scrub.
- Handle bio_add_page failures more gracefully in xlog_write_iclog.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAl58z3AACgkQ+H93GTRK
tOukrRAAhJmowV5+Req5YMYawRjafkIbCDH3WlFy9AdpFFA6pXSfX6YCtIKwKfq8
+yRj/BFRGoMc6SouXo+J0i3YMS6yQZTjcmVWrQPVnj/+DGVjh+Y70gKExtz2CyjO
ItGGxpRwOhpw49zVYmcH6Mrw8sBztHR0VsM0cq6YfJrkNcm0BsnAC+W6zQNaDG24
UO1ivehBOooVh0C8pv0smVcPtBL2N+RRyS3XRT5hGFozUJgLLGDqnHAl1d+KOrWp
hPQhUlDw9luiHPBxWkxUuFDr79gjUi7kyHILNt7TIkByyRcTUO9jhS2VpZd4oXlj
/J3i1AS+9lhP1yGVxw2RHQhKMvdYBQiLADSCpzkA1dMma99cFGyzMMA6rG0WRMJ4
erXxhAEoM4um3gxDka6+HJxySLOT8E22FesJbn6YIv4QSAkXDBPWz/9hPbjJuJQm
6Y/YkFOZLp3c+xJM0tpCWxWaWW7A+t2OMRIFISSsXesrySpalpbkVXkHwz3NwO6L
3SeTnLWqnADbjl2qsuyF0uYHqURygVz7g+r4X7AO5D1IRyCCkmtDOuwumxERiQ3p
3vZMQrWh+y3SgRiF8brDG5KTshhxcinKdHEYXrwq3XgaHZg4mtLI4XjOyZlJruoX
MGWhZjga6+RGysH0RKjZbHaMr/f4m3X00SHa/5Ibcp6Q21TIx6M=
=8iJB
-----END PGP SIGNATURE-----
Merge tag 'xfs-5.7-merge-8' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs updates from Darrick Wong:
"There's a lot going on this cycle with cleanups in the log code, the
btree code, and the xattr code.
We're tightening of metadata validation and online fsck checking, and
introducing a common btree rebuilding library so that we can refactor
xfs_repair and introduce online repair in a future cycle.
We also fixed a few visible bugs -- most notably there's one in
getdents that we introduced in 5.6; and a fix for hangs when disabling
quotas.
This series has been running fstests & other QA in the background for
over a week and looks good so far.
I anticipate sending a second pull request next week. That batch will
change how xfs interacts with memory reclaim; how the log batches and
throttles log items; how hard writes near ENOSPC will try to squeeze
more space out of the filesystem; and hopefully fix the last of the
umount hangs after a catastrophic failure. That should ease a lot of
problems when running at the limits, but for now I'm leaving that in
for-next for another week to make sure we got all the subtleties
right.
Summary:
- Fix a hard to trigger race between iclog error checking and log
shutdown.
- Strengthen the AGF verifier.
- Ratelimit some of the more spammy error messages.
- Remove the icdinode uid/gid members and just use the ones in the
vfs inode.
- Hold ILOCK across insert/collapse range.
- Clean up the extended attribute interfaces.
- Clean up the attr flags mess.
- Restore PF_MEMALLOC after exiting xfsaild thread to avoid
triggering warnings in the process accounting code.
- Remove the flexibly-sized array from struct xfs_agfl to eliminate
compiler warnings about unaligned pointers and packed structures.
- Various macro and typedef removals.
- Stale metadata buffers if we decide they're corrupt outside of a
verifier.
- Check directory data/block/free block owners.
- Fix a UAF when aborting inactivation of a corrupt xattr fork.
- Teach online scrub to report failed directory and attr name lookups
as a metadata corruption instead of a runtime error.
- Avoid potential buffer overflows in sysfs files by using scnprintf.
- Fix a regression in getdents lookups due to a mistake in pointer
arithmetic.
- Refactor btree cursor private data structures to use anonymous
unions.
- Cleanups in the log unmounting code.
- Fix a potential mishandling of ENOMEM errors on multi-block
directory buffer lookups.
- Fix an incorrect test in the block allocation code.
- Cleanups and name prefix shortening in the scrub code.
- Introduce btree bulk loading code for online repair and scrub.
- Fix a quotaoff log item leak (and hang) when the fs goes down
midway through a quotaoff operation.
- Remove di_version from the incore inode.
- Refactor some of the log shutdown checking code.
- Record the forcing of the log unmount records in the log force
counters.
- Fix a longstanding bug where quotacheck would purge the
administrator's default quota grace interval and warning limits.
- Reduce memory usage when scrubbing directory and xattr trees.
- Don't let fsfreeze race with GETFSMAP or online scrub.
- Handle bio_add_page failures more gracefully in xlog_write_iclog"
* tag 'xfs-5.7-merge-8' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (108 commits)
xfs: prohibit fs freezing when using empty transactions
xfs: shutdown on failure to add page to log bio
xfs: directory bestfree check should release buffers
xfs: drop all altpath buffers at the end of the sibling check
xfs: preserve default grace interval during quotacheck
xfs: remove xlog_state_want_sync
xfs: move the ioerror check out of xlog_state_clean_iclog
xfs: refactor xlog_state_clean_iclog
xfs: remove the aborted parameter to xlog_state_done_syncing
xfs: simplify log shutdown checking in xfs_log_release_iclog
xfs: simplify the xfs_log_release_iclog calling convention
xfs: factor out a xlog_wait_on_iclog helper
xfs: merge xlog_cil_push into xlog_cil_push_work
xfs: remove the di_version field from struct icdinode
xfs: simplify a check in xfs_ioctl_setattr_check_cowextsize
xfs: simplify di_flags2 inheritance in xfs_ialloc
xfs: only check the superblock version for dinode size calculation
xfs: add a new xfs_sb_version_has_v3inode helper
xfs: fix unmount hang and memory leak on shutdown during quotaoff
xfs: factor out quotaoff intent AIL removal and memory free
...
- Fix a regression where we broke the userspace hibernation driver by
disallowing writes to the swap device.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAl542q0ACgkQ+H93GTRK
tOun8A//QIAvIuMQl9k/S4lDqvVNAmSMJDdp0v3x+BOMBDmbqJeDO+D9u59nVWAP
zun1Zp3weO7v8kMBPDyvTVhKP0Z9v8ogQj4yT22W0YiBsKgsaqM9tupc3NPm036V
oPusLFC44RRXyLZSjBhNr3xYTBqeGmJMKBUGrnwYeQK2g87o3gi8s9KmVq3olp/L
W/ZvFgmTl4FpbA1aNaMtZ1YBawu9wyQDvmtZtnD7xuXGKGsQjGUt20P7yuFu2Mb8
vmUHNcCBG29j8Fwd+6Gub2Jg25BhLGBSjftLHcGdG8aRN4Y5DQ3w+rBwUD7fHQmi
u0DXMnPIP8twsQPKwWabfZ3PMqyfUiz5rSnJGGd+T7uPP5xYvpKhYGm8IBPQb699
2LY4NZKQqp9IWSbwmU7jSwCEl0x/GDMflF3frpfTmvCDvpW7TUQf8lJsVsF6OcNP
uGJPz3AoE5ebt2XD+IWCurWrfn/nGnxp9ZEKjK69nm3BFXI0GRdqBq6lueBsh6re
zKUoFp7IHIb4ET6V21JPq9iyKUlKLqgb+rpqcA4CwA4tvJZkZXVlYOdwi1CWse3o
8o9xaDmW1murvc0XrnQu8Z8way1nkUYBBkhRJsHCdy8Qn2xA3fyVumZGichNNccO
Mzu8+IKttCTBK8ZY6iAXsjbL61eR1vrr3GGbz4kh6dZy5fBX99c=
=y7GN
-----END PGP SIGNATURE-----
Merge tag 'vfs-5.7-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull hibernation fix from Darrick Wong:
"Fix a regression where we broke the userspace hibernation driver by
disallowing writes to the swap device"
* tag 'vfs-5.7-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
hibernate: Allow uswsusp to write to swap
- Fix a broken tracepoint
- Fix a broken comment
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAl5yOLYACgkQ+H93GTRK
tOueKA/+O83EV/a8oI5/EEM95uoe5JRC4VUaCj3iDyMVEefQC8Phl3pUraQoG/z3
d511bTvca70rBSYj9HaSoa3Wt3MTJTPy9RIxBIKaaumzTXMe4erfXZ/JRJIf5DSl
JZs+Ujng5ythLR6ederr/Kn7LXg5MGax5I4j41JHCDiRp5xFaqRN+4Rib18jj4Lf
WNlbwblixWzCeEeHhs+CNm9G7cy+wt84Qv0CTJjC/nThtDc2tOVFkI3EODX0oZH5
R+KEbEStHebI9AZ0NIfDagoHQ926ROjAEpJn4VS/tOYbN1favrSE5RTnaNZ0asln
c6XnGIPvpHcjsFztNFEuPXS9rP6aGNLic+P+V5TADjzY3M0TCXnIsYXFhEV23HNe
2hprg6T7+KvcAaTdq3t5k2jjYC9AEiaxBLIJRGyTwYKmJPobvTXU9QTyJpA3KX3u
mCAo8jJzl3XHcMrCcXA3xBVS8e23cAkQN5gwpvMIfOeGRr006iFAJS5MZ23Rqyk8
k3w0V4vVWW3QnLYwJI+cGKanuhFrZ7PNulnqC4AmUdYpeyhmWoH2e1mYgU3JKNm9
vF4zad/MYhp0xfEUopsLEE+F0sj4v1yUKEvPSoXxGYSycf9/KFa3/MQvF3akaOeX
kQu29pzDeErFsdLzm+qQmt/xFHuSA5WP6ShO+0XC9uzhN/ezRNg=
=2wnO
-----END PGP SIGNATURE-----
Merge tag 'iomap-5.7-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull iomap updates from Darrick Wong:
"We're fixing tracepoints and comments in this cycle, so there
shouldn't be any surprises here.
I anticipate sending a second pull request next week with a single bug
fix for readahead, but it's still undergoing QA.
Summary:
- Fix a broken tracepoint
- Fix a broken comment"
* tag 'iomap-5.7-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
iomap: fix comments in iomap_dio_rw
iomap: Remove pgoff from tracepoints
Pull vfs pathwalk sanitizing from Al Viro:
"Massive pathwalk rewrite and cleanups.
Several iterations have been posted; hopefully this thing is getting
readable and understandable now. Pretty much all parts of pathname
resolutions are affected...
The branch is identical to what has sat in -next, except for commit
message in "lift all calls of step_into() out of follow_dotdot/
follow_dotdot_rcu", crediting Qian Cai for reporting the bug; only
commit message changed there."
* 'work.dotdot1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (69 commits)
lookup_open(): don't bother with fallbacks to lookup+create
atomic_open(): no need to pass struct open_flags anymore
open_last_lookups(): move complete_walk() into do_open()
open_last_lookups(): lift O_EXCL|O_CREAT handling into do_open()
open_last_lookups(): don't abuse complete_walk() when all we want is unlazy
open_last_lookups(): consolidate fsnotify_create() calls
take post-lookup part of do_last() out of loop
link_path_walk(): sample parent's i_uid and i_mode for the last component
__nd_alloc_stack(): make it return bool
reserve_stack(): switch to __nd_alloc_stack()
pick_link(): take reserving space on stack into a new helper
pick_link(): more straightforward handling of allocation failures
fold path_to_nameidata() into its only remaining caller
pick_link(): pass it struct path already with normal refcounting rules
fs/namei.c: kill follow_mount()
non-RCU analogue of the previous commit
helper for mount rootwards traversal
follow_dotdot(): be lazy about changing nd->path
follow_dotdot_rcu(): be lazy about changing nd->path
follow_dotdot{,_rcu}(): massage loops
...
Pull exec/proc updates from Eric Biederman:
"This contains two significant pieces of work: the work to sort out
proc_flush_task, and the work to solve a deadlock between strace and
exec.
Fixing proc_flush_task so that it no longer requires a persistent
mount makes improvements to proc possible. The removal of the
persistent mount solves an old regression that that caused the hidepid
mount option to only work on remount not on mount. The regression was
found and reported by the Android folks. This further allows Alexey
Gladkov's work making proc mount options specific to an individual
mount of proc to move forward.
The work on exec starts solving a long standing issue with exec that
it takes mutexes of blocking userspace applications, which makes exec
extremely deadlock prone. For the moment this adds a second mutex with
a narrower scope that handles all of the easy cases. Which makes the
tricky cases easy to spot. With a little luck the code to solve those
deadlocks will be ready by next merge window"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (25 commits)
signal: Extend exec_id to 64bits
pidfd: Use new infrastructure to fix deadlocks in execve
perf: Use new infrastructure to fix deadlocks in execve
proc: io_accounting: Use new infrastructure to fix deadlocks in execve
proc: Use new infrastructure to fix deadlocks in execve
kernel/kcmp.c: Use new infrastructure to fix deadlocks in execve
kernel: doc: remove outdated comment cred.c
mm: docs: Fix a comment in process_vm_rw_core
selftests/ptrace: add test cases for dead-locks
exec: Fix a deadlock in strace
exec: Add exec_update_mutex to replace cred_guard_mutex
exec: Move exec_mmap right after de_thread in flush_old_exec
exec: Move cleanup of posix timers on exec out of de_thread
exec: Factor unshare_sighand out of de_thread and call it separately
exec: Only compute current once in flush_old_exec
pid: Improve the comment about waiting in zap_pid_ns_processes
proc: Remove the now unnecessary internal mount of proc
uml: Create a private mount of proc for mconsole
uml: Don't consult current to find the proc_mnt in mconsole_proc
proc: Use a list of inodes to flush from proc
...
hugetlbfs page faults can race with truncate and hole punch operations.
Current code in the page fault path attempts to handle this by 'backing
out' operations if we encounter the race. One obvious omission in the
current code is removing a page newly added to the page cache. This is
pretty straight forward to address, but there is a more subtle and
difficult issue of backing out hugetlb reservations. To handle this
correctly, the 'reservation state' before page allocation needs to be
noted so that it can be properly backed out. There are four distinct
possibilities for reservation state: shared/reserved, shared/no-resv,
private/reserved and private/no-resv. Backing out a reservation may
require memory allocation which could fail so that needs to be taken
into account as well.
Instead of writing the required complicated code for this rare
occurrence, just eliminate the race. i_mmap_rwsem is now held in read
mode for the duration of page fault processing. Hold i_mmap_rwsem in
write mode when modifying i_size. In this way, truncation can not
proceed when page faults are being processed. In addition, i_size
will not change during fault processing so a single check can be made
to ensure faults are not beyond (proposed) end of file. Faults can
still race with hole punch, but that race is handled by existing code
and the use of hugetlb_fault_mutex.
With this modification, checks for races with truncation in the page
fault path can be simplified and removed. remove_inode_hugepages no
longer needs to take hugetlb_fault_mutex in the case of truncation.
Comments are expanded to explain reasoning behind locking.
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
Link: http://lkml.kernel.org/r/20200316205756.146666-3-mike.kravetz@oracle.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "hugetlbfs: use i_mmap_rwsem for more synchronization", v2.
While discussing the issue with huge_pte_offset [1], I remembered that
there were more outstanding hugetlb races. These issues are:
1) For shared pmds, huge PTE pointers returned by huge_pte_alloc can become
invalid via a call to huge_pmd_unshare by another thread.
2) hugetlbfs page faults can race with truncation causing invalid global
reserve counts and state.
A previous attempt was made to use i_mmap_rwsem in this manner as
described at [2]. However, those patches were reverted starting with [3]
due to locking issues.
To effectively use i_mmap_rwsem to address the above issues it needs to be
held (in read mode) during page fault processing. However, during fault
processing we need to lock the page we will be adding. Lock ordering
requires we take page lock before i_mmap_rwsem. Waiting until after
taking the page lock is too late in the fault process for the
synchronization we want to do.
To address this lock ordering issue, the following patches change the lock
ordering for hugetlb pages. This is not too invasive as hugetlbfs
processing is done separate from core mm in many places. However, I don't
really like this idea. Much ugliness is contained in the new routine
hugetlb_page_mapping_lock_write() of patch 1.
The only other way I can think of to address these issues is by catching
all the races. After catching a race, cleanup, backout, retry ... etc,
as needed. This can get really ugly, especially for huge page
reservations. At one time, I started writing some of the reservation
backout code for page faults and it got so ugly and complicated I went
down the path of adding synchronization to avoid the races. Any other
suggestions would be welcome.
[1] https://lore.kernel.org/linux-mm/1582342427-230392-1-git-send-email-longpeng2@huawei.com/
[2] https://lore.kernel.org/linux-mm/20181222223013.22193-1-mike.kravetz@oracle.com/
[3] https://lore.kernel.org/linux-mm/20190103235452.29335-1-mike.kravetz@oracle.com
[4] https://lore.kernel.org/linux-mm/1584028670.7365.182.camel@lca.pw/
[5] https://lore.kernel.org/lkml/20200312183142.108df9ac@canb.auug.org.au/
This patch (of 2):
While looking at BUGs associated with invalid huge page map counts, it was
discovered and observed that a huge pte pointer could become 'invalid' and
point to another task's page table. Consider the following:
A task takes a page fault on a shared hugetlbfs file and calls
huge_pte_alloc to get a ptep. Suppose the returned ptep points to a
shared pmd.
Now, another task truncates the hugetlbfs file. As part of truncation, it
unmaps everyone who has the file mapped. If the range being truncated is
covered by a shared pmd, huge_pmd_unshare will be called. For all but the
last user of the shared pmd, huge_pmd_unshare will clear the pud pointing
to the pmd. If the task in the middle of the page fault is not the last
user, the ptep returned by huge_pte_alloc now points to another task's
page table or worse. This leads to bad things such as incorrect page
map/reference counts or invalid memory references.
To fix, expand the use of i_mmap_rwsem as follows:
- i_mmap_rwsem is held in read mode whenever huge_pmd_share is called.
huge_pmd_share is only called via huge_pte_alloc, so callers of
huge_pte_alloc take i_mmap_rwsem before calling. In addition, callers
of huge_pte_alloc continue to hold the semaphore until finished with
the ptep.
- i_mmap_rwsem is held in write mode whenever huge_pmd_unshare is called.
One problem with this scheme is that it requires taking i_mmap_rwsem
before taking the page lock during page faults. This is not the order
specified in the rest of mm code. Handling of hugetlbfs pages is mostly
isolated today. Therefore, we use this alternative locking order for
PageHuge() pages.
mapping->i_mmap_rwsem
hugetlb_fault_mutex (hugetlbfs specific page fault mutex)
page->flags PG_locked (lock_page)
To help with lock ordering issues, hugetlb_page_mapping_lock_write() is
introduced to write lock the i_mmap_rwsem associated with a page.
In most cases it is easy to get address_space via vma->vm_file->f_mapping.
However, in the case of migration or memory errors for anon pages we do
not have an associated vma. A new routine _get_hugetlb_page_mapping()
will use anon_vma to get address_space in these cases.
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
Link: http://lkml.kernel.org/r/20200316205756.146666-2-mike.kravetz@oracle.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Userfaultfd fault path was by default killable even if the caller does not
have FAULT_FLAG_KILLABLE. That makes sense before in that when with gup
we don't have FAULT_FLAG_KILLABLE properly set before. Now after previous
patch we've got FAULT_FLAG_KILLABLE applied even for gup code so it should
also make sense to let userfaultfd to honor the FAULT_FLAG_KILLABLE.
Because we're unconditionally setting FAULT_FLAG_KILLABLE in gup code
right now, this patch should have no functional change. It also cleaned
the code a little bit by introducing some helpers.
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Brian Geffon <bgeffon@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Bobby Powers <bobbypowers@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: Martin Cracauer <cracauer@cons.org>
Cc: Marty McFadden <mcfadden8@llnl.gov>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Maya Gokhale <gokhale2@llnl.gov>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Link: http://lkml.kernel.org/r/20200220160300.9941-1-peterx@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
handle_userfaultfd() is currently the only one place in the kernel page
fault procedures that can respond to non-fatal userspace signals. It was
trying to detect such an allowance by checking against USER & KILLABLE
flags, which was "un-official".
In this patch, we introduced a new flag (FAULT_FLAG_INTERRUPTIBLE) to show
that the fault handler allows the fault procedure to respond even to
non-fatal signals. Meanwhile, add this new flag to the default fault
flags so that all the page fault handlers can benefit from the new flag.
With that, replacing the userfault check to this one.
Since the line is getting even longer, clean up the fault flags a bit too
to ease TTY users.
Although we've got a new flag and applied it, we shouldn't have any
functional change with this patch so far.
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Brian Geffon <bgeffon@google.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Bobby Powers <bobbypowers@gmail.com>
Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: Martin Cracauer <cracauer@cons.org>
Cc: Marty McFadden <mcfadden8@llnl.gov>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Maya Gokhale <gokhale2@llnl.gov>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Link: http://lkml.kernel.org/r/20200220195348.16302-1-peterx@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch removes the risk path in handle_userfault() then we will be
sure that the callers of handle_mm_fault() will know that the VMAs might
have changed. Meanwhile with previous patch we don't lose responsiveness
as well since the core mm code now can handle the nonfatal userspace
signals even if we return VM_FAULT_RETRY.
Suggested-by: Andrea Arcangeli <aarcange@redhat.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Brian Geffon <bgeffon@google.com>
Reviewed-by: Jerome Glisse <jglisse@redhat.com>
Cc: Bobby Powers <bobbypowers@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: Martin Cracauer <cracauer@cons.org>
Cc: Marty McFadden <mcfadden8@llnl.gov>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Maya Gokhale <gokhale2@llnl.gov>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Link: http://lkml.kernel.org/r/20200220160234.9646-1-peterx@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Rename (__)memcg_kmem_(un)charge() into (__)memcg_kmem_(un)charge_page()
to better reflect what they are actually doing:
1) call __memcg_kmem_(un)charge_memcg() to actually charge or uncharge
the current memcg
2) set or clear the PageKmemcg flag
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Link: http://lkml.kernel.org/r/20200109202659.752357-4-guro@fb.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>