Pull driver core updates from Greg KH:
"Here is the big set of driver core changes for 5.16-rc1.
All of these have been in linux-next for a while now with no reported
problems.
Included in here are:
- big update and cleanup of the sysfs abi documentation files and
scripts from Mauro. We are almost at the place where we can
properly check that the running kernel's sysfs abi is documented
fully.
- firmware loader updates
- dyndbg updates
- kernfs cleanups and fixes from Christoph
- device property updates
- component fix
- other minor driver core cleanups and fixes"
* tag 'driver-core-5.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (122 commits)
device property: Drop redundant NULL checks
x86/build: Tuck away built-in firmware under FW_LOADER
vmlinux.lds.h: wrap built-in firmware support under FW_LOADER
firmware_loader: move struct builtin_fw to the only place used
x86/microcode: Use the firmware_loader built-in API
firmware_loader: remove old DECLARE_BUILTIN_FIRMWARE()
firmware_loader: formalize built-in firmware API
component: do not leave master devres group open after bind
dyndbg: refine verbosity 1-4 summary-detail
gpiolib: acpi: Replace custom code with device_match_acpi_handle()
i2c: acpi: Replace custom function with device_match_acpi_handle()
driver core: Provide device_match_acpi_handle() helper
dyndbg: fix spurious vNpr_info change
dyndbg: no vpr-info on empty queries
dyndbg: vpr-info on remove-module complete, not starting
device property: Add missed header in fwnode.h
Documentation: dyndbg: Improve cli param examples
dyndbg: Remove support for ddebug_query param
dyndbg: make dyndbg a known cli param
dyndbg: show module in vpr-info in dd-exec-queries
...
ext4_abort will eventually call ext4_errno_to_code, which translates the
errno to an EXT4_ERR specific error. This means that ext4_abort expects
an errno. By using EXT4_ERR_ here, it gets misinterpreted (as an errno),
and ends up saving EXT4_ERR_EBUSY on the superblock during an abort,
which makes no sense.
ESHUTDOWN will get properly translated to EXT4_ERR_SHUTDOWN, so use that
instead.
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Link: https://lore.kernel.org/r/20211026173302.84000-1-krisman@collabora.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Since there are no blocks in an inline data inode, there's no point in
fixing iblocks field in fast commit replay path for this inode.
Similarly, there's no point in fixing any block bitmaps / global block
counters with respect to such an inode. Just bail out from these
functions if an inline data inode is encountered.
Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20211015182513.395917-2-harshads@google.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
During the commit phase in fast commits if an inode with inline data
is being committed, also commit the inline data along with
inode. Since recovery code just blindly copies entire content found in
inode TLV, there is no change needed on the recovery path. Thus, this
change is backward compatiable.
Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20211015182513.395917-1-harshads@google.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
As commit 6920b39132 ("ext4: add new helper interface
ext4_try_to_trim_range()") moves some code into the separate function
ext4_try_to_trim_range(), the use of the variable ret within that
function is more limited and can be adjusted as well.
Scope the use of the variable ret locally and drop dead assignments.
No functional change.
Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Link: https://lore.kernel.org/r/20210820120853.23134-1-lukas.bulwahn@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
The 'enable_quota' variable is only used in an CONFIG_QUOTA.
With CONFIG_QUOTA=n, compiler causes a harmless warning:
fs/ext4/super.c: In function ‘ext4_remount’:
fs/ext4/super.c:5840:6: warning: variable ‘enable_quota’ set but not used
[-Wunused-but-set-variable]
int enable_quota = 0;
^~~~~
Move 'enable_quota' into the same #ifdef CONFIG_QUOTA block
to remove an unused variable warning.
Signed-off-by: Austin Kim <austindh.kim@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20210824034929.GA13415@raspberrypi
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
In ext4_get_inode_loc(), we may skip IO and get an zero && uptodate
inode buffer when the inode monopolize an inode block for performance
reason. For most cases, ext4_mark_iloc_dirty() will fill the inode
buffer to make it fine, but we could miss this call if something bad
happened. Finally, __ext4_get_inode_loc_noinmem() may probably get an
empty inode buffer and trigger ext4 error.
For example, if we remove a nonexistent xattr on inode A,
ext4_xattr_set_handle() will return ENODATA before invoking
ext4_mark_iloc_dirty(), it will left an uptodate but zero buffer. We
will get checksum error message in ext4_iget() when getting inode again.
EXT4-fs error (device sda): ext4_lookup:1784: inode #131074: comm cat: iget: checksum invalid
Even worse, if we allocate another inode B at the same inode block, it
will corrupt the inode A on disk when write back inode B.
So this patch initialize the inode buffer by filling the in-mem inode
contents if we skip read I/O, ensure that the buffer is really uptodate.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20210901020955.1657340-4-yi.zhang@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
In the most error path of current extents updating operations are not
roll back partial updates properly when some bad things happens(.e.g in
ext4_ext_insert_extent()). So we may get an inconsistent extents tree
if journal has been aborted due to IO error, which may probability lead
to BUGON later when we accessing these extent entries in errors=continue
mode. This patch drop extent buffer's verify flag before updatng the
contents in ext4_ext_get_access(), and reset it after updating in
__ext4_ext_dirty(). After this patch we could force to check the extent
buffer if extents tree updating was break off, make sure the extents are
consistent.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Link: https://lore.kernel.org/r/20210908120850.4012324-4-yi.zhang@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Now that we can check out overlapping extents in leaf block and
out-of-order index extents in index block. But the .ee_block in the
first extent of one leaf block should equal to the .ei_block in it's
parent index extent entry. This patch add a check to verify such
inconsistent between the index and leaf block.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://lore.kernel.org/r/20210908120850.4012324-3-yi.zhang@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
After commit 5946d08937 ("ext4: check for overlapping extents in
ext4_valid_extent_entries()"), we can check out the overlapping extent
entry in leaf extent blocks. But the out-of-order extent entry in index
extent blocks could also trigger bad things if the filesystem is
inconsistent. So this patch add a check to figure out the out-of-order
index extents and return error.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Link: https://lore.kernel.org/r/20210908120850.4012324-2-yi.zhang@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
After we drop i_data sem, we need to reload the ext4_ext_path
structure since the extent tree can change once i_data_sem is
released.
This addresses the BUG:
[52117.465187] ------------[ cut here ]------------
[52117.465686] kernel BUG at fs/ext4/extents.c:1756!
...
[52117.478306] Call Trace:
[52117.478565] ext4_ext_shift_extents+0x3ee/0x710
[52117.479020] ext4_fallocate+0x139c/0x1b40
[52117.479405] ? __do_sys_newfstat+0x6b/0x80
[52117.479805] vfs_fallocate+0x151/0x4b0
[52117.480177] ksys_fallocate+0x4a/0xa0
[52117.480533] __x64_sys_fallocate+0x22/0x30
[52117.480930] do_syscall_64+0x35/0x80
[52117.481277] entry_SYSCALL_64_after_hwframe+0x44/0xae
[52117.481769] RIP: 0033:0x7fa062f855ca
Cc: stable@kernel.org
Link: https://lore.kernel.org/r/20210903062748.4118886-4-yangerkun@huawei.com
Signed-off-by: yangerkun <yangerkun@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Ext4 file system has default lazy inode table initialization setup once
it is mounted. However, it has issue on computing the next schedule time
that makes the timeout same amount in jiffies but different real time in
secs if with various HZ values. Therefore, fix by measuring the current
time in a more granular unit nanoseconds and make the next schedule time
independent of the HZ value.
Fixes: bfff68738f ("ext4: add support for lazy inode table initialization")
Signed-off-by: Shaoying Xu <shaoyi@amazon.com>
Cc: stable@vger.kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Link: https://lore.kernel.org/r/20210902164412.9994-2-shaoyi@amazon.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
This reverts commit 948ca5f30e.
Two crash reports from users running variations on 5.15-rc4 kernels
suggest that it is premature to enforce the state assertion in the
original commit. Both crashes were triggered by BUG calls in that
code, indicating that under some rare circumstance the buffer head
state did not match a delayed allocated block at the time the
block was written out. No reproducer is available. Resolving this
problem will require more time than remains in the current release
cycle, so reverting the original patch for the time being is necessary
to avoid any instability it may cause.
Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Link: https://lore.kernel.org/r/20211012171901.5352-1-enwlinux@gmail.com
Fixes: 948ca5f30e ("ext4: enforce buffer head state assertion in ext4_da_map_blocks")
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
This regression can be reproduced with ntfs-3g and overlayfs:
mkdir lower upper work overlay
dd if=/dev/zero of=ntfs.raw bs=1M count=2
mkntfs -F ntfs.raw
mount ntfs.raw lower
touch lower/file.txt
mount -t overlay -o lowerdir=lower,upperdir=upper,workdir=work - overlay
mv overlay/file.txt overlay/file2.txt
mv fails and (misleadingly) prints
mv: cannot move 'overlay/file.txt' to a subdirectory of itself, 'overlay/file2.txt'
The reason is that ovl_copy_fileattr() is triggered due to S_NOATIME being
set on all inodes (by fuse) regardless of fileattr.
ovl_copy_fileattr() tries to retrieve file attributes from lower file, but
that fails because filesystem does not support this ioctl (this should fail
with ENOTTY, but ntfs-3g return EINVAL instead). This failure is
propagated to origial operation (in this case rename) that triggered the
copy-up.
The fix is to ignore ENOTTY and EINVAL errors from fileattr_get() in copy
up. This also requires turning the internal ENOIOCTLCMD into ENOTTY.
As a further measure to prevent unnecessary failures, only try the
fileattr_get/set on upper if there are any flags to copy up.
Side note: a number of filesystems set S_NOATIME (and sometimes other inode
flags) irrespective of fileattr flags. This causes unnecessary calls
during copy up, which might lead to a performance issue, especially if
latency is high. To fix this, the kernel would need to differentiate
between the two cases. E.g. introduce SB_NOATIME_UPDATE, a per-sb variant
of S_NOATIME. SB_NOATIME doesn't work, because that's interpreted as
"filesystem doesn't store an atime attribute"
Reported-and-tested-by: Kevin Locke <kevin@kevinlocke.name>
Fixes: 72db82115d ("ovl: copy up sync/noatime fileattr flags")
Cc: <stable@vger.kernel.org> # v5.15
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Syzbot triggered the following warning in ovl_workdir_create() ->
ovl_create_real():
if (!err && WARN_ON(!newdentry->d_inode)) {
The reason is that the cgroup2 filesystem returns from mkdir without
instantiating the new dentry.
Weird filesystems such as this will be rejected by overlayfs at a later
stage during setup, but to prevent such a warning, call ovl_mkdir_real()
directly from ovl_workdir_create() and reject this case early.
Reported-and-tested-by: syzbot+75eab84fd0af9e8bf66b@syzkaller.appspotmail.com
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
This patch removes a list_first_entry() call which is already done by
the previous con_next_wq() call.
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
Pull per signal_struct coredumps from Eric Biederman:
"Current coredumps are mixed up with the exit code, the signal handling
code, and the ptrace code making coredumps much more complicated than
necessary and difficult to follow.
This series of changes starts with ptrace_stop and cleans it up,
making it easier to follow what is happening in ptrace_stop. Then
cleans up the exec interactions with coredumps. Then cleans up the
coredump interactions with exit. Finally the coredump interactions
with the signal handling code is cleaned up.
The first and last changes are bug fixes for minor bugs.
I believe the fact that vfork followed by execve can kill the process
the called vfork if exec fails is sufficient justification to change
the userspace visible behavior.
In previous discussions some of these changes were organized
differently and individually appeared to make the code base worse. As
currently written I believe they all stand on their own as cleanups
and bug fixes.
Which means that even if the worst should happen and the last change
needs to be reverted for some unimaginable reason, the code base will
still be improved.
If the worst does not happen there are a more cleanups that can be
made. Signals that generate coredumps can easily become eligible for
short circuit delivery in complete_signal. The entire rendezvous for
generating a coredump can move into get_signal. The function
force_sig_info_to_task be written in a way that does not modify the
signal handling state of the target task (because coredumps are
eligible for short circuit delivery). Many of these future cleanups
can be done another way but nothing so cleanly as if coredumps become
per signal_struct"
* 'per_signal_struct_coredumps-for-v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
coredump: Limit coredumps to a single thread group
coredump: Don't perform any cleanups before dumping core
exit: Factor coredump_exit_mm out of exit_mm
exec: Check for a pending fatal signal instead of core_state
ptrace: Remove the unnecessary arguments from arch_ptrace_stop
signal: Remove the bogus sigkill_pending in ptrace_stop
During umount, the session slot tables are freed. If there are
outstanding FREE_STATEID tasks, a use-after-free and slab corruption can
occur when rpc_exit_task calls rpc_call_done -> nfs41_sequence_done ->
nfs4_sequence_process/nfs41_sequence_free_slot.
Prevent that from happening by taking a reference on the nfs_client in
nfs41_free_stateid and putting it in nfs41_free_stateid_release.
Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
After the assignment, only exit path with label 'err' uses ret as
return value. However,before exiting through this path with label 'err',
ret is assigned with the return value of io_wq_max_workers(). Hence, the
initial assignment is redundant and can be removed.
Signed-off-by: Nghia Le <nghialm78@gmail.com>
Link: https://lore.kernel.org/r/20211102190521.28291-1-nghialm78@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull xfs updates from Darrick Wong:
"This cycle we've worked on fixing bugs and improving XFS' memory
footprint.
The most notable fixes include: fixing a corruption warning (and free
space accounting skew) if copy on write fails; fixing slab cache
misuse if SLOB is enabled, which apparently was broken for years
without anybody noticing; and fixing a potential race with online
shrinkfs.
Otherwise, the bulk of the changes here involve setting up separate
slab caches for frequently used items such as btree cursors and log
intent items, and compacting the structures to reduce memory usage of
those items substantially. This also sets us up to support larger
btrees in future kernels. We also switch parts of online fsck to
allocate scrub context information from the heap instead of using
stack space.
Summary:
- Bug fixes and cleanups for kernel memory allocation usage, this
time without touching the mm code.
- Refactor the log recovery mechanism that preserves held resources
across a transaction roll so that it uses the exact same mechanism
that we use for that during regular runtime.
- Fix bugs and tighten checking around btree heights.
- Remove more old typedefs.
- Fix perag reference leaks when racing with growfs.
- Remove unused fields from xfs_btree_cur.
- Allocate various scrub structures on the heap to reduce stack
usage.
- Pack xfs_btree_cur fields and rearrange to support arbitrary
heights.
- Compute maximum possible heights for each btree height, and use
that to set up slab caches for each btree type.
- Finally remove kmem_zone_t, since these have always been struct
kmem_cache on Linux.
- Compact the structures used to coordinate work intent items.
- Set up slab caches for each work intent item type.
- Rename the "bmap_add_free" function to "free_extent_later", which
more accurately describes what it does.
- Fix corruption warning on unmount when a CoW preallocation covers a
data fork delalloc reservation but then the CoW fails.
- Add some more minor code improvements"
* tag 'xfs-5.16-merge-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (45 commits)
xfs: use swap() to make code cleaner
xfs: Remove duplicated include in xfs_super
xfs: punch out data fork delalloc blocks on COW writeback failure
xfs: remove unused parameter from refcount code
xfs: reduce the size of struct xfs_extent_free_item
xfs: rename xfs_bmap_add_free to xfs_free_extent_later
xfs: create slab caches for frequently-used deferred items
xfs: compact deferred intent item structures
xfs: rename _zone variables to _cache
xfs: remove kmem_zone typedef
xfs: use separate btree cursor cache for each btree type
xfs: compute absolute maximum nlevels for each btree type
xfs: kill XFS_BTREE_MAXLEVELS
xfs: compute the maximum height of the rmap btree when reflink enabled
xfs: clean up xfs_btree_{calc_size,compute_maxlevels}
xfs: compute maximum AG btree height for critical reservation calculation
xfs: rename m_ag_maxlevels to m_allocbt_maxlevels
xfs: dynamically allocate cursors based on maxlevels
xfs: encode the max btree height in the cursor
xfs: refactor btree cursor allocation function
...
Pull AFS updates from David Howells:
- Split the readpage handler for symlinks from the one for files. The
symlink readpage isn't given a file pointer, so the handling has to
be special-cased.
This has been posted as part of a patchset to foliate netfs, afs,
etc.[1] but I've moved it to this one as it's not actually doing
foliation but is more of a pre-cleanup.
- Fix file creation to set the mtime from the client's clock to keep
make happy if the server's clock isn't quite in sync.[2]
Link: https://lore.kernel.org/r/163005742570.2472992.7800423440314043178.stgit@warthog.procyon.org.uk/ [1]
Link: http://lists.infradead.org/pipermail/linux-afs/2021-October/004395.html [2]
* tag 'afs-next-20211102' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
afs: Set mtime from the client for yfs create operations
afs: Sort out symlink reading
This patch fixes the following crash by receiving a invalid message:
[ 160.672220] ==================================================================
[ 160.676206] BUG: KASAN: user-memory-access in dlm_user_add_ast+0xc3/0x370
[ 160.679659] Read of size 8 at addr 00000000deadbeef by task kworker/u32:13/319
[ 160.681447]
[ 160.681824] CPU: 10 PID: 319 Comm: kworker/u32:13 Not tainted 5.14.0-rc2+ #399
[ 160.683472] Hardware name: Red Hat KVM/RHEL-AV, BIOS 1.14.0-1.module+el8.6.0+12648+6ede71a5 04/01/2014
[ 160.685574] Workqueue: dlm_recv process_recv_sockets
[ 160.686721] Call Trace:
[ 160.687310] dump_stack_lvl+0x56/0x6f
[ 160.688169] ? dlm_user_add_ast+0xc3/0x370
[ 160.689116] kasan_report.cold.14+0x116/0x11b
[ 160.690138] ? dlm_user_add_ast+0xc3/0x370
[ 160.690832] dlm_user_add_ast+0xc3/0x370
[ 160.691502] _receive_unlock_reply+0x103/0x170
[ 160.692241] _receive_message+0x11df/0x1ec0
[ 160.692926] ? rcu_read_lock_sched_held+0xa1/0xd0
[ 160.693700] ? rcu_read_lock_bh_held+0xb0/0xb0
[ 160.694427] ? lock_acquire+0x175/0x400
[ 160.695058] ? do_purge.isra.51+0x200/0x200
[ 160.695744] ? lock_acquired+0x360/0x5d0
[ 160.696400] ? lock_contended+0x6a0/0x6a0
[ 160.697055] ? lock_release+0x21d/0x5e0
[ 160.697686] ? lock_is_held_type+0xe0/0x110
[ 160.698352] ? lock_is_held_type+0xe0/0x110
[ 160.699026] ? ___might_sleep+0x1cc/0x1e0
[ 160.699698] ? dlm_wait_requestqueue+0x94/0x140
[ 160.700451] ? dlm_process_requestqueue+0x240/0x240
[ 160.701249] ? down_write_killable+0x2b0/0x2b0
[ 160.701988] ? do_raw_spin_unlock+0xa2/0x130
[ 160.702690] dlm_receive_buffer+0x1a5/0x210
[ 160.703385] dlm_process_incoming_buffer+0x726/0x9f0
[ 160.704210] receive_from_sock+0x1c0/0x3b0
[ 160.704886] ? dlm_tcp_shutdown+0x30/0x30
[ 160.705561] ? lock_acquire+0x175/0x400
[ 160.706197] ? rcu_read_lock_sched_held+0xa1/0xd0
[ 160.706941] ? rcu_read_lock_bh_held+0xb0/0xb0
[ 160.707681] process_recv_sockets+0x32/0x40
[ 160.708366] process_one_work+0x55e/0xad0
[ 160.709045] ? pwq_dec_nr_in_flight+0x110/0x110
[ 160.709820] worker_thread+0x65/0x5e0
[ 160.710423] ? process_one_work+0xad0/0xad0
[ 160.711087] kthread+0x1ed/0x220
[ 160.711628] ? set_kthread_struct+0x80/0x80
[ 160.712314] ret_from_fork+0x22/0x30
The issue is that we received a DLM message for a user lock but the
destination lock is a kernel lock. Note that the address which is trying
to derefence is 00000000deadbeef, which is in a kernel lock
lkb->lkb_astparam, this field should never be derefenced by the DLM
kernel stack. In case of a user lock lkb->lkb_astparam is lkb->lkb_ua
(memory is shared by a union field). The struct lkb_ua will be handled
by the DLM kernel stack but on a kernel lock it will contain invalid
data and ends in most likely crashing the kernel.
It can be reproduced with two cluster nodes.
node 2:
dlm_tool join test
echo "862 fooobaar 1 2 1" > /sys/kernel/debug/dlm/test_locks
echo "862 3 1" > /sys/kernel/debug/dlm/test_waiters
node 1:
dlm_tool join test
python:
foo = DLM(h_cmd=3, o_nextcmd=1, h_nodeid=1, h_lockspace=0x77222027, \
m_type=7, m_flags=0x1, m_remid=0x862, m_result=0xFFFEFFFE)
newFile = open("/sys/kernel/debug/dlm/comms/2/rawmsg", "wb")
newFile.write(bytes(foo))
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>