We clear the bit marking the ctx task_work as active after having run
the queued work, but we really should be clearing it before. Otherwise
we can hit a tiny race ala:
CPU0 CPU1
io_task_work_add() tctx_task_work()
run_work
add_to_list
test_and_set_bit
clear_bit
already set
and CPU0 will return thinking the task_work is queued, while in reality
it's already being run. If we hit the condition after __tctx_task_work()
found no more work, but before we've cleared the bit, then we'll end up
thinking it's queued and will be run. In reality it is queued, but we
didn't queue the ctx task_work to ensure that it gets run.
Fixes: 7cbf1722d5 ("io_uring: provide FIFO ordering for task_work")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If we put the io-wq from io_uring, we really want it to exit. Provide
a helper that does that for us. Couple that with not having the manager
hold a reference to the 'wq' and the normal SQPOLL exit will tear down
the io-wq context appropriate.
On the io-wq side, our wq context is per task, so only the task itself
is manipulating ->manager and hence it's safe to check and clear without
any extra locking. We just need to ensure that the manager task stays
around, in case it exits.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We want to reuse this completion, and a single complete should do just
fine. Ensure that we park ourselves first if requested, as that is what
lead to the initial deadlock in this area. If we've got someone attempting
to park us, then we can't proceed without having them finish first.
Fixes: 37d1e2e364 ("io_uring: move SQPOLL thread io-wq forked worker")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_uring_try_cancel_requests() matches not only current's requests, but
also of other exiting tasks, so we need to actively cancel them and not
just wait, especially since the function can be called on flush during
do_exit() -> exit_files().
Even if it's not a problem for now, it's much nicer to know that the
function tries to cancel everything it can.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If we fail to fork an SQPOLL worker, we can hit cancel, and hence
attempted thread stop, with the thread already being stopped. Ensure
we check for that.
Also guard thread stop fully by the sqd mutex, just like we do for
park.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We are already freeing the wq struct in both spots, so don't put it and
get it freed twice.
Reported-by: syzbot+7bf785eedca35ca05501@syzkaller.appspotmail.com
Fixes: 4fb6ac3262 ("io-wq: improve manager/worker handling over exec")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The manager waits for the workers, hence the manager is always valid if
workers are running. Now also have wq destroy wait for the manager on
exit, so we now everything is gone.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This is a leftover from a different use cases, it's used to wait for
the manager to startup. Rename it as such.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If we're in the process of shutting down the async context, then don't
create new workers if we already have at least the fixed one.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Instead of having to wait separately on workers and manager, just have
the manager wait on the workers. We use an atomic_t for the reference
here, as we need to start at 0 and allow increment from that. Since the
number of workers is naturally capped by the allowed nr of processes,
and that uses an int, there is no risk of overflow.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We need to have our worker count updated before continuing, to avoid
cases where we repeatedly think we need a new worker, but a fork is
already in progress.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
- Restore a disused sysctl control knob that was inadvertently dropped
during the merge window to avoid fstests regressions.
- Don't speculatively release freed blocks from the busy list until
we're actually allocating them, which fixes a rare log recovery
regression.
- Don't nest transactions when scanning for free space.
- Add an idiot^Wmaintainer light to detect nested transactions. ;)
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAmA3zW8ACgkQ+H93GTRK
tOtT4xAAmZ5BdQ6V3yUeT/N++L6Ax62T2VzEryZvVK/ZFyVBRYKi9LOL1exq1cja
HXINPuYWAD8TbGVU9/lZR1yUX/y1VvJR0EPly8EN6WpGFeErSxLs++YzP1Q8iv5i
ZtniscpGE6JvCcDeRH5kBfklGpyzTf3t6Xe8x+6+/aawf34ChNlM/gQcAyKvYYU5
Jb9j7BqbRAnhvPEfa554yxIIoZhmTDYY7Wx7VMKCMcOP1lfriC+I1iuiZIMONIQJ
mMgz9XnHVo256+YvkvwRKp294r+MEkuJL5EBXrs01r3PwVdaigo13qTk8l1ZC3zS
VYkC/sRoiyMwnJvKEUNtnM3/8Zu/DvPp9iqXiWc60UBGqpBkm8Jgv+W6H7u1FinP
0M0Wt2wHC7e51uW5G/8QwUXZv+n8IZHyZkkYbjyXRkhfyFlexYwTVchZz9q/RB/A
HEZ9jcIke8Rwkav4f0kJ00Y/7FQSPn6ItapXf92rl00z3Z5S2sqBaT5kIotsW0Ke
634yPknkLuBDQg4j8l3A88ik2SNFRQQfBXsjt27He/s2wV0Dj8RjDnLWfoV7P5to
Sc2lx3HhL4OCojAXXAFP3MDKz0nqcuUTPoPCeS6QKQGcjTzVvoI7ZutXODcxi67k
Q7AK+gIqHRWA8F+4wciYDwAHMES1rRAa7/iuYmtCtT1sBdXp9NU=
=g9K3
-----END PGP SIGNATURE-----
Merge tag 'xfs-5.12-merge-6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull more xfs updates from Darrick Wong:
"The most notable fix here prevents premature reuse of freed metadata
blocks, and adding the ability to detect accidental nested
transactions, which are not allowed here.
- Restore a disused sysctl control knob that was inadvertently
dropped during the merge window to avoid fstests regressions.
- Don't speculatively release freed blocks from the busy list until
we're actually allocating them, which fixes a rare log recovery
regression.
- Don't nest transactions when scanning for free space.
- Add an idiot^Wmaintainer light to detect nested transactions. ;)"
* tag 'xfs-5.12-merge-6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: use current->journal_info for detecting transaction recursion
xfs: don't nest transactions when scanning for eofblocks
xfs: don't reuse busy extents on extent trim
xfs: restore speculative_cow_prealloc_lifetime sysctl
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmA6njIQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgprolD/9zWti9LsZvA7yE+PhVwrwF3CsNzLfQlClw
99HaA7HxtAc/VLJrnD/SubhCAPdBC5B2xPv6faajdwF2iUR3Rr1Uc93CQ3uP2KKq
kvm6ALTpzPTMI6YSABhY74sg9BkkoDbMo54JQYVQPleiE+5eDLbuFZck6ObfUHyY
a4aaImlndWp/t14GzrClL4hucF+5KJy846P+QCVclkh0yl8xSsqZ5LIFU7tu3iQb
HpZ5HKLT/2ma/EOr3wknnsIe97AUZQU0q5aMparhYlm+qR511eop3QXx850FL/oC
tEGceKLij6qazmkiocKVzML8Fs+Y9/a4vCMjLCScWJmzDlmKdlH2uudeahN6b9Hm
15qRQHOjl1Hc2bdr5ZVn87nq9RWhSm18C+SRMwOKHCOnEhwxqM3RjRfAgj4BJ6QB
PFbFqdY+8Y1YLPFmn9hph72ePaEcN4L2IXW6TI/WX8mot8ODAnkq9Hr38dKwzO+i
0mon6DVyJKKho6XwvVu5IYurkR2beQprjeVUxwZjjT6DxUgsc+J6itK5LDHFSkeZ
qZlXn5Di8MkiXg0DFJYDQiFXnO0Z5GlRWOGPVfBaOr3x+1dqzDdHGw4oz1oGqvnr
GNNYCsYIpDGm7eauX5lqL5MUFpjqRCceXy5JSHPhnWWw617nYkr4H9jdsV9HiTX1
tQFx05QW3w==
=ccMs
-----END PGP SIGNATURE-----
Merge tag 'block-5.12-2021-02-27' of git://git.kernel.dk/linux-block
Pull more block updates from Jens Axboe:
"A few stragglers (and one due to me missing it originally), and fixes
for changes in this merge window mostly. In particular:
- blktrace cleanups (Chaitanya, Greg)
- Kill dead blk_pm_* functions (Bart)
- Fixes for the bio alloc changes (Christoph)
- Fix for the partition changes (Christoph, Ming)
- Fix for turning off iopoll with polled IO inflight (Jeffle)
- nbd disconnect fix (Josef)
- loop fsync error fix (Mauricio)
- kyber update depth fix (Yang)
- max_sectors alignment fix (Mikulas)
- Add bio_max_segs helper (Matthew)"
* tag 'block-5.12-2021-02-27' of git://git.kernel.dk/linux-block: (21 commits)
block: Add bio_max_segs
blktrace: fix documentation for blk_fill_rw()
block: memory allocations in bounce_clone_bio must not fail
block: remove the gfp_mask argument to bounce_clone_bio
block: fix bounce_clone_bio for passthrough bios
block-crypto-fallback: use a bio_set for splitting bios
block: fix logging on capacity change
blk-settings: align max_sectors on "logical_block_size" boundary
block: reopen the device in blkdev_reread_part
block: don't skip empty device in in disk_uevent
blktrace: remove debugfs file dentries from struct blk_trace
nbd: handle device refs for DESTROY_ON_DISCONNECT properly
kyber: introduce kyber_depth_updated()
loop: fix I/O error on fsync() in detached loop devices
block: fix potential IO hang when turning off io_poll
block: get rid of the trace rq insert wrapper
blktrace: fix blk_rq_merge documentation
blktrace: fix blk_rq_issue documentation
blktrace: add blk_fill_rwbs documentation comment
block: remove superfluous param in blk_fill_rwbs()
...
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmA4JRkQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpoWqD/9dbbqe8L701U6May1A/4hRsqL4THTA2flx
vNCNRBl6XV3l/wBCtL6waKy6tyO4lyM8XdUdEvo3Kxl2kGPb8eVfpyYL/+77HqyH
ctT4RMrs+84Mxn+5N6cM97hS1qVI2moTxxyvOEl/JTB7BYrutz9gvAoeY3/Dto47
J66oSaPeuqJ32TyihxfQHVxQopJcqFzDjyoYHGDu6ATio1PXfaIdTu8ywVYSECAh
pWI4rwnqdurGuHMNpxyL1bA6CT/jC7s+sqU7bUYUCgtYI3eG0u3V0bp5gAQQIgl9
5sxxE3DidYGAkYZsosrelshBtzGddLdz4Qrt2ungMYv8RsGNpFQ095jDPKDwFaZj
bSvSsfplCo7iFsJByb1TtpNEOW8eAwi81PmBDVQ9Oq5P5ygTYno9GBDc/20ql0Fk
q6wcX28coE3IBw44ne0hIwvBOtXV4WJyluG/gqOxfbTH+kOy3pDsN8lWcY/P4X0U
yzdU2MLHe8BNMyYlUiBF47Amzt4ltr85P4XD3WZ4bX71iwri6HvrdGWLuuKwX+Ie
66QiIDDQIYZQ6NMMJWS9DGW3y3DBizpSXGxONbOw1J2bQdNmtToR0D2UnK/9UnKp
msnvkUNk8fkYGS4aptpJ6HxbmjMEG5YtbiGlPj6fz5/7MTvhRjPxt7A0LWrUIdqR
f88+sHUMqg==
=oc8u
-----END PGP SIGNATURE-----
Merge tag 'io_uring-worker.v3-2021-02-25' of git://git.kernel.dk/linux-block
Pull io_uring thread rewrite from Jens Axboe:
"This converts the io-wq workers to be forked off the tasks in question
instead of being kernel threads that assume various bits of the
original task identity.
This kills > 400 lines of code from io_uring/io-wq, and it's the worst
part of the code. We've had several bugs in this area, and the worry
is always that we could be missing some pieces for file types doing
unusual things (recent /dev/tty example comes to mind, userfaultfd
reads installing file descriptors is another fun one... - both of
which need special handling, and I bet it's not the last weird oddity
we'll find).
With these identical workers, we can have full confidence that we're
never missing anything. That, in itself, is a huge win. Outside of
that, it's also more efficient since we're not wasting space and code
on tracking state, or switching between different states.
I'm sure we're going to find little things to patch up after this
series, but testing has been pretty thorough, from the usual
regression suite to production. Any issue that may crop up should be
manageable.
There's also a nice series of further reductions we can do on top of
this, but I wanted to get the meat of it out sooner rather than later.
The general worry here isn't that it's fundamentally broken. Most of
the little issues we've found over the last week have been related to
just changes in how thread startup/exit is done, since that's the main
difference between using kthreads and these kinds of threads. In fact,
if all goes according to plan, I want to get this into the 5.10 and
5.11 stable branches as well.
That said, the changes outside of io_uring/io-wq are:
- arch setup, simple one-liner to each arch copy_thread()
implementation.
- Removal of net and proc restrictions for io_uring, they are no
longer needed or useful"
* tag 'io_uring-worker.v3-2021-02-25' of git://git.kernel.dk/linux-block: (30 commits)
io-wq: remove now unused IO_WQ_BIT_ERROR
io_uring: fix SQPOLL thread handling over exec
io-wq: improve manager/worker handling over exec
io_uring: ensure SQPOLL startup is triggered before error shutdown
io-wq: make buffered file write hashed work map per-ctx
io-wq: fix race around io_worker grabbing
io-wq: fix races around manager/worker creation and task exit
io_uring: ensure io-wq context is always destroyed for tasks
arch: ensure parisc/powerpc handle PF_IO_WORKER in copy_thread()
io_uring: cleanup ->user usage
io-wq: remove nr_process accounting
io_uring: flag new native workers with IORING_FEAT_NATIVE_WORKERS
net: remove cmsg restriction from io_uring based send/recvmsg calls
Revert "proc: don't allow async path resolution of /proc/self components"
Revert "proc: don't allow async path resolution of /proc/thread-self components"
io_uring: move SQPOLL thread io-wq forked worker
io-wq: make io_wq_fork_thread() available to other users
io-wq: only remove worker from free_list, if it was there
io_uring: remove io_identity
io_uring: remove any grabbing of context
...
Pull misc vfs updates from Al Viro:
"Assorted stuff pile - no common topic here"
* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
whack-a-mole: don't open-code iminor/imajor
9p: fix misuse of sscanf() in v9fs_stat2inode()
audit_alloc_mark(): don't open-code ERR_CAST()
fs/inode.c: make inode_init_always() initialize i_ino to 0
vfs: don't unnecessarily clone write access for writable fds
It's often inconvenient to use BIO_MAX_PAGES due to min() requiring the
sign to be the same. Introduce bio_max_segs() and change BIO_MAX_PAGES to
be unsigned to make it easier for the users.
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmA4bocACgkQiiy9cAdy
T1HuMwv/bmZ53ilDwhCph/UKoLJm/bRyvCp+GBECtLLS/C/4qz5IBLpPPr2yhOyH
gmkeCZZWhj0nzGAYxhVDAdBz9IPEA7bae503IaOuk5uXaCXC8htsq/Cd7qpJmHlf
5vCdBTmHBiUt02dUcZ9A3bm855xJLEINHH9YdJM157ysqLgttIibLQB0F/gJ49DR
QIIdq7sZNJXcTgRsUzJZNnrWLDi2oVIoUlq5M6d8ypmZC0ArPNfrSafjW6h5rqpj
UYBwtUDNwQiS0lgwR4mji4PCen0GGwMFtyVDOpdJLJq3fO995yse2BRk0BFHVH1i
xfAskQjkxAHEcfzQC1cM4ouT/WYu8nHaLK1vp/1lVr93mo8KqSX+SW/bLsXfpQkm
w//xMy94HdM2pgyM6J1pYnKPb7s/DG19RYPktQ5oYn0fR5qYlqALAmd02JRjO3xV
cbjbmWXXzFFsFJc5MJmM6wVLJzRb4a50SN1W37aHNXWi8ktpsaNzz33LGw1pF2OT
K6P2DqJe
=RQo/
-----END PGP SIGNATURE-----
Merge tag '5.12-smb3-part1' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs updates from Steve French:
- improvements to mode bit conversion, chmod and chown when using
cifsacl mount option
- two new mount options for controlling attribute caching
- improvements to crediting and reconnect, improved debugging
- reconnect fix
- add SMB3.1.1 dialect to default dialects for vers=3
* tag '5.12-smb3-part1' of git://git.samba.org/sfrench/cifs-2.6: (27 commits)
cifs: update internal version number
cifs: use discard iterator to discard unneeded network data more efficiently
cifs: introduce helper for finding referral server to improve DFS target resolution
cifs: check all path components in resolved dfs target
cifs: fix DFS failover
cifs: fix nodfs mount option
cifs: fix handling of escaped ',' in the password mount argument
cifs: Add new parameter "acregmax" for distinct file and directory metadata timeout
cifs: convert revalidate of directories to using directory metadata cache timeout
cifs: Add new mount parameter "acdirmax" to allow caching directory metadata
cifs: If a corrupted DACL is returned by the server, bail out.
cifs: minor simplification to smb2_is_network_name_deleted
TCON Reconnect during STATUS_NETWORK_NAME_DELETED
cifs: cleanup a few le16 vs. le32 uses in cifsacl.c
cifs: Change SIDs in ACEs while transferring file ownership.
cifs: Retain old ACEs when converting between mode bits and ACL.
cifs: Fix cifsacl ACE mask for group and others.
cifs: clarify hostname vs ip address in /proc/fs/cifs/DebugData
cifs: change confusing field serverName (to ip_addr)
cifs: Fix inconsistent IS_ERR and PTR_ERR
...
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmA4I1IQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpvt+D/sGmXUzjHagIicAGyiJEo8mQhTFgnWrqEbU
2bfJaN5gz7sf/9SO1ZnjohqEEBjuCngCsvrgnRYfrZSCmRrA00xj8UY7bRzLVZZT
HskXfoXonqbsI9Lg5o3StMWIL/Btq1sWhyhMbiHElS0ESGh2+tP3hIZOSeCNIEgL
ROlGvxoPXjXVgmeqOsYNWKKjpEjJSSFkD4aZ/s8uzJNNMziVb5a4oXHDtFnKV5jY
SY+9lvwKkIcX6V7bMCLI9Et6CeqTt45hN9rfQ+hUEdISij04ZwZU9rqfsAGlm9TK
wagckYKh2y9/aH50f1arkHHqS/S9X2gFAAfKGMyFL2pdQut4OAeJSwucAMP7g/gd
P4wR3eID2y9lP8cYuu5S0KjnySa1QMriQGKBfjLMFNFdRtfo4n5p0PJa5fj7S7Wa
GwXgwfk5s4z70Ra301lU4+zzBQrgRsffx5/aV3jEk9CfrwHM4VeKWBk4h859Koyw
2kruKwRiH2pm+RhAumNfLd7D6hD4xmFuQkhoaCDgxCJ65bumZNWPTQSfe1ikrCsB
Rd/92gunKICETiFOCKCBh3/eDLvIi/Z/M6ZDIyD0ywEEH2R1o1OoCWB8kphM69F5
UrXGNLuA9b5STlk2BR8nrAWbf0aNVyjA/W1JLSXRfFh5ZZrEkbWK2eMKPObvnGT1
d1fDg+6qYQ==
=Vgca
-----END PGP SIGNATURE-----
Merge tag 'for-5.12/io_uring-2021-02-25' of git://git.kernel.dk/linux-block
Pull more io_uring updates from Jens Axboe:
"A collection of later fixes that we should get into this release:
- Series of submission cleanups (Pavel)
- A few fixes for issues from earlier this merge window (Pavel, me)
- IOPOLL resubmission fix
- task_work locking fix (Hao)"
* tag 'for-5.12/io_uring-2021-02-25' of git://git.kernel.dk/linux-block: (25 commits)
Revert "io_uring: wait potential ->release() on resurrect"
io_uring: fix locked_free_list caches_free()
io_uring: don't attempt IO reissue from the ring exit path
io_uring: clear request count when freeing caches
io_uring: run task_work on io_uring_register()
io_uring: fix leaving invalid req->flags
io_uring: wait potential ->release() on resurrect
io_uring: keep generic rsrc infra generic
io_uring: zero ref_node after killing it
io_uring: make the !CONFIG_NET helpers a bit more robust
io_uring: don't hold uring_lock when calling io_run_task_work*
io_uring: fail io-wq submission from a task_work
io_uring: don't take uring_lock during iowq cancel
io_uring: fail links more in io_submit_sqe()
io_uring: don't do async setup for links' heads
io_uring: do io_*_prep() early in io_submit_sqe()
io_uring: split sqe-prep and async setup
io_uring: don't submit link on error
io_uring: move req link into submit_state
io_uring: move io_init_req() into io_submit_sqe()
...
Merge more updates from Andrew Morton:
"118 patches:
- The rest of MM.
Includes kfence - another runtime memory validator. Not as thorough
as KASAN, but it has unmeasurable overhead and is intended to be
usable in production builds.
- Everything else
Subsystems affected by this patch series: alpha, procfs, sysctl,
misc, core-kernel, MAINTAINERS, lib, bitops, checkpatch, init,
coredump, seq_file, gdb, ubsan, initramfs, and mm (thp, cma,
vmstat, memory-hotplug, mlock, rmap, zswap, zsmalloc, cleanups,
kfence, kasan2, and pagemap2)"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (118 commits)
MIPS: make userspace mapping young by default
initramfs: panic with memory information
ubsan: remove overflow checks
kgdb: fix to kill breakpoints on initmem after boot
scripts/gdb: fix list_for_each
x86: fix seq_file iteration for pat/memtype.c
seq_file: document how per-entry resources are managed.
fs/coredump: use kmap_local_page()
init/Kconfig: fix a typo in CC_VERSION_TEXT help text
init: clean up early_param_on_off() macro
init/version.c: remove Version_<LINUX_VERSION_CODE> symbol
checkpatch: do not apply "initialise globals to 0" check to BPF progs
checkpatch: don't warn about colon termination in linker scripts
checkpatch: add kmalloc_array_node to unnecessary OOM message check
checkpatch: add warning for avoiding .L prefix symbols in assembly files
checkpatch: improve TYPECAST_INT_CONSTANT test message
checkpatch: prefer ftrace over function entry/exit printks
checkpatch: trivial style fixes
checkpatch: ignore warning designated initializers using NR_CPUS
checkpatch: improve blank line after declaration test
...
In dump_user_range() there is no reason for the mapping to be global. Use
kmap_local_page() rather than kmap.
Link: https://lkml.kernel.org/r/20210203223328.558945-1-ira.weiny@intel.com
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since
sysctl: pass kernel pointers to ->proc_handler
we have been pre-allocating a buffer to copy the data from the proc
handlers into, and then copying that to userspace. The problem is this
just blindly kzalloc()'s the buffer size passed in from the read, which in
the case of our 'cat' binary was 64kib. Order-4 allocations are not
awesome, and since we can potentially allocate up to our maximum order, so
use kvzalloc for these buffers.
[willy@infradead.org: changelog tweaks]
Link: https://lkml.kernel.org/r/6345270a2c1160b89dd5e6715461f388176899d1.1612972413.git.josef@toxicpanda.com
Fixes: 32927393dc ("sysctl: pass kernel pointers to ->proc_handler")
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
CC: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
To resolve the symbol fuction name for wchan, use the printk format
specifier %ps instead of manually looking up the symbol function name
via lookup_symbol_name().
Link: https://lkml.kernel.org/r/20201217165413.GA1959@ls3530.fritz.box
Signed-off-by: Helge Deller <deller@gmx.de>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Enhance mapping_seek_hole_data() to handle partially uptodate pages and
convert the iomap seek code to call it.
Link: https://lkml.kernel.org/r/20201112212641.27837-9-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Cc: Yang Shi <yang.shi@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- New Features:
- Support for eager writes, and the write=eager and write=wait mount options
- Other Bugfixes and Cleanups:
- Fix typos in some comments
- Fix up fall-through warnings for Clang
- Cleanups to the NFS readpage codepath
- Remove FMR support in rpcrdma_convert_iovs()
- Various other cleanups to xprtrdma
- Fix xprtrdma pad optimization for servers that don't support RFC 8797
- Improvements to rpcrdma tracepoints
- Fix up nfs4_bitmask_adjust()
- Optimize sparse writes past the end of files
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAmAwOLwACgkQ18tUv7Cl
QOsUfw//W2KoJ+2IQohQNFcoi+bG1OQE7jnqHtQ+tsKfpJKemcDcu8wQEAqrwALg
vXioG1Ye0QU7P5PZtNxCorylqSTVGvJSIOrfa3lTdn/PDbI7NIgN52w56TzzfeXn
pJ4gDwZzPwUFUblF0LBQUIhJv5IQvOXVgUsMqezbIbMXSiuLR/bjnZ96Q/woKpoL
eg2IZ5EO9Jb0QjuQ1e9U303X7c2qOl1jzpxyQLQfD7ONnWBx3HnJk1l+3JJRi8JV
smnae3I0L3nUZ7rBqoqsvK7YUjUchCEBvkmEMsnHT94D5tI9mxxX5OquREee6QHn
NuJRSNbsIiCD3Ne27fkCut78d6SetoMko7jZ97T6smhyijtXJiLG/6dycMPV9rt/
bVIudWMm9/A9AsXyY2YP5LC6Y6W6dhQRXygUjVgEPBl6kVsb2Eca8IA9QZghF9IL
+XSEulASvxo2rWPylJJ+3aLynfqoHrowVN/Tu61svDnJWTcb+FCxQ5zyLox7erEH
mUhraf1D0uoX9odH1069toN6favZFE6SIDvlUk1QTOjr6p3Jxmkuyl6PNs5t66/S
550z5JVb2deIHOPQxOie7xz/Dk6dnRoaFhTNq/Ootkt9GNe0A+NqSUdoRA5XxN5m
wW11ecLSZSehDksuXjyFmkHtkagLreFxLsHbVnaAtwEm7h/thRI=
=Dssn
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-5.12-1' of git://git.linux-nfs.org/projects/anna/linux-nfs
Pull NFS Client Updates from Anna Schumaker:
"New Features:
- Support for eager writes, and the write=eager and write=wait mount
options
- Other Bugfixes and Cleanups:
- Fix typos in some comments
- Fix up fall-through warnings for Clang
- Cleanups to the NFS readpage codepath
- Remove FMR support in rpcrdma_convert_iovs()
- Various other cleanups to xprtrdma
- Fix xprtrdma pad optimization for servers that don't support
RFC 8797
- Improvements to rpcrdma tracepoints
- Fix up nfs4_bitmask_adjust()
- Optimize sparse writes past the end of files"
* tag 'nfs-for-5.12-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (27 commits)
NFS: Support the '-owrite=' option in /proc/self/mounts and mountinfo
NFS: Set the stable writes flag when initialising the super block
NFS: Add mount options supporting eager writes
NFS: Add support for eager writes
NFS: 'flags' field should be unsigned in struct nfs_server
NFS: Don't set NFS_INO_INVALID_XATTR if there is no xattr cache
NFS: Always clear an invalid mapping when attempting a buffered write
NFS: Optimise sparse writes past the end of file
NFS: Fix documenting comment for nfs_revalidate_file_size()
NFSv4: Fixes for nfs4_bitmask_adjust()
xprtrdma: Clean up rpcrdma_prepare_readch()
rpcrdma: Capture bytes received in Receive completion tracepoints
xprtrdma: Pad optimization, revisited
rpcrdma: Fix comments about reverse-direction operation
xprtrdma: Refactor invocations of offset_in_page()
xprtrdma: Simplify rpcrdma_convert_kvec() and frwr_map()
xprtrdma: Remove FMR support in rpcrdma_convert_iovs()
NFS: Add nfs_pageio_complete_read() and remove nfs_readpage_async()
NFS: Call readpage_async_filler() from nfs_readpage_async()
NFS: Refactor nfs_readpage() and nfs_readpage_async() to use nfs_readdesc
...
The iterator, ITER_DISCARD, that can only be used in READ mode and
just discards any data copied to it, was added to allow a network
filesystem to discard any unwanted data sent by a server.
Convert cifs_discard_from_socket() to use this.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Some servers seem to mistakenly report different values for
capabilities and share flags, so we can't always rely on those values
to decide whether the resolved target can handle any new DFS
referrals.
Add a new helper is_referral_server() to check if all resolved targets
can handle new DFS referrals by directly looking at the
GET_DFS_REFERRAL.ReferralHeaderFlags value as specified in MS-DFSC
2.2.4 RESP_GET_DFS_REFERRAL in addition to is_tcon_dfs().
Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Cc: stable@vger.kernel.org # 5.11
Signed-off-by: Steve French <stfrench@microsoft.com>
Handle the case where a resolved target share is like
//server/users/dir, and the user "foo" has no read permission to
access the parent folder "users" but has access to the final path
component "dir".
is_path_remote() already implements that, so call it directly.
Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Cc: stable@vger.kernel.org # 5.11
Signed-off-by: Steve French <stfrench@microsoft.com>
In do_dfs_failover(), the mount_get_conns() function requires the full
fs context in order to get new connection to server, so clone the
original context and change it accordingly when retrying the DFS
targets in the referral.
If failover was successful, then update original context with the new
UNC, prefix path and ip address.
Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Cc: stable@vger.kernel.org # 5.11
Signed-off-by: Steve French <stfrench@microsoft.com>
Skip DFS resolving when mounting with 'nodfs' even if
CONFIG_CIFS_DFS_UPCALL is enabled.
Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Cc: stable@vger.kernel.org # 5.11
Reviewed-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Passwords can contain ',' which are also used as the separator between
mount options. Mount.cifs will escape all ',' characters as the string ",,".
Update parsing of the mount options to detect ",," and treat it as a single
'c' character.
Fixes: 24e0a1eff9 ("cifs: switch to new mount api")
Cc: stable@vger.kernel.org # 5.11
Reported-by: Simon Taylor <simon@simon-taylor.me.uk>
Tested-by: Simon Taylor <simon@simon-taylor.me.uk>
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
cycle...
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAmA3qgEACgkQ8vlZVpUN
gaMHqwf+IGiMeyB3BnIbQZjcwgyxRYMMmod0tcGv4yK4bf85+rTZZMcPZm5Ioqwm
+XLnuRW+x3Do6lnzNZj9p+qnPNwX2YGupwGeAoT5puMDKl0J7HhvdqPcNi4WgucW
K+LCAxbPjVFUAkpD2gwgNgXej4Us1QP/93GGG8YhTd+jiYS7axrRbAlE+4SmCsye
83dEWpXSFDE0qXHoaPwLo26LMPySPjWM/UuiWD8ozQ+mcL0Doecw1tWJottSo+Ds
+cJtxiOPVPTFwZUJOt/qoxfPnALrqAeazS+3lGmqK5c3xbvsc9CiYqP15FUbQF08
eCDMo9bpqQdRs4WmaRSvBzV8v+/TXg==
=+amQ
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 updates from Ted Ts'o:
"Miscellaneous ext4 cleanups and bug fixes. Pretty boring this cycle..."
* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: add .kunitconfig fragment to enable ext4-specific tests
ext: EXT4_KUNIT_TESTS should depend on EXT4_FS instead of selecting it
ext4: reset retry counter when ext4_alloc_file_blocks() makes progress
ext4: fix potential htree index checksum corruption
ext4: factor out htree rep invariant check
ext4: Change list_for_each* to list_for_each_entry*
ext4: don't try to processed freed blocks until mballoc is initialized
ext4: use DEFINE_MUTEX() for mutex lock
The new optional mount parameter "acregmax" allows a different
timeout for file metadata ("acdirmax" now allows controlling timeout
for directory metadata). Setting "actimeo" still works as before,
and changes timeout for both files and directories, but
specifying "acregmax" or "acdirmax" allows overriding the
default more granularly which can be a big performance benefit
on some workloads. "acregmax" is already used by NFS as a mount
parameter (albeit with a larger default and thus looser caching).
Suggested-by: Tom Talpey <tom@talpey.com>
Reviewed-By: Tom Talpey <tom@talpey.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
The new optional mount parm, "acdirmax" allows caching the metadata
for a directory longer than file metadata, which can be very helpful
for performance. Convert cifs_inode_needs_reval to check acdirmax
for revalidating directory metadata.
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Reviewed-By: Tom Talpey <tom@talpey.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
nfs and cifs on Linux currently have a mount parameter "actimeo" to control
metadata (attribute) caching but cifs does not have additional mount
parameters to allow distinguishing between caching directory metadata
(e.g. needed to revalidate paths) and that for files.
Add new mount parameter "acdirmax" to allow caching metadata for
directories more loosely than file data. NFS adjusts metadata
caching from acdirmin to acdirmax (and another two mount parms
for files) but to reduce complexity, it is safer to just introduce
the one mount parm to allow caching directories longer. The
defaults for acdirmax and actimeo (for cifs.ko) are conservative,
1 second (NFS defaults acdirmax to 60 seconds). For many workloads,
setting acdirmax to a higher value is safe and will improve
performance. This patch leaves unchanged the default values
for caching metadata for files and directories but gives the
user more flexibility in adjusting them safely for their workload
via the new mount parm.
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Reviewed-By: Tom Talpey <tom@talpey.com>
Just like the changes for io-wq, ensure that we re-fork the SQPOLL
thread if the owner execs. Mark the ctx sq thread as sqo_exec if
it dies, and the ring as needing a wakeup which will force the task
to enter the kernel. When it does, setup the new thread and proceed
as usual.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
exec will cancel any threads, including the ones that io-wq is using. This
isn't a problem, in fact we'd prefer it to be that way since it means we
know that any async work cancels naturally without having to handle it
proactively.
But it does mean that we need to setup a new manager, as the manager and
workers are gone. Handle this at queue time, and cancel work if we fail.
Since the manager can go away without us noticing, ensure that the manager
itself holds a reference to the 'wq' as well. Rename io_wq_destroy() to
io_wq_put() to reflect that.
In the future we can now simplify exec cancelation handling, for now just
leave it the same.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
syzbot reports the following hang:
INFO: task syz-executor.0:12538 can't die for more than 143 seconds.
task:syz-executor.0 state:D stack:28352 pid:12538 ppid: 8423 flags:0x00004004
Call Trace:
context_switch kernel/sched/core.c:4324 [inline]
__schedule+0x90c/0x21a0 kernel/sched/core.c:5075
schedule+0xcf/0x270 kernel/sched/core.c:5154
schedule_timeout+0x1db/0x250 kernel/time/timer.c:1868
do_wait_for_common kernel/sched/completion.c:85 [inline]
__wait_for_common kernel/sched/completion.c:106 [inline]
wait_for_common kernel/sched/completion.c:117 [inline]
wait_for_completion+0x168/0x270 kernel/sched/completion.c:138
io_sq_thread_finish+0x96/0x580 fs/io_uring.c:7152
io_sq_offload_create fs/io_uring.c:7929 [inline]
io_uring_create fs/io_uring.c:9465 [inline]
io_uring_setup+0x1fb2/0x2c20 fs/io_uring.c:9550
do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x44/0xae
which is due to exiting after the SQPOLL thread has been created, but
hasn't been started yet. Ensure that we always complete the startup
side when waiting for it to exit.
Reported-by: syzbot+c927c937cba8ef66dd4a@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Before the io-wq thread change, we maintained a hash work map and lock
per-node per-ring. That wasn't ideal, as we really wanted it to be per
ring. But now that we have per-task workers, the hash map ends up being
just per-task. That'll work just fine for the normal case of having
one task use a ring, but if you share the ring between tasks, then it's
considerably worse than it was before.
Make the hash map per ctx instead, which provides full per-ctx buffered
write serialization on hashed writes.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Because the iomap code using PF_MEMALLOC_NOFS to detect transaction
recursion in XFS is just wrong. Remove it from the iomap code and
replace it with XFS specific internal checks using
current->journal_info instead.
[djwong: This change also realigns the lifetime of NOFS flag changes to
match the incore transaction, instead of the inconsistent scheme we have
now.]
Fixes: 9070733b4e ("xfs: abstract PF_FSTRANS to PF_MEMALLOC_NOFS")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Brian Foster reported a lockdep warning on xfs/167:
============================================
WARNING: possible recursive locking detected
5.11.0-rc4 #35 Tainted: G W I
--------------------------------------------
fsstress/17733 is trying to acquire lock:
ffff8e0fd1d90650 (sb_internal){++++}-{0:0}, at: xfs_free_eofblocks+0x104/0x1d0 [xfs]
but task is already holding lock:
ffff8e0fd1d90650 (sb_internal){++++}-{0:0}, at: xfs_trans_alloc_inode+0x5f/0x160 [xfs]
stack backtrace:
CPU: 38 PID: 17733 Comm: fsstress Tainted: G W I 5.11.0-rc4 #35
Hardware name: Dell Inc. PowerEdge R740/01KPX8, BIOS 1.6.11 11/20/2018
Call Trace:
dump_stack+0x8b/0xb0
__lock_acquire.cold+0x159/0x2ab
lock_acquire+0x116/0x370
xfs_trans_alloc+0x1ad/0x310 [xfs]
xfs_free_eofblocks+0x104/0x1d0 [xfs]
xfs_blockgc_scan_inode+0x24/0x60 [xfs]
xfs_inode_walk_ag+0x202/0x4b0 [xfs]
xfs_inode_walk+0x66/0xc0 [xfs]
xfs_trans_alloc+0x160/0x310 [xfs]
xfs_trans_alloc_inode+0x5f/0x160 [xfs]
xfs_alloc_file_space+0x105/0x300 [xfs]
xfs_file_fallocate+0x270/0x460 [xfs]
vfs_fallocate+0x14d/0x3d0
__x64_sys_fallocate+0x3e/0x70
do_syscall_64+0x33/0x40
entry_SYSCALL_64_after_hwframe+0x44/0xa9
The cause of this is the new code that spurs a scan to garbage collect
speculative preallocations if we fail to reserve enough blocks while
allocating a transaction. While the warning itself is a fairly benign
lockdep complaint, it does expose a potential livelock if the rwsem
behavior ever changes with regards to nesting read locks when someone's
waiting for a write lock.
Fix this by freeing the transaction and jumping back to xfs_trans_alloc
like this patch in the V4 submission[1].
[1] https://lore.kernel.org/linux-xfs/161142798066.2171939.9311024588681972086.stgit@magnolia/
Fixes: a1a7d05a05 ("xfs: flush speculative space allocations when we run out of space")
Reported-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Freed extents are marked busy from the point the freeing transaction
commits until the associated CIL context is checkpointed to the log.
This prevents reuse and overwrite of recently freed blocks before
the changes are committed to disk, which can lead to corruption
after a crash. The exception to this rule is that metadata
allocation is allowed to reuse busy extents because metadata changes
are also logged.
As of commit 97d3ac75e5 ("xfs: exact busy extent tracking"), XFS
has allowed modification or complete invalidation of outstanding
busy extents for metadata allocations. This implementation assumes
that use of the associated extent is imminent, which is not always
the case. For example, the trimmed extent might not satisfy the
minimum length of the allocation request, or the allocation
algorithm might be involved in a search for the optimal result based
on locality.
generic/019 reproduces a corruption caused by this scenario. First,
a metadata block (usually a bmbt or symlink block) is freed from an
inode. A subsequent bmbt split on an unrelated inode attempts a near
mode allocation request that invalidates the busy block during the
search, but does not ultimately allocate it. Due to the busy state
invalidation, the block is no longer considered busy to subsequent
allocation. A direct I/O write request immediately allocates the
block and writes to it. Finally, the filesystem crashes while in a
state where the initial metadata block free had not committed to the
on-disk log. After recovery, the original metadata block is in its
original location as expected, but has been corrupted by the
aforementioned dio.
This demonstrates that it is fundamentally unsafe to modify busy
extent state for extents that are not guaranteed to be allocated.
This applies to pretty much all of the code paths that currently
trim busy extents for one reason or another. Therefore to address
this problem, drop the reuse mechanism from the busy extent trim
path. This code already knows how to return partial non-busy ranges
of the targeted free extent and higher level code tracks the busy
state of the allocation attempt. If a block allocation fails where
one or more candidate extents is busy, we force the log and retry
the allocation.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
This reverts commit 88f171ab77.
I ran into a case where the ref resurrect now spins, so revert
this change for now until we can further investigate why it's
broken. The bug seems to indicate spinning on the lock itself,
likely there's some ABBA deadlock involved:
[<0>] __percpu_ref_switch_mode+0x45/0x180
[<0>] percpu_ref_resurrect+0x46/0x70
[<0>] io_refs_resurrect+0x25/0xa0
[<0>] __io_uring_register+0x135/0x10c0
[<0>] __x64_sys_io_uring_register+0xc2/0x1a0
[<0>] do_syscall_64+0x42/0x110
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Merge misc updates from Andrew Morton:
"A few small subsystems and some of MM.
172 patches.
Subsystems affected by this patch series: hexagon, scripts, ntfs,
ocfs2, vfs, and mm (slab-generic, slab, slub, debug, pagecache, swap,
memcg, pagemap, mprotect, mremap, page-reporting, vmalloc, kasan,
pagealloc, memory-failure, hugetlb, vmscan, z3fold, compaction,
mempolicy, oom-kill, hugetlbfs, and migration)"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (172 commits)
mm/migrate: remove unneeded semicolons
hugetlbfs: remove unneeded return value of hugetlb_vmtruncate()
hugetlbfs: fix some comment typos
hugetlbfs: correct some obsolete comments about inode i_mutex
hugetlbfs: make hugepage size conversion more readable
hugetlbfs: remove meaningless variable avoid_reserve
hugetlbfs: correct obsolete function name in hugetlbfs_read_iter()
hugetlbfs: use helper macro default_hstate in init_hugetlbfs_fs
hugetlbfs: remove useless BUG_ON(!inode) in hugetlbfs_setattr()
hugetlbfs: remove special hugetlbfs_set_page_dirty()
mm/hugetlb: change hugetlb_reserve_pages() to type bool
mm, oom: fix a comment in dump_task()
mm/mempolicy: use helper range_in_vma() in queue_pages_test_walk()
numa balancing: migrate on fault among multiple bound nodes
mm, compaction: make fast_isolate_freepages() stay within zone
mm/compaction: fix misbehaviors of fast_find_migrateblock()
mm/compaction: correct deferral logic for proactive compaction
mm/compaction: remove duplicated VM_BUG_ON_PAGE !PageLocked
mm/compaction: remove rcu_read_lock during page compaction
z3fold: simplify the zhdr initialization code in init_z3fold_page()
...
The function hugetlb_vmtruncate() is guaranteed to always success since
commit 7aa91e1040 ("hugetlb: allow extending ftruncate on hugetlbfs").
So we should remove the unneeded return value which is always 0.
Link: https://lkml.kernel.org/r/20210208084637.47789-1-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix typos reserv to reserve, minimim to minimum. No functional change
intended.
Link: https://lkml.kernel.org/r/20210130092351.28072-1-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since commit 9902af79c0 ("parallel lookups: actual switch to rwsem"),
i_mutex of inode is converted to i_rwsem. So replace i_mutex with i_rwsem
to make comments up to date.
Link: https://lkml.kernel.org/r/20210127093111.36672-1-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The calculation 1U << (h->order + PAGE_SHIFT - 10) is actually equal to
(PAGE_SHIFT << (h->order)) >> 10. So we can make it more readable by
replace it with huge_page_size(h) >> 10.
Link: https://lkml.kernel.org/r/20210122083141.24548-1-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The variable avoid_reserve is meaningless because we never changed its
value and just passed it to alloc_huge_page(). So remove it to make code
more clear that in hugetlbfs_fallocate, we never avoid reserve when alloc
hugepage yet. Also add a comment offered by Mike Kravetz to explain this.
Link: https://lkml.kernel.org/r/20210120071508.9078-1-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>