Pull crypto update from Herbert Xu:
"API:
- Add helper for simple skcipher modes.
- Add helper to register multiple templates.
- Set CRYPTO_TFM_NEED_KEY when setkey fails.
- Require neither or both of export/import in shash.
- AEAD decryption test vectors are now generated from encryption
ones.
- New option CONFIG_CRYPTO_MANAGER_EXTRA_TESTS that includes random
fuzzing.
Algorithms:
- Conversions to skcipher and helper for many templates.
- Add more test vectors for nhpoly1305 and adiantum.
Drivers:
- Add crypto4xx prng support.
- Add xcbc/cmac/ecb support in caam.
- Add AES support for Exynos5433 in s5p.
- Remove sha384/sha512 from artpec7 as hardware cannot do partial
hash"
[ There is a merge of the Freescale SoC tree in order to pull in changes
required by patches to the caam/qi2 driver. ]
* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (174 commits)
crypto: s5p - add AES support for Exynos5433
dt-bindings: crypto: document Exynos5433 SlimSSS
crypto: crypto4xx - add missing of_node_put after of_device_is_available
crypto: cavium/zip - fix collision with generic cra_driver_name
crypto: af_alg - use struct_size() in sock_kfree_s()
crypto: caam - remove redundant likely/unlikely annotation
crypto: s5p - update iv after AES-CBC op end
crypto: x86/poly1305 - Clear key material from stack in SSE2 variant
crypto: caam - generate hash keys in-place
crypto: caam - fix DMA mapping xcbc key twice
crypto: caam - fix hash context DMA unmap size
hwrng: bcm2835 - fix probe as platform device
crypto: s5p-sss - Use AES_BLOCK_SIZE define instead of number
crypto: stm32 - drop pointless static qualifier in stm32_hash_remove()
crypto: chelsio - Fixed Traffic Stall
crypto: marvell - Remove set but not used variable 'ivsize'
crypto: ccp - Update driver messages to remove some confusion
crypto: adiantum - add 1536 and 4096-byte test vectors
crypto: nhpoly1305 - add a test vector with len % 16 != 0
crypto: arm/aes-ce - update IV after partial final CTR block
...
Pull networking updates from David Miller:
"Here we go, another merge window full of networking and #ebpf changes:
1) Snoop DHCPACKS in batman-adv to learn MAC/IP pairs in the DHCP
range without dealing with floods of ARP traffic, from Linus
Lüssing.
2) Throttle buffered multicast packet transmission in mt76, from
Felix Fietkau.
3) Support adaptive interrupt moderation in ice, from Brett Creeley.
4) A lot of struct_size conversions, from Gustavo A. R. Silva.
5) Add peek/push/pop commands to bpftool, as well as bash completion,
from Stanislav Fomichev.
6) Optimize sk_msg_clone(), from Vakul Garg.
7) Add SO_BINDTOIFINDEX, from David Herrmann.
8) Be more conservative with local resends due to local congestion,
from Yuchung Cheng.
9) Allow vetoing of unsupported VXLAN FDBs, from Petr Machata.
10) Add health buffer support to devlink, from Eran Ben Elisha.
11) Add TXQ scheduling API to mac80211, from Toke Høiland-Jørgensen.
12) Add statistics to basic packet scheduler filter, from Cong Wang.
13) Add GRE tunnel support for mlxsw Spectrum-2, from Nir Dotan.
14) Lots of new IP tunneling forwarding tests, also from Nir Dotan.
15) Add 3ad stats to bonding, from Nikolay Aleksandrov.
16) Lots of probing improvements for bpftool, from Quentin Monnet.
17) Various nfp drive #ebpf JIT improvements from Jakub Kicinski.
18) Allow #ebpf programs to access gso_segs from skb shared info, from
Eric Dumazet.
19) Add sock_diag support for AF_XDP sockets, from Björn Töpel.
20) Support 22260 iwlwifi devices, from Luca Coelho.
21) Use rbtree for ipv6 defragmentation, from Peter Oskolkov.
22) Add JMP32 instruction class support to #ebpf, from Jiong Wang.
23) Add spinlock support to #ebpf, from Alexei Starovoitov.
24) Support 256-bit keys and TLS 1.3 in ktls, from Dave Watson.
25) Add device infomation API to devlink, from Jakub Kicinski.
26) Add new timestamping socket options which are y2038 safe, from
Deepa Dinamani.
27) Add RX checksum offloading for various sh_eth chips, from Sergei
Shtylyov.
28) Flow offload infrastructure, from Pablo Neira Ayuso.
29) Numerous cleanups, improvements, and bug fixes to the PHY layer
and many drivers from Heiner Kallweit.
30) Lots of changes to try and make packet scheduler classifiers run
lockless as much as possible, from Vlad Buslov.
31) Support BCM957504 chip in bnxt_en driver, from Erik Burrows.
32) Add concurrency tests to tc-tests infrastructure, from Vlad
Buslov.
33) Add hwmon support to aquantia, from Heiner Kallweit.
34) Allow 64-bit values for SO_MAX_PACING_RATE, from Eric Dumazet.
And I would be remiss if I didn't thank the various major networking
subsystem maintainers for integrating much of this work before I even
saw it. Alexei Starovoitov, Daniel Borkmann, Pablo Neira Ayuso,
Johannes Berg, Kalle Valo, and many others. Thank you!"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (2207 commits)
net/sched: avoid unused-label warning
net: ignore sysctl_devconf_inherit_init_net without SYSCTL
phy: mdio-mux: fix Kconfig dependencies
net: phy: use phy_modify_mmd_changed in genphy_c45_an_config_aneg
net: dsa: mv88e6xxx: add call to mv88e6xxx_ports_cmode_init to probe for new DSA framework
selftest/net: Remove duplicate header
sky2: Disable MSI on Dell Inspiron 1545 and Gateway P-79
net/mlx5e: Update tx reporter status in case channels were successfully opened
devlink: Add support for direct reporter health state update
devlink: Update reporter state to error even if recover aborted
sctp: call iov_iter_revert() after sending ABORT
team: Free BPF filter when unregistering netdev
ip6mr: Do not call __IP6_INC_STATS() from preemptible context
isdn: mISDN: Fix potential NULL pointer dereference of kzalloc
net: dsa: mv88e6xxx: support in-band signalling on SGMII ports with external PHYs
cxgb4/chtls: Prefix adapter flags with CXGB4
net-sysfs: Switch to bitmap_zalloc()
mellanox: Switch to bitmap_zalloc()
bpf: add test cases for non-pointer sanitiation logic
mlxsw: i2c: Extend initialization by querying resources data
...
The kill() syscall operates on process identifiers (pid). After a process
has exited its pid can be reused by another process. If a caller sends a
signal to a reused pid it will end up signaling the wrong process. This
issue has often surfaced and there has been a push to address this problem [1].
This patch uses file descriptors (fd) from proc/<pid> as stable handles on
struct pid. Even if a pid is recycled the handle will not change. The fd
can be used to send signals to the process it refers to.
Thus, the new syscall pidfd_send_signal() is introduced to solve this
problem. Instead of pids it operates on process fds (pidfd).
/* prototype and argument /*
long pidfd_send_signal(int pidfd, int sig, siginfo_t *info, unsigned int flags);
/* syscall number 424 */
The syscall number was chosen to be 424 to align with Arnd's rework in his
y2038 to minimize merge conflicts (cf. [25]).
In addition to the pidfd and signal argument it takes an additional
siginfo_t and flags argument. If the siginfo_t argument is NULL then
pidfd_send_signal() is equivalent to kill(<positive-pid>, <signal>). If it
is not NULL pidfd_send_signal() is equivalent to rt_sigqueueinfo().
The flags argument is added to allow for future extensions of this syscall.
It currently needs to be passed as 0. Failing to do so will cause EINVAL.
/* pidfd_send_signal() replaces multiple pid-based syscalls */
The pidfd_send_signal() syscall currently takes on the job of
rt_sigqueueinfo(2) and parts of the functionality of kill(2), Namely, when a
positive pid is passed to kill(2). It will however be possible to also
replace tgkill(2) and rt_tgsigqueueinfo(2) if this syscall is extended.
/* sending signals to threads (tid) and process groups (pgid) */
Specifically, the pidfd_send_signal() syscall does currently not operate on
process groups or threads. This is left for future extensions.
In order to extend the syscall to allow sending signal to threads and
process groups appropriately named flags (e.g. PIDFD_TYPE_PGID, and
PIDFD_TYPE_TID) should be added. This implies that the flags argument will
determine what is signaled and not the file descriptor itself. Put in other
words, grouping in this api is a property of the flags argument not a
property of the file descriptor (cf. [13]). Clarification for this has been
requested by Eric (cf. [19]).
When appropriate extensions through the flags argument are added then
pidfd_send_signal() can additionally replace the part of kill(2) which
operates on process groups as well as the tgkill(2) and
rt_tgsigqueueinfo(2) syscalls.
How such an extension could be implemented has been very roughly sketched
in [14], [15], and [16]. However, this should not be taken as a commitment
to a particular implementation. There might be better ways to do it.
Right now this is intentionally left out to keep this patchset as simple as
possible (cf. [4]).
/* naming */
The syscall had various names throughout iterations of this patchset:
- procfd_signal()
- procfd_send_signal()
- taskfd_send_signal()
In the last round of reviews it was pointed out that given that if the
flags argument decides the scope of the signal instead of different types
of fds it might make sense to either settle for "procfd_" or "pidfd_" as
prefix. The community was willing to accept either (cf. [17] and [18]).
Given that one developer expressed strong preference for the "pidfd_"
prefix (cf. [13]) and with other developers less opinionated about the name
we should settle for "pidfd_" to avoid further bikeshedding.
The "_send_signal" suffix was chosen to reflect the fact that the syscall
takes on the job of multiple syscalls. It is therefore intentional that the
name is not reminiscent of neither kill(2) nor rt_sigqueueinfo(2). Not the
fomer because it might imply that pidfd_send_signal() is a replacement for
kill(2), and not the latter because it is a hassle to remember the correct
spelling - especially for non-native speakers - and because it is not
descriptive enough of what the syscall actually does. The name
"pidfd_send_signal" makes it very clear that its job is to send signals.
/* zombies */
Zombies can be signaled just as any other process. No special error will be
reported since a zombie state is an unreliable state (cf. [3]). However,
this can be added as an extension through the @flags argument if the need
ever arises.
/* cross-namespace signals */
The patch currently enforces that the signaler and signalee either are in
the same pid namespace or that the signaler's pid namespace is an ancestor
of the signalee's pid namespace. This is done for the sake of simplicity
and because it is unclear to what values certain members of struct
siginfo_t would need to be set to (cf. [5], [6]).
/* compat syscalls */
It became clear that we would like to avoid adding compat syscalls
(cf. [7]). The compat syscall handling is now done in kernel/signal.c
itself by adding __copy_siginfo_from_user_generic() which lets us avoid
compat syscalls (cf. [8]). It should be noted that the addition of
__copy_siginfo_from_user_any() is caused by a bug in the original
implementation of rt_sigqueueinfo(2) (cf. 12).
With upcoming rework for syscall handling things might improve
significantly (cf. [11]) and __copy_siginfo_from_user_any() will not gain
any additional callers.
/* testing */
This patch was tested on x64 and x86.
/* userspace usage */
An asciinema recording for the basic functionality can be found under [9].
With this patch a process can be killed via:
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>
static inline int do_pidfd_send_signal(int pidfd, int sig, siginfo_t *info,
unsigned int flags)
{
#ifdef __NR_pidfd_send_signal
return syscall(__NR_pidfd_send_signal, pidfd, sig, info, flags);
#else
return -ENOSYS;
#endif
}
int main(int argc, char *argv[])
{
int fd, ret, saved_errno, sig;
if (argc < 3)
exit(EXIT_FAILURE);
fd = open(argv[1], O_DIRECTORY | O_CLOEXEC);
if (fd < 0) {
printf("%s - Failed to open \"%s\"\n", strerror(errno), argv[1]);
exit(EXIT_FAILURE);
}
sig = atoi(argv[2]);
printf("Sending signal %d to process %s\n", sig, argv[1]);
ret = do_pidfd_send_signal(fd, sig, NULL, 0);
saved_errno = errno;
close(fd);
errno = saved_errno;
if (ret < 0) {
printf("%s - Failed to send signal %d to process %s\n",
strerror(errno), sig, argv[1]);
exit(EXIT_FAILURE);
}
exit(EXIT_SUCCESS);
}
/* Q&A
* Given that it seems the same questions get asked again by people who are
* late to the party it makes sense to add a Q&A section to the commit
* message so it's hopefully easier to avoid duplicate threads.
*
* For the sake of progress please consider these arguments settled unless
* there is a new point that desperately needs to be addressed. Please make
* sure to check the links to the threads in this commit message whether
* this has not already been covered.
*/
Q-01: (Florian Weimer [20], Andrew Morton [21])
What happens when the target process has exited?
A-01: Sending the signal will fail with ESRCH (cf. [22]).
Q-02: (Andrew Morton [21])
Is the task_struct pinned by the fd?
A-02: No. A reference to struct pid is kept. struct pid - as far as I
understand - was created exactly for the reason to not require to
pin struct task_struct (cf. [22]).
Q-03: (Andrew Morton [21])
Does the entire procfs directory remain visible? Just one entry
within it?
A-03: The same thing that happens right now when you hold a file descriptor
to /proc/<pid> open (cf. [22]).
Q-04: (Andrew Morton [21])
Does the pid remain reserved?
A-04: No. This patchset guarantees a stable handle not that pids are not
recycled (cf. [22]).
Q-05: (Andrew Morton [21])
Do attempts to signal that fd return errors?
A-05: See {Q,A}-01.
Q-06: (Andrew Morton [22])
Is there a cleaner way of obtaining the fd? Another syscall perhaps.
A-06: Userspace can already trivially retrieve file descriptors from procfs
so this is something that we will need to support anyway. Hence,
there's no immediate need to add another syscalls just to make
pidfd_send_signal() not dependent on the presence of procfs. However,
adding a syscalls to get such file descriptors is planned for a
future patchset (cf. [22]).
Q-07: (Andrew Morton [21] and others)
This fd-for-a-process sounds like a handy thing and people may well
think up other uses for it in the future, probably unrelated to
signals. Are the code and the interface designed to permit such
future applications?
A-07: Yes (cf. [22]).
Q-08: (Andrew Morton [21] and others)
Now I think about it, why a new syscall? This thing is looking
rather like an ioctl?
A-08: This has been extensively discussed. It was agreed that a syscall is
preferred for a variety or reasons. Here are just a few taken from
prior threads. Syscalls are safer than ioctl()s especially when
signaling to fds. Processes are a core kernel concept so a syscall
seems more appropriate. The layout of the syscall with its four
arguments would require the addition of a custom struct for the
ioctl() thereby causing at least the same amount or even more
complexity for userspace than a simple syscall. The new syscall will
replace multiple other pid-based syscalls (see description above).
The file-descriptors-for-processes concept introduced with this
syscall will be extended with other syscalls in the future. See also
[22], [23] and various other threads already linked in here.
Q-09: (Florian Weimer [24])
What happens if you use the new interface with an O_PATH descriptor?
A-09:
pidfds opened as O_PATH fds cannot be used to send signals to a
process (cf. [2]). Signaling processes through pidfds is the
equivalent of writing to a file. Thus, this is not an operation that
operates "purely at the file descriptor level" as required by the
open(2) manpage. See also [4].
/* References */
[1]: https://lore.kernel.org/lkml/20181029221037.87724-1-dancol@google.com/
[2]: https://lore.kernel.org/lkml/874lbtjvtd.fsf@oldenburg2.str.redhat.com/
[3]: https://lore.kernel.org/lkml/20181204132604.aspfupwjgjx6fhva@brauner.io/
[4]: https://lore.kernel.org/lkml/20181203180224.fkvw4kajtbvru2ku@brauner.io/
[5]: https://lore.kernel.org/lkml/20181121213946.GA10795@mail.hallyn.com/
[6]: https://lore.kernel.org/lkml/20181120103111.etlqp7zop34v6nv4@brauner.io/
[7]: https://lore.kernel.org/lkml/36323361-90BD-41AF-AB5B-EE0D7BA02C21@amacapital.net/
[8]: https://lore.kernel.org/lkml/87tvjxp8pc.fsf@xmission.com/
[9]: https://asciinema.org/a/IQjuCHew6bnq1cr78yuMv16cy
[11]: https://lore.kernel.org/lkml/F53D6D38-3521-4C20-9034-5AF447DF62FF@amacapital.net/
[12]: https://lore.kernel.org/lkml/87zhtjn8ck.fsf@xmission.com/
[13]: https://lore.kernel.org/lkml/871s6u9z6u.fsf@xmission.com/
[14]: https://lore.kernel.org/lkml/20181206231742.xxi4ghn24z4h2qki@brauner.io/
[15]: https://lore.kernel.org/lkml/20181207003124.GA11160@mail.hallyn.com/
[16]: https://lore.kernel.org/lkml/20181207015423.4miorx43l3qhppfz@brauner.io/
[17]: https://lore.kernel.org/lkml/CAGXu5jL8PciZAXvOvCeCU3wKUEB_dU-O3q0tDw4uB_ojMvDEew@mail.gmail.com/
[18]: https://lore.kernel.org/lkml/20181206222746.GB9224@mail.hallyn.com/
[19]: https://lore.kernel.org/lkml/20181208054059.19813-1-christian@brauner.io/
[20]: https://lore.kernel.org/lkml/8736rebl9s.fsf@oldenburg.str.redhat.com/
[21]: https://lore.kernel.org/lkml/20181228152012.dbf0508c2508138efc5f2bbe@linux-foundation.org/
[22]: https://lore.kernel.org/lkml/20181228233725.722tdfgijxcssg76@brauner.io/
[23]: https://lwn.net/Articles/773459/
[24]: https://lore.kernel.org/lkml/8736rebl9s.fsf@oldenburg.str.redhat.com/
[25]: https://lore.kernel.org/lkml/CAK8P3a0ej9NcJM8wXNPbcGUyOUZYX+VLoDFdbenW3s3114oQZw@mail.gmail.com/
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Jann Horn <jannh@google.com>
Cc: Andy Lutomirsky <luto@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Florian Weimer <fweimer@redhat.com>
Signed-off-by: Christian Brauner <christian@brauner.io>
Reviewed-by: Tycho Andersen <tycho@tycho.ws>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: David Howells <dhowells@redhat.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Serge Hallyn <serge@hallyn.com>
Acked-by: Aleksa Sarai <cyphar@cyphar.com>
There are a couple places where we still account for 4 bytes
in the beginning of SMB2 packet which is not true in the current
code. Fix this to use a header preamble size where possible.
Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Even if a response is malformed, we should count credits
granted by the server to avoid miscalculations and unnecessary
reconnects due to client or server bugs. If the response has
been received partially, the session will be reconnected anyway
on the next iteration of the demultiplex thread, so counting
credits for such cases shouldn't break things.
Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Currently we only skip credits logging on reconnects. When
unmounting a share the number of credits on the client doesn't
matter, so skip logging in such cases too.
Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Currently we skip setting a read error to -EIO if a stored
result is -ENODATA and a response hasn't been received. With
the recent changes in read error processing there shouldn't be
cases when -ENODATA is set without a response from the server,
so reset the error to -EIO unconditionally.
Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
When we hit failures during constructing MIDs or sending PDUs
through the network, we end up not using message IDs assigned
to the packet. The next SMB packet will skip those message IDs
and continue with the next one. This behavior may lead to a server
not granting us credits until we use the skipped IDs. Fix this by
reverting the current ID to the original value if any errors occur
before we push the packet through the network stack.
This patch fixes the generic/310 test from the xfs-tests.
Cc: <stable@vger.kernel.org> # 4.19.x
Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
If we try large I/O (read or write) immediately after mount
we won't typically have enough credits because we only request
large amounts of credits on the first session setup. So if
large I/O is attempted soon after mount we will typically only
have about 43 credits rather than 105 credits (with this patch)
available for the large i/o (which needs 64 credits minimum).
This patch requests more credits during tree connect, which
helps ensure that we have enough credits when mount completes
(between these requests and the first session setup) in order
to start large I/O immediately after mount if needed.
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
We negotiate rsize mounts (and it can be overridden by user) to
typically 4MB, so using larger default I/O sizes from userspace
(changing to 1MB default i/o size returned by stat) the
performance is much better (and not just for long latency
network connections) in most use cases for SMB3 than the default I/O
size (which ends up being 128K for cp and can be even smaller for cp).
This can be 4x slower or worse depending on network latency.
By changing inode->blocksize from 32K (which was perhaps ok
for very old SMB1/CIFS) to a larger value, 1MB (but still less than
max size negotiated with the server which is 4MB, in order to minimize
risk) it significantly increases performance for the
noncached case, and slightly increases it for the cached case.
This can be changed by the user on mount (specifying bsize=
values from 16K to 16MB) to tune better for performance
for applications that depend on blocksize.
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
CC: Stable <stable@vger.kernel.org>
Currently on lease break the client sets a caching level twice:
when oplock is detected and when oplock is processed. While the
1st attempt sets the level to the value provided by the server,
the 2nd one resets the level to None unconditionally.
This happens because the oplock/lease processing code was changed
to avoid races between page cache flushes and oplock breaks.
The commit c11f1df500 ("cifs: Wait for writebacks to complete
before attempting write.") fixed the races for oplocks but didn't
apply the same changes for leases resulting in overwriting the
server granted value to None. Fix this by properly processing
lease breaks.
Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
CC: Stable <stable@vger.kernel.org>
/proc/fs/cifs/Stats bytes_read was double counting reads when
uncached (ie mounted with cache=none)
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
BUGZILLA: https://bugzilla.kernel.org/show_bug.cgi?id=202007
When deleting an xattr/EA:
SMB2/3 servers will return SUCCESS when clients delete non-existing EAs.
This means that we need to first QUERY the server and check if the EA
exists or not so that we can return -ENODATA correctly when this happens.
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
We should add any credits granted to us from unmatched server responses.
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
a trivial patch that replaces all use of snprintf with scnprintf.
scnprintf() is generally seen as a safer function to use than
snprintf for many use cases.
In our case, there is no actual difference between the two since we never
look at the return value. Thus we did not have any of the bugs that
scnprintf protects against and the patch does nothing.
However, for people reading our code it will be a receipt that we
have done our due dilligence and checked our code for this type of bugs.
See the presentation "Making C Less Dangerous In The Linux Kernel"
at this years LCA
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
There is a NULL pointer dereference of devname in strspn()
The oops looks something like:
CIFS: Attempting to mount (null)
BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
...
RIP: 0010:strspn+0x0/0x50
...
Call Trace:
? cifs_parse_mount_options+0x222/0x1710 [cifs]
? cifs_get_volume_info+0x2f/0x80 [cifs]
cifs_setup_volume_info+0x20/0x190 [cifs]
cifs_get_volume_info+0x50/0x80 [cifs]
cifs_smb3_do_mount+0x59/0x630 [cifs]
? ida_alloc_range+0x34b/0x3d0
cifs_do_mount+0x11/0x20 [cifs]
mount_fs+0x52/0x170
vfs_kern_mount+0x6b/0x170
do_mount+0x216/0xdc0
ksys_mount+0x83/0xd0
__x64_sys_mount+0x25/0x30
do_syscall_64+0x65/0x220
entry_SYSCALL_64_after_hwframe+0x49/0xbe
Fix this by adding a NULL check on devname in cifs_parse_devname()
Signed-off-by: Yao Liu <yotta.liu@ucloud.cn>
Signed-off-by: Steve French <stfrench@microsoft.com>
If we don't find a writable file handle when retrying writepages
we break of the loop and do not unlock and put pages neither from
wdata2 nor from the original wdata. Fix this by walking through
all the remaining pages and cleanup them properly.
Cc: <stable@vger.kernel.org>
Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
The current implementation of splice() and tee() ignores O_NONBLOCK set
on pipe file descriptors and checks only the SPLICE_F_NONBLOCK flag for
blocking on pipe arguments. This is inconsistent since splice()-ing
from/to non-pipe file descriptors does take O_NONBLOCK into
consideration.
Fix this by promoting O_NONBLOCK, when set on a pipe, to
SPLICE_F_NONBLOCK.
Some context for how the current implementation of splice() leads to
inconsistent behavior. In the ongoing work[1] to add VM tracing
capability to trace-cmd we stream tracing data over named FIFOs or
vsockets from guests back to the host.
When we receive SIGINT from user to stop tracing, we set O_NONBLOCK on
the input file descriptor and set SPLICE_F_NONBLOCK for the next call to
splice(). If splice() was blocked waiting on data from the input FIFO,
after SIGINT splice() restarts with the same arguments (no
SPLICE_F_NONBLOCK) and blocks again instead of returning -EAGAIN when no
data is available.
This differs from the splice() behavior when reading from a vsocket or
when we're doing a traditional read()/write() loop (trace-cmd's
--nosplice argument).
With this patch applied we get the same behavior in all situations after
setting O_NONBLOCK which also matches the behavior of doing a
read()/write() loop instead of splice().
This change does have potential of breaking users who don't expect
EAGAIN from splice() when SPLICE_F_NONBLOCK is not set. OTOH programs
that set O_NONBLOCK and don't anticipate EAGAIN are arguably buggy[2].
[1] https://github.com/skaslev/trace-cmd/tree/vsock
[2] d47e3da175/fs/read_write.c (L1425)
Signed-off-by: Slavomir Kaslev <kaslevs@vmware.com>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull vfs fixes from Al Viro:
"Assorted fixes that sat in -next for a while, all over the place"
* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
aio: Fix locking in aio_poll()
exec: Fix mem leak in kernel_read_file
copy_mount_string: Limit string length to PATH_MAX
cgroup: saner refcounting for cgroup_root
fix cgroup_do_mount() handling of failure exits
Every in-kernel use of this function defined it to KERNEL_DS (either as
an actual define, or as an inline function). It's an entirely
historical artifact, and long long long ago used to actually read the
segment selector valueof '%ds' on x86.
Which in the kernel is always KERNEL_DS.
Inspired by a patch from Jann Horn that just did this for a very small
subset of users (the ones in fs/), along with Al who suggested a script.
I then just took it to the logical extreme and removed all the remaining
gunk.
Roughly scripted with
git grep -l '(get_ds())' -- :^tools/ | xargs sed -i 's/(get_ds())/(KERNEL_DS)/'
git grep -lw 'get_ds' -- :^tools/ | xargs sed -i '/^#define get_ds()/d'
plus manual fixups to remove a few unusual usage patterns, the couple of
inline function cases and to fix up a comment that had become stale.
The 'get_ds()' function remains in an x86 kvm selftest, since in user
space it actually does something relevant.
Inspired-by: Jann Horn <jannh@google.com>
Inspired-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Al Viro root-caused a race where the IOCB_CMD_POLL handling of
fget/fput() could cause us to access the file pointer after it had
already been freed:
"In more details - normally IOCB_CMD_POLL handling looks so:
1) io_submit(2) allocates aio_kiocb instance and passes it to
aio_poll()
2) aio_poll() resolves the descriptor to struct file by req->file =
fget(iocb->aio_fildes)
3) aio_poll() sets ->woken to false and raises ->ki_refcnt of that
aio_kiocb to 2 (bumps by 1, that is).
4) aio_poll() calls vfs_poll(). After sanity checks (basically,
"poll_wait() had been called and only once") it locks the queue.
That's what the extra reference to iocb had been for - we know we
can safely access it.
5) With queue locked, we check if ->woken has already been set to
true (by aio_poll_wake()) and, if it had been, we unlock the
queue, drop a reference to aio_kiocb and bugger off - at that
point it's a responsibility to aio_poll_wake() and the stuff
called/scheduled by it. That code will drop the reference to file
in req->file, along with the other reference to our aio_kiocb.
6) otherwise, we see whether we need to wait. If we do, we unlock the
queue, drop one reference to aio_kiocb and go away - eventual
wakeup (or cancel) will deal with the reference to file and with
the other reference to aio_kiocb
7) otherwise we remove ourselves from waitqueue (still under the
queue lock), so that wakeup won't get us. No async activity will
be happening, so we can safely drop req->file and iocb ourselves.
If wakeup happens while we are in vfs_poll(), we are fine - aio_kiocb
won't get freed under us, so we can do all the checks and locking
safely. And we don't touch ->file if we detect that case.
However, vfs_poll() most certainly *does* touch the file it had been
given. So wakeup coming while we are still in ->poll() might end up
doing fput() on that file. That case is not too rare, and usually we
are saved by the still present reference from descriptor table - that
fput() is not the final one.
But if another thread closes that descriptor right after our fget()
and wakeup does happen before ->poll() returns, we are in trouble -
final fput() done while we are in the middle of a method:
Al also wrote a patch to take an extra reference to the file descriptor
to fix this, but I instead suggested we just streamline the whole file
pointer handling by submit_io() so that the generic aio submission code
simply keeps the file pointer around until the aio has completed.
Fixes: bfe4037e72 ("aio: implement IOCB_CMD_POLL")
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Reported-by: syzbot+503d4cc169fcec1cb18c@syzkaller.appspotmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use inode->i_lock to protect i_size_write(), else i_size_read() in
generic_fillattr() may loop infinitely in read_seqcount_begin() when
multiple processes invoke v9fs_vfs_getattr() or v9fs_vfs_getattr_dotl()
simultaneously under 32-bit SMP environment, and a soft lockup will be
triggered as show below:
watchdog: BUG: soft lockup - CPU#5 stuck for 22s! [stat:2217]
Modules linked in:
CPU: 5 PID: 2217 Comm: stat Not tainted 5.0.0-rc1-00005-g7f702faf5a9e #4
Hardware name: Generic DT based system
PC is at generic_fillattr+0x104/0x108
LR is at 0xec497f00
pc : [<802b8898>] lr : [<ec497f00>] psr: 200c0013
sp : ec497e20 ip : ed608030 fp : ec497e3c
r10: 00000000 r9 : ec497f00 r8 : ed608030
r7 : ec497ebc r6 : ec497f00 r5 : ee5c1550 r4 : ee005780
r3 : 0000052d r2 : 00000000 r1 : ec497f00 r0 : ed608030
Flags: nzCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment none
Control: 10c5387d Table: ac48006a DAC: 00000051
CPU: 5 PID: 2217 Comm: stat Not tainted 5.0.0-rc1-00005-g7f702faf5a9e #4
Hardware name: Generic DT based system
Backtrace:
[<8010d974>] (dump_backtrace) from [<8010dc88>] (show_stack+0x20/0x24)
[<8010dc68>] (show_stack) from [<80a1d194>] (dump_stack+0xb0/0xdc)
[<80a1d0e4>] (dump_stack) from [<80109f34>] (show_regs+0x1c/0x20)
[<80109f18>] (show_regs) from [<801d0a80>] (watchdog_timer_fn+0x280/0x2f8)
[<801d0800>] (watchdog_timer_fn) from [<80198658>] (__hrtimer_run_queues+0x18c/0x380)
[<801984cc>] (__hrtimer_run_queues) from [<80198e60>] (hrtimer_run_queues+0xb8/0xf0)
[<80198da8>] (hrtimer_run_queues) from [<801973e8>] (run_local_timers+0x28/0x64)
[<801973c0>] (run_local_timers) from [<80197460>] (update_process_times+0x3c/0x6c)
[<80197424>] (update_process_times) from [<801ab2b8>] (tick_nohz_handler+0xe0/0x1bc)
[<801ab1d8>] (tick_nohz_handler) from [<80843050>] (arch_timer_handler_virt+0x38/0x48)
[<80843018>] (arch_timer_handler_virt) from [<80180a64>] (handle_percpu_devid_irq+0x8c/0x240)
[<801809d8>] (handle_percpu_devid_irq) from [<8017ac20>] (generic_handle_irq+0x34/0x44)
[<8017abec>] (generic_handle_irq) from [<8017b344>] (__handle_domain_irq+0x6c/0xc4)
[<8017b2d8>] (__handle_domain_irq) from [<801022e0>] (gic_handle_irq+0x4c/0x88)
[<80102294>] (gic_handle_irq) from [<80101a30>] (__irq_svc+0x70/0x98)
[<802b8794>] (generic_fillattr) from [<8056b284>] (v9fs_vfs_getattr_dotl+0x74/0xa4)
[<8056b210>] (v9fs_vfs_getattr_dotl) from [<802b8904>] (vfs_getattr_nosec+0x68/0x7c)
[<802b889c>] (vfs_getattr_nosec) from [<802b895c>] (vfs_getattr+0x44/0x48)
[<802b8918>] (vfs_getattr) from [<802b8a74>] (vfs_statx+0x9c/0xec)
[<802b89d8>] (vfs_statx) from [<802b9428>] (sys_lstat64+0x48/0x78)
[<802b93e0>] (sys_lstat64) from [<80101000>] (ret_fast_syscall+0x0/0x28)
[dominique.martinet@cea.fr: updated comment to not refer to a function
in another subsystem]
Link: http://lkml.kernel.org/r/20190124063514.8571-2-houtao1@huawei.com
Cc: stable@vger.kernel.org
Fixes: 7549ae3e81 ("9p: Use the i_size_[read, write]() macros instead of using inode->i_size directly.")
Reported-by: Xing Gaopeng <xingaopeng@huawei.com>
Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Dominique Martinet <dominique.martinet@cea.fr>
Users can still control this value explicitly using the
max_session_cb_slots module parameter, but let's bump the default
up to 16 for now.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Get rid of the redundant parameter and rename the function
ff_layout_mirror_valid() to ff_layout_init_mirror_ds() for clarity.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
nfs4_ff_alloc_deviceid_node() guarantees that if mirror->mirror_ds is
a valid pointer, then so is mirror->mirror_ds->ds.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Pass in a pointer to the mirror rather than having to retrieve it from
the array and then verify the resulting pointer.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
If we notice that a DS may be down, we should attempt to read from the
other mirrors first before we go back to retry the dead DS.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
If the DS is unresponsive, we want to just mark it as such, while
reporting the errors. If the server later returns the same deviceid
in a new layout, then we don't want to have to look it up again.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
In ff_layout_mirror_valid() we may not want to invalidate the layout
segment despite the call to GETDEVICEINFO failing. The reason is that
a read may still be able to make progress on another mirror.
So instead we let the caller (in this case nfs4_ff_layout_prepare_ds())
decide whether or not it needs to invalidate.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
While we may want to skip attempting to connect to a downed mirror
when we're deciding which mirror to select for a read, we do not
want to do so once we've committed to attempting the I/O in
ff_layout_read/write_pagelist(), or ff_layout_initiate_commit()
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
When a read to the preferred mirror returns an error, the flexfiles
driver records the error in the inode list and currently marks the
layout for return before failing over the attempted read to the next
mirror.
What we actually want to do is fire off a LAYOUTERROR to notify the
MDS that there is an issue with the preferred mirror, then we fail
over. Only once we've failed to read from all mirrors should we
return the layout.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
The radix tree would rewind the index in an iterator to the lowest index
of a multi-slot entry. The XArray iterators instead leave the index
unchanged, but I overlooked that when converting DAX from the radix tree
to the XArray. Adjust the index that we use for flushing to the start
of the PMD range.
Fixes: c1901cd33c ("page cache: Convert find_get_entries_tag to XArray")
Cc: <stable@vger.kernel.org>
Reported-by: Piotr Balcer <piotr.balcer@intel.com>
Tested-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
If a layout segment gets invalidated while a pNFS I/O operation
is queued for transmission, then we ideally want to abort
immediately. This is particularly the case when there is a large
number of I/O related RPCs queued in the RPC layer, and the layout
segment gets invalidated due to an ENOSPC error, or an EACCES (because
the client was fenced). We may end up forced to spam the MDS with a
lot of otherwise unnecessary LAYOUTERRORs after that I/O fails.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Fix the memory barriers in nfs4_mark_deviceid_unavailable() and
nfs4_test_deviceid_unavailable().
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
If the attempt to instantiate the mirror's layout DS pointer failed,
then that pointer may hold a value of type ERR_PTR(), so we need
to check that before we dereference it.
Fixes: 65990d1afb ("pNFS/flexfiles: Fix a deadlock on LAYOUTGET")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
These really should have been there from the beginning, but we never
noticed because there was enough slack in the RPC request for the extra
bytes. Chuck's recent patch to use au_cslack and au_rslack to compute
buffer size shrunk the buffer enough that this was now a problem for
SEEK operations on my test client.
Fixes: f4ac1674f5 ("nfs: Add ALLOCATE support")
Fixes: 2e72448b07 ("NFS: Add COPY nfs operation")
Fixes: cb95deea0b ("NFS OFFLOAD_CANCEL xdr")
Fixes: 624bd5b7b6 ("nfs: Add DEALLOCATE support")
Fixes: 1c6dcbe5ce ("NFS: Implement SEEK")
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Ensure that if we call nfs41_sequence_process() a second time for the
same rpc_task, then we only process the results once.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
If we have to retransmit a request, we should ensure that we reinitialise
the sequence results structure, since in the event of a signal
we need to treat the request as if it had not been sent.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: stable@vger.kernel.org
Merge misc fixes from Andrew Morton:
"2 fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
hugetlbfs: fix races and page leaks during migration
kasan: turn off asan-stack for clang-8 and earlier
hugetlb pages should only be migrated if they are 'active'. The
routines set/clear_page_huge_active() modify the active state of hugetlb
pages.
When a new hugetlb page is allocated at fault time, set_page_huge_active
is called before the page is locked. Therefore, another thread could
race and migrate the page while it is being added to page table by the
fault code. This race is somewhat hard to trigger, but can be seen by
strategically adding udelay to simulate worst case scheduling behavior.
Depending on 'how' the code races, various BUG()s could be triggered.
To address this issue, simply delay the set_page_huge_active call until
after the page is successfully added to the page table.
Hugetlb pages can also be leaked at migration time if the pages are
associated with a file in an explicitly mounted hugetlbfs filesystem.
For example, consider a two node system with 4GB worth of huge pages
available. A program mmaps a 2G file in a hugetlbfs filesystem. It
then migrates the pages associated with the file from one node to
another. When the program exits, huge page counts are as follows:
node0
1024 free_hugepages
1024 nr_hugepages
node1
0 free_hugepages
1024 nr_hugepages
Filesystem Size Used Avail Use% Mounted on
nodev 4.0G 2.0G 2.0G 50% /var/opt/hugepool
That is as expected. 2G of huge pages are taken from the free_hugepages
counts, and 2G is the size of the file in the explicitly mounted
filesystem. If the file is then removed, the counts become:
node0
1024 free_hugepages
1024 nr_hugepages
node1
1024 free_hugepages
1024 nr_hugepages
Filesystem Size Used Avail Use% Mounted on
nodev 4.0G 2.0G 2.0G 50% /var/opt/hugepool
Note that the filesystem still shows 2G of pages used, while there
actually are no huge pages in use. The only way to 'fix' the filesystem
accounting is to unmount the filesystem
If a hugetlb page is associated with an explicitly mounted filesystem,
this information in contained in the page_private field. At migration
time, this information is not preserved. To fix, simply transfer
page_private from old to new page at migration time if necessary.
There is a related race with removing a huge page from a file and
migration. When a huge page is removed from the pagecache, the
page_mapping() field is cleared, yet page_private remains set until the
page is actually freed by free_huge_page(). A page could be migrated
while in this state. However, since page_mapping() is not set the
hugetlbfs specific routine to transfer page_private is not called and we
leak the page count in the filesystem.
To fix that, check for this condition before migrating a huge page. If
the condition is detected, return EBUSY for the page.
Link: http://lkml.kernel.org/r/74510272-7319-7372-9ea6-ec914734c179@oracle.com
Link: http://lkml.kernel.org/r/20190212221400.3512-1-mike.kravetz@oracle.com
Fixes: bcc5422230 ("mm: hugetlb: introduce page_huge_active")
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: <stable@vger.kernel.org>
[mike.kravetz@oracle.com: v2]
Link: http://lkml.kernel.org/r/7534d322-d782-8ac6-1c8d-a8dc380eb3ab@oracle.com
[mike.kravetz@oracle.com: update comment and changelog]
Link: http://lkml.kernel.org/r/420bcfd6-158b-38e4-98da-26d0cd85bd01@oracle.com
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
statx(2) notes that any attribute that is not indicated as supported by
stx_attributes_mask has no usable value. Commit 5f955f26f3 ("xfs: report
crtime and attribute flags to statx") added support for informing userspace
of extra file attributes but forgot to list these flags as supported
making reporting them rather useless for the pedantic userspace author.
$ git describe --contains 5f955f26f3
v4.11-rc6~5^2^2~2
Fixes: 5f955f26f3 ("xfs: report crtime and attribute flags to statx")
Signed-off-by: Luis R. Rodriguez <mcgrof@kernel.org>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
[darrick: add a comment reminding people to keep attributes_mask up to date]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
In jbd2_get_transaction, a new transaction is initialized,
and set to the j_running_transaction. No need for a return
value, so remove it.
Also, adjust some comments to match the actual operation
of this function.
Signed-off-by: Liu Song <liu.song11@zte.com.cn>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
In jbd2_journal_commit_transaction(), if we are in abort mode,
we may flush the buffer without setting descriptor block checksum
by goto start_journal_io. Then fs is mounted,
jbd2_descriptor_block_csum_verify() failed.
[ 271.379811] EXT4-fs (vdd): shut down requested (2)
[ 271.381827] Aborting journal on device vdd-8.
[ 271.597136] JBD2: Invalid checksum recovering block 22199 in log
[ 271.598023] JBD2: recovery failed
[ 271.598484] EXT4-fs (vdd): error loading journal
Fix this problem by keep setting descriptor block checksum if the
descriptor buffer is not NULL.
This checksum problem can be reproduced by xfstests generic/388.
Signed-off-by: luojiajun <luojiajun3@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Ext4 may not free clusters correctly when punching holes in bigalloc
file systems under high load conditions. If it's not possible to
extend and restart the journal in ext4_ext_rm_leaf() when preparing to
remove blocks from a punched region, a retry of the entire punch
operation is triggered in ext4_ext_remove_space(). This causes a
partial cluster to be set to the first cluster in the extent found to
the right of the punched region. However, if the punch operation
prior to the retry had made enough progress to delete one or more
extents and a partial cluster candidate for freeing had already been
recorded, the retry would overwrite the partial cluster. The loss of
this information makes it impossible to correctly free the original
partial cluster in all cases.
This bug can cause generic/476 to fail when run as part of
xfstests-bld's bigalloc and bigalloc_1k test cases. The failure is
reported when e2fsck detects bad iblocks counts greater than expected
in units of whole clusters and also detects a number of negative block
bitmap differences equal to the iblocks discrepancy in cluster units.
Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
guard_bio_eod() can truncate a segment in bio to allow it to do IO on
odd last sectors of a device.
It already checks if the IO starts past EOD, but it does not consider
the possibility of an IO request starting within device boundaries can
contain more than one segment past EOD.
In such cases, truncated_bytes can be bigger than PAGE_SIZE, and will
underflow bvec->bv_len.
Fix this by checking if truncated_bytes is lower than PAGE_SIZE.
This situation has been found on filesystems such as isofs and vfat,
which doesn't check the device size before mount, if the device is
smaller than the filesystem itself, a readahead on such filesystem,
which spans EOD, can trigger this situation, leading a call to
zero_user() with a wrong size possibly corrupting memory.
I didn't see any crash, or didn't let the system run long enough to
check if memory corruption will be hit somewhere, but adding
instrumentation to guard_bio_end() to check truncated_bytes size, was
enough to see the error.
The following script can trigger the error.
MNT=/mnt
IMG=./DISK.img
DEV=/dev/loop0
mkfs.vfat $IMG
mount $IMG $MNT
cp -R /etc $MNT &> /dev/null
umount $MNT
losetup -D
losetup --find --show --sizelimit 16247280 $IMG
mount $DEV $MNT
find $MNT -type f -exec cat {} + >/dev/null
Kudos to Eric Sandeen for coming up with the reproducer above
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We'll use this for the POLL implementation. Regular requests will
NOT be using references, so initialize it to 0. Any real use of
the io_kiocb ref will initialize it to at least 2.
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This enables an application to do IO, without ever entering the kernel.
By using the SQ ring to fill in new sqes and watching for completions
on the CQ ring, we can submit and reap IOs without doing a single system
call. The kernel side thread will poll for new submissions, and in case
of HIPRI/polled IO, it'll also poll for completions.
By default, we allow 1 second of active spinning. This can by changed
by passing in a different grace period at io_uring_register(2) time.
If the thread exceeds this idle time without having any work to do, it
will set:
sq_ring->flags |= IORING_SQ_NEED_WAKEUP.
The application will have to call io_uring_enter() to start things back
up again. If IO is kept busy, that will never be needed. Basically an
application that has this feature enabled will guard it's
io_uring_enter(2) call with:
read_barrier();
if (*sq_ring->flags & IORING_SQ_NEED_WAKEUP)
io_uring_enter(fd, 0, 0, IORING_ENTER_SQ_WAKEUP);
instead of calling it unconditionally.
It's mandatory to use fixed files with this feature. Failure to do so
will result in the application getting an -EBADF CQ entry when
submitting IO.
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We normally have to fget/fput for each IO we do on a file. Even with
the batching we do, the cost of the atomic inc/dec of the file usage
count adds up.
This adds IORING_REGISTER_FILES, and IORING_UNREGISTER_FILES opcodes
for the io_uring_register(2) system call. The arguments passed in must
be an array of __s32 holding file descriptors, and nr_args should hold
the number of file descriptors the application wishes to pin for the
duration of the io_uring instance (or until IORING_UNREGISTER_FILES is
called).
When used, the application must set IOSQE_FIXED_FILE in the sqe->flags
member. Then, instead of setting sqe->fd to the real fd, it sets sqe->fd
to the index in the array passed in to IORING_REGISTER_FILES.
Files are automatically unregistered when the io_uring instance is torn
down. An application need only unregister if it wishes to register a new
set of fds.
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If we have fixed user buffers, we can map them into the kernel when we
setup the io_uring. That avoids the need to do get_user_pages() for
each and every IO.
To utilize this feature, the application must call io_uring_register()
after having setup an io_uring instance, passing in
IORING_REGISTER_BUFFERS as the opcode. The argument must be a pointer to
an iovec array, and the nr_args should contain how many iovecs the
application wishes to map.
If successful, these buffers are now mapped into the kernel, eligible
for IO. To use these fixed buffers, the application must use the
IORING_OP_READ_FIXED and IORING_OP_WRITE_FIXED opcodes, and then
set sqe->index to the desired buffer index. sqe->addr..sqe->addr+seq->len
must point to somewhere inside the indexed buffer.
The application may register buffers throughout the lifetime of the
io_uring instance. It can call io_uring_register() with
IORING_UNREGISTER_BUFFERS as the opcode to unregister the current set of
buffers, and then register a new set. The application need not
unregister buffers explicitly before shutting down the io_uring
instance.
It's perfectly valid to setup a larger buffer, and then sometimes only
use parts of it for an IO. As long as the range is within the originally
mapped region, it will work just fine.
For now, buffers must not be file backed. If file backed buffers are
passed in, the registration will fail with -1/EOPNOTSUPP. This
restriction may be relaxed in the future.
RLIMIT_MEMLOCK is used to check how much memory we can pin. A somewhat
arbitrary 1G per buffer size is also imposed.
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Similarly to how we use the state->ios_left to know how many references
to get to a file, we can use it to allocate the io_kiocb's we need in
bulk.
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add a separate io_submit_state structure, to cache some of the things
we need for IO submission.
One such example is file reference batching. io_submit_state. We get as
many references as the number of sqes we are submitting, and drop
unused ones if we end up switching files. The assumption here is that
we're usually only dealing with one fd, and if there are multiple,
hopefuly they are at least somewhat ordered. Could trivially be extended
to cover multiple fds, if needed.
On the completion side we do the same thing, except this is trivially
done just locally in io_iopoll_reap().
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Some uses cases repeatedly get and put references to the same file, but
the only exposed interface is doing these one at the time. As each of
these entail an atomic inc or dec on a shared structure, that cost can
add up.
Add fget_many(), which works just like fget(), except it takes an
argument for how many references to get on the file. Ditto fput_many(),
which can drop an arbitrary number of references to a file.
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add support for a polled io_uring instance. When a read or write is
submitted to a polled io_uring, the application must poll for
completions on the CQ ring through io_uring_enter(2). Polled IO may not
generate IRQ completions, hence they need to be actively found by the
application itself.
To use polling, io_uring_setup() must be used with the
IORING_SETUP_IOPOLL flag being set. It is illegal to mix and match
polled and non-polled IO on an io_uring.
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add a new fsync opcode, which either syncs a range if one is passed,
or the whole file if the offset and length fields are both cleared
to zero. A flag is provided to use fdatasync semantics, that is only
force out metadata which is required to retrieve the file data, but
not others like metadata.
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The submission queue (SQ) and completion queue (CQ) rings are shared
between the application and the kernel. This eliminates the need to
copy data back and forth to submit and complete IO.
IO submissions use the io_uring_sqe data structure, and completions
are generated in the form of io_uring_cqe data structures. The SQ
ring is an index into the io_uring_sqe array, which makes it possible
to submit a batch of IOs without them being contiguous in the ring.
The CQ ring is always contiguous, as completion events are inherently
unordered, and hence any io_uring_cqe entry can point back to an
arbitrary submission.
Two new system calls are added for this:
io_uring_setup(entries, params)
Sets up an io_uring instance for doing async IO. On success,
returns a file descriptor that the application can mmap to
gain access to the SQ ring, CQ ring, and io_uring_sqes.
io_uring_enter(fd, to_submit, min_complete, flags, sigset, sigsetsize)
Initiates IO against the rings mapped to this fd, or waits for
them to complete, or both. The behavior is controlled by the
parameters passed in. If 'to_submit' is non-zero, then we'll
try and submit new IO. If IORING_ENTER_GETEVENTS is set, the
kernel will wait for 'min_complete' events, if they aren't
already available. It's valid to set IORING_ENTER_GETEVENTS
and 'min_complete' == 0 at the same time, this allows the
kernel to return already completed events without waiting
for them. This is useful only for polling, as for IRQ
driven IO, the application can just check the CQ ring
without entering the kernel.
With this setup, it's possible to do async IO with a single system
call. Future developments will enable polled IO with this interface,
and polled submission as well. The latter will enable an application
to do IO without doing ANY system calls at all.
For IRQ driven IO, an application only needs to enter the kernel for
completions if it wants to wait for them to occur.
Each io_uring is backed by a workqueue, to support buffered async IO
as well. We will only punt to an async context if the command would
need to wait for IO on the device side. Any data that can be accessed
directly in the page cache is done inline. This avoids the slowness
issue of usual threadpools, since cached data is accessed as quickly
as a sync interface.
Sample application: http://git.kernel.dk/cgit/fio/plain/t/io_uring.c
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Alter the AFS automounting code to create and modify an fs_context struct
when parameterising a new mount triggered by an AFS mountpoint rather than
constructing device name and option strings.
Also remove the cell=, vol= and rwpath options as they are then redundant.
The reason they existed is because the 'device name' may be derived
literally from a mountpoint object in the filesystem, so default cell and
parent-type information needed to be passed in by some other method from
the automount routines. The vol= option didn't end up being used.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Eric W. Biederman <ebiederm@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Add fs_context support to the AFS filesystem, converting the parameter
parsing to store options there.
This will form the basis for namespace propagation over mountpoints within
the AFS model, thereby allowing AFS to be used in containers more easily.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Add some logging to the core users of the fs_context log so that
information can be extracted from them as to the reason for failure.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Implement the ability for filesystems to log error, warning and
informational messages through the fs_context. In the future, these will
be extractable by userspace by reading from an fd created by the fsopen()
syscall.
Error messages are prefixed with "e ", warnings with "w " and informational
messages with "i ".
In the future, inside the kernel, formatted messages will be malloc'd but
unformatted messages will not copied if they're either in the core .rodata
section or in the .rodata section of the filesystem module pinned by
fs_context::fs_type. The messages will only be good till the fs_type is
released.
Note that the logging object will be shared between duplicated fs_context
structures. This is so that such as NFS which do a mount within a mount
can get at least some of the errors from the inner mount.
Five logging functions are provided for this:
(1) void logfc(struct fs_context *fc, const char *fmt, ...);
This logs a message into the context. If the buffer is full, the
earliest message is discarded.
(2) void errorf(fc, fmt, ...);
This wraps logfc() to log an error.
(3) void invalf(fc, fmt, ...);
This wraps errorf() and returns -EINVAL for convenience.
(4) void warnf(fc, fmt, ...);
This wraps logfc() to log a warning.
(5) void infof(fc, fmt, ...);
This wraps logfc() to log an informational message.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The kern_mount_data() isn't used any more so remove it.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Convert the hugetlbfs to use the fs_context during mount.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Make kernfs support superblock creation/mount/remount with fs_context.
This requires that sysfs, cgroup and intel_rdt, which are built on kernfs,
be made to support fs_context also.
Notes:
(1) A kernfs_fs_context struct is created to wrap fs_context and the
kernfs mount parameters are moved in here (or are in fs_context).
(2) kernfs_mount{,_ns}() are made into kernfs_get_tree(). The extra
namespace tag parameter is passed in the context if desired
(3) kernfs_free_fs_context() is provided as a destructor for the
kernfs_fs_context struct, but for the moment it does nothing except
get called in the right places.
(4) sysfs doesn't wrap kernfs_fs_context since it has no parameters to
pass, but possibly this should be done anyway in case someone wants to
add a parameter in future.
(5) A cgroup_fs_context struct is created to wrap kernfs_fs_context and
the cgroup v1 and v2 mount parameters are all moved there.
(6) cgroup1 parameter parsing error messages are now handled by invalf(),
which allows userspace to collect them directly.
(7) cgroup1 parameter cleanup is now done in the context destructor rather
than in the mount/get_tree and remount functions.
Weirdies:
(*) cgroup_do_get_tree() calls cset_cgroup_from_root() with locks held,
but then uses the resulting pointer after dropping the locks. I'm
told this is okay and needs commenting.
(*) The cgroup refcount web. This really needs documenting.
(*) cgroup2 only has one root?
Add a suggestion from Thomas Gleixner in which the RDT enablement code is
placed into its own function.
[folded a leak fix from Andrey Vagin]
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
cc: Tejun Heo <tj@kernel.org>
cc: Li Zefan <lizefan@huawei.com>
cc: Johannes Weiner <hannes@cmpxchg.org>
cc: cgroups@vger.kernel.org
cc: fenghua.yu@intel.com
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Add fs_context support to procfs.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Move proc_fill_super() to fs/proc/root.c as that's where the other
superblock stuff is.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Alexey Dobriyan <adobriyan@gmail.com>
cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
new primitive: vfs_dup_fs_context(). Comes with fs_context
method (->dup()) for copying the filesystem-specific parts
of fs_context, along with LSM one (->fs_context_dup()) for
doing the same to LSM parts.
[needs better commit message, and change of Author:, anyway]
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
the former is an analogue of mount_{single,nodev} for use in
->get_tree() instances, the latter - analogue of sget() for the
same.
These are fairly similar to the originals, but the callback signature
for sget_fc() is different from sget() ones, so getting bits and
pieces shared would be too convoluted; we might get around to that
later, but for now let's just remember to keep them in sync. They
do live next to each other, and changes in either won't be hard
to spot.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
[AV - unfuck kern_mount_data(); we want non-NULL ->mnt_ns on long-living
mounts]
[AV - reordering fs/namespace.c is badly overdue, but let's keep it
separate from that series]
[AV - drop simple_pin_fs() change]
[AV - clean vfs_kern_mount() failure exits up]
Implement a filesystem context concept to be used during superblock
creation for mount and superblock reconfiguration for remount.
The mounting procedure then becomes:
(1) Allocate new fs_context context.
(2) Configure the context.
(3) Create superblock.
(4) Query the superblock.
(5) Create a mount for the superblock.
(6) Destroy the context.
Rather than calling fs_type->mount(), an fs_context struct is created and
fs_type->init_fs_context() is called to set it up. Pointers exist for the
filesystem and LSM to hang their private data off.
A set of operations has to be set by ->init_fs_context() to provide
freeing, duplication, option parsing, binary data parsing, validation,
mounting and superblock filling.
Legacy filesystems are supported by the provision of a set of legacy
fs_context operations that build up a list of mount options and then invoke
fs_type->mount() from within the fs_context ->get_tree() operation. This
allows all filesystems to be accessed using fs_context.
It should be noted that, whilst this patch adds a lot of lines of code,
there is quite a bit of duplication with existing code that can be
eliminated should all filesystems be converted over.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Because the new API passes in key,value parameters, match_token() cannot be
used with it. Instead, provide three new helpers to aid with parsing:
(1) fs_parse(). This takes a parameter and a simple static description of
all the parameters and maps the key name to an ID. It returns 1 on a
match, 0 on no match if unknowns should be ignored and some other
negative error code on a parse error.
The parameter description includes a list of key names to IDs, desired
parameter types and a list of enumeration name -> ID mappings.
[!] Note that for the moment I've required that the key->ID mapping
array is expected to be sorted and unterminated. The size of the
array is noted in the fsconfig_parser struct. This allows me to use
bsearch(), but I'm not sure any performance gain is worth the hassle
of requiring people to keep the array sorted.
The parameter type array is sized according to the number of parameter
IDs and is indexed directly. The optional enum mapping array is an
unterminated, unsorted list and the size goes into the fsconfig_parser
struct.
The function can do some additional things:
(a) If it's not ambiguous and no value is given, the prefix "no" on
a key name is permitted to indicate that the parameter should
be considered negatory.
(b) If the desired type is a single simple integer, it will perform
an appropriate conversion and store the result in a union in
the parse result.
(c) If the desired type is an enumeration, {key ID, name} will be
looked up in the enumeration list and the matching value will
be stored in the parse result union.
(d) Optionally generate an error if the key is unrecognised.
This is called something like:
enum rdt_param {
Opt_cdp,
Opt_cdpl2,
Opt_mba_mpbs,
nr__rdt_params
};
const struct fs_parameter_spec rdt_param_specs[nr__rdt_params] = {
[Opt_cdp] = { fs_param_is_bool },
[Opt_cdpl2] = { fs_param_is_bool },
[Opt_mba_mpbs] = { fs_param_is_bool },
};
const const char *const rdt_param_keys[nr__rdt_params] = {
[Opt_cdp] = "cdp",
[Opt_cdpl2] = "cdpl2",
[Opt_mba_mpbs] = "mba_mbps",
};
const struct fs_parameter_description rdt_parser = {
.name = "rdt",
.nr_params = nr__rdt_params,
.keys = rdt_param_keys,
.specs = rdt_param_specs,
.no_source = true,
};
int rdt_parse_param(struct fs_context *fc,
struct fs_parameter *param)
{
struct fs_parse_result parse;
struct rdt_fs_context *ctx = rdt_fc2context(fc);
int ret;
ret = fs_parse(fc, &rdt_parser, param, &parse);
if (ret < 0)
return ret;
switch (parse.key) {
case Opt_cdp:
ctx->enable_cdpl3 = true;
return 0;
case Opt_cdpl2:
ctx->enable_cdpl2 = true;
return 0;
case Opt_mba_mpbs:
ctx->enable_mba_mbps = true;
return 0;
}
return -EINVAL;
}
(2) fs_lookup_param(). This takes a { dirfd, path, LOOKUP_EMPTY? } or
string value and performs an appropriate path lookup to convert it
into a path object, which it will then return.
If the desired type was a blockdev, the type of the looked up inode
will be checked to make sure it is one.
This can be used like:
enum foo_param {
Opt_source,
nr__foo_params
};
const struct fs_parameter_spec foo_param_specs[nr__foo_params] = {
[Opt_source] = { fs_param_is_blockdev },
};
const char *char foo_param_keys[nr__foo_params] = {
[Opt_source] = "source",
};
const struct constant_table foo_param_alt_keys[] = {
{ "device", Opt_source },
};
const struct fs_parameter_description foo_parser = {
.name = "foo",
.nr_params = nr__foo_params,
.nr_alt_keys = ARRAY_SIZE(foo_param_alt_keys),
.keys = foo_param_keys,
.alt_keys = foo_param_alt_keys,
.specs = foo_param_specs,
};
int foo_parse_param(struct fs_context *fc,
struct fs_parameter *param)
{
struct fs_parse_result parse;
struct foo_fs_context *ctx = foo_fc2context(fc);
int ret;
ret = fs_parse(fc, &foo_parser, param, &parse);
if (ret < 0)
return ret;
switch (parse.key) {
case Opt_source:
return fs_lookup_param(fc, &foo_parser, param,
&parse, &ctx->source);
default:
return -EINVAL;
}
}
(3) lookup_constant(). This takes a table of named constants and looks up
the given name within it. The table is expected to be sorted such
that bsearch() be used upon it.
Possibly I should require the table be terminated and just use a
for-loop to scan it instead of using bsearch() to reduce hassle.
Tables look something like:
static const struct constant_table bool_names[] = {
{ "0", false },
{ "1", true },
{ "false", false },
{ "no", false },
{ "true", true },
{ "yes", true },
};
and a lookup is done with something like:
b = lookup_constant(bool_names, param->string, -1);
Additionally, optional validation routines for the parameter description
are provided that can be enabled at compile time. A later patch will
invoke these when a filesystem is registered.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Effective revert commit:
87709e28dc ("fs/locks: Use percpu_down_read_preempt_disable()")
This is causing major pain for PREEMPT_RT.
Sebastian did a lot of lockperf runs on 2 and 4 node machines with all
preemption modes (PREEMPT=n should be an obvious NOP for this patch
and thus serves as a good control) and no results showed significance
over 2-sigma (the PREEMPT=n results were almost empty at 1-sigma).
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The timer function, zstd_reclaim_timer_fn(), reschedules itself under
certain conditions. When cleaning up, take the lock and remove all
workspaces. This prevents the timer from rearming itself. Lastly, switch
to del_timer_sync() to ensure that the timer function can't trigger as
we're unloading.
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The allocation happens with GFP_KERNEL after a transaction has been
started, this can potentially cause deadlock if reclaim tries to get the
memory by flushing filesystem data.
The fs_info::qgroup_ulist is not used during transaction start when
quotas are not enabled. The status bit BTRFS_FS_QUOTA_ENABLED is set
later in btrfs_quota_enable so it's safe to move it before the
transaction start.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Previously we only updated the drop_progress key if we were in the
DROP_REFERENCE stage of snapshot deletion. This is because the
UPDATE_BACKREF stage checks the flags of the blocks it's converting to
FULL_BACKREF, so if we go over a block we processed before it doesn't
matter, we just don't do anything.
The problem is in do_walk_down() we will go ahead and drop the roots
reference to any blocks that we know we won't need to walk into.
Given subvolume A and snapshot B. The root of B points to all of the
nodes that belong to A, so all of those nodes have a refcnt > 1. If B
did not modify those blocks it'll hit this condition in do_walk_down
if (!wc->update_ref ||
generation <= root->root_key.offset)
goto skip;
and in "goto skip" we simply do a btrfs_free_extent() for that bytenr
that we point at.
Now assume we modified some data in B, and then took a snapshot of B and
call it C. C points to all the nodes in B, making every node the root
of B points to have a refcnt > 1. This assumes the root level is 2 or
higher.
We delete snapshot B, which does the above work in do_walk_down,
free'ing our ref for nodes we share with A that we didn't modify. Now
we hit a node we _did_ modify, thus we own. We need to walk down into
this node and we set wc->stage == UPDATE_BACKREF. We walk down to level
0 which we also own because we modified data. We can't walk any further
down and thus now need to walk up and start the next part of the
deletion. Now walk_up_proc is supposed to put us back into
DROP_REFERENCE, but there's an exception to this
if (level < wc->shared_level)
goto out;
we are at level == 0, and our shared_level == 1. We skip out of this
one and go up to level 1. Since path->slots[1] < nritems we
path->slots[1]++ and break out of walk_up_tree to stop our transaction
and loop back around. Now in btrfs_drop_snapshot we have this snippet
if (wc->stage == DROP_REFERENCE) {
level = wc->level;
btrfs_node_key(path->nodes[level],
&root_item->drop_progress,
path->slots[level]);
root_item->drop_level = level;
}
our stage == UPDATE_BACKREF still, so we don't update the drop_progress
key. This is a problem because we would have done btrfs_free_extent()
for the nodes leading up to our current position. If we crash or
unmount here and go to remount we'll start over where we were before and
try to free our ref for blocks we've already freed, and thus abort()
out.
Fix this by keeping track of the last place we dropped a reference for
our block in do_walk_down. Then if wc->stage == UPDATE_BACKREF we know
we'll start over from a place we meant to, and otherwise things continue
to work as they did before.
I have a complicated reproducer for this problem, without this patch
we'll fail to fsck the fs when replaying the log writes log. With this
patch we can replay the whole log without any fsck or mount failures.
The steps to reproduce this easily are sort of tricky, I had to add a
couple of debug patches to the kernel in order to make it easy,
basically I just needed to make sure we did actually commit the
transaction every time we finished a walk_down_tree/walk_up_tree combo.
The reproducer:
1) Creates a base subvolume.
2) Creates 100k files in the subvolume.
3) Snapshots the base subvolume (snap1).
4) Touches files 5000-6000 in snap1.
5) Snapshots snap1 (snap2).
6) Deletes snap1.
I do this with dm-log-writes, and then replay to every FUA in the log
and fsck the fs.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
[ copy reproducer steps ]
Signed-off-by: David Sterba <dsterba@suse.com>
There's a bug in snapshot deletion where we won't update the
drop_progress key if we're in the UPDATE_BACKREF stage. This is a
problem because we could drop refs for blocks we know don't belong to
ours. If we crash or umount at the right time we could experience
messages such as the following when snapshot deletion resumes
BTRFS error (device dm-3): unable to find ref byte nr 66797568 parent 0 root 258 owner 1 offset 0
------------[ cut here ]------------
WARNING: CPU: 3 PID: 16052 at fs/btrfs/extent-tree.c:7108 __btrfs_free_extent.isra.78+0x62c/0xb30 [btrfs]
CPU: 3 PID: 16052 Comm: umount Tainted: G W OE 5.0.0-rc4+ #147
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./890FX Deluxe5, BIOS P1.40 05/03/2011
RIP: 0010:__btrfs_free_extent.isra.78+0x62c/0xb30 [btrfs]
RSP: 0018:ffffc90005cd7b18 EFLAGS: 00010286
RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000000
RDX: ffff88842fade680 RSI: ffff88842fad6b18 RDI: ffff88842fad6b18
RBP: ffffc90005cd7bc8 R08: 0000000000000000 R09: 0000000000000001
R10: 0000000000000001 R11: ffffffff822696b8 R12: 0000000003fb4000
R13: 0000000000000001 R14: 0000000000000102 R15: ffff88819c9d67e0
FS: 00007f08bb138fc0(0000) GS:ffff88842fac0000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f8f5d861ea0 CR3: 00000003e99fe000 CR4: 00000000000006e0
Call Trace:
? _raw_spin_unlock+0x27/0x40
? btrfs_merge_delayed_refs+0x356/0x3e0 [btrfs]
__btrfs_run_delayed_refs+0x75a/0x13c0 [btrfs]
? join_transaction+0x2b/0x460 [btrfs]
btrfs_run_delayed_refs+0xf3/0x1c0 [btrfs]
btrfs_commit_transaction+0x52/0xa50 [btrfs]
? start_transaction+0xa6/0x510 [btrfs]
btrfs_sync_fs+0x79/0x1c0 [btrfs]
sync_filesystem+0x70/0x90
generic_shutdown_super+0x27/0x120
kill_anon_super+0x12/0x30
btrfs_kill_super+0x16/0xa0 [btrfs]
deactivate_locked_super+0x43/0x70
deactivate_super+0x40/0x60
cleanup_mnt+0x3f/0x80
__cleanup_mnt+0x12/0x20
task_work_run+0x8b/0xc0
exit_to_usermode_loop+0xce/0xd0
do_syscall_64+0x20b/0x210
entry_SYSCALL_64_after_hwframe+0x49/0xbe
To fix this simply mark dead roots we read from disk as DEAD and then
set the walk_control->restarted flag so we know we have a restarted
deletion. From here whenever we try to drop refs for blocks we check to
verify our ref is set on them, and if it is not we skip it. Once we
find a ref that is set we unset walk_control->restarted since the tree
should be in a normal state from then on, and any problems we run into
from there are different issues. I tested this with an existing broken
fs and my reproducer that creates a broken fs and it fixed both file
systems.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Reflinking (clone/dedupe) and rename are operations that operate on two
inodes and therefore need to lock them in the same order to avoid ABBA
deadlocks. It happens that Btrfs' reflink implementation always locked
them in a different order from VFS's lock_two_nondirectories() helper,
which is used by the rename code in VFS, resulting in ABBA type deadlocks.
Btrfs' locking order:
static void btrfs_double_inode_lock(struct inode *inode1, struct inode *inode2)
{
if (inode1 < inode2)
swap(inode1, inode2);
inode_lock_nested(inode1, I_MUTEX_PARENT);
inode_lock_nested(inode2, I_MUTEX_CHILD);
}
VFS's locking order:
void lock_two_nondirectories(struct inode *inode1, struct inode *inode2)
{
if (inode1 > inode2)
swap(inode1, inode2);
if (inode1 && !S_ISDIR(inode1->i_mode))
inode_lock(inode1);
if (inode2 && !S_ISDIR(inode2->i_mode) && inode2 != inode1)
inode_lock_nested(inode2, I_MUTEX_NONDIR2);
}
Fix this by killing the btrfs helper function that does the double inode
locking and replace it with VFS's helper lock_two_nondirectories().
Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Fixes: 416161db9b ("btrfs: offline dedupe")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In the past we had data corruption when reading compressed extents that
are shared within the same file and they are consecutive, this got fixed
by commit 005efedf2c ("Btrfs: fix read corruption of compressed and
shared extents") and by commit 808f80b467 ("Btrfs: update fix for read
corruption of compressed and shared extents"). However there was a case
that was missing in those fixes, which is when the shared and compressed
extents are referenced with a non-zero offset. The following shell script
creates a reproducer for this issue:
#!/bin/bash
mkfs.btrfs -f /dev/sdc &> /dev/null
mount -o compress /dev/sdc /mnt/sdc
# Create a file with 3 consecutive compressed extents, each has an
# uncompressed size of 128Kb and a compressed size of 4Kb.
for ((i = 1; i <= 3; i++)); do
head -c 4096 /dev/zero
for ((j = 1; j <= 31; j++)); do
head -c 4096 /dev/zero | tr '\0' "\377"
done
done > /mnt/sdc/foobar
sync
echo "Digest after file creation: $(md5sum /mnt/sdc/foobar)"
# Clone the first extent into offsets 128K and 256K.
xfs_io -c "reflink /mnt/sdc/foobar 0 128K 128K" /mnt/sdc/foobar
xfs_io -c "reflink /mnt/sdc/foobar 0 256K 128K" /mnt/sdc/foobar
sync
echo "Digest after cloning: $(md5sum /mnt/sdc/foobar)"
# Punch holes into the regions that are already full of zeroes.
xfs_io -c "fpunch 0 4K" /mnt/sdc/foobar
xfs_io -c "fpunch 128K 4K" /mnt/sdc/foobar
xfs_io -c "fpunch 256K 4K" /mnt/sdc/foobar
sync
echo "Digest after hole punching: $(md5sum /mnt/sdc/foobar)"
echo "Dropping page cache..."
sysctl -q vm.drop_caches=1
echo "Digest after hole punching: $(md5sum /mnt/sdc/foobar)"
umount /dev/sdc
When running the script we get the following output:
Digest after file creation: 5a0888d80d7ab1fd31c229f83a3bbcc8 /mnt/sdc/foobar
linked 131072/131072 bytes at offset 131072
128 KiB, 1 ops; 0.0033 sec (36.960 MiB/sec and 295.6830 ops/sec)
linked 131072/131072 bytes at offset 262144
128 KiB, 1 ops; 0.0015 sec (78.567 MiB/sec and 628.5355 ops/sec)
Digest after cloning: 5a0888d80d7ab1fd31c229f83a3bbcc8 /mnt/sdc/foobar
Digest after hole punching: 5a0888d80d7ab1fd31c229f83a3bbcc8 /mnt/sdc/foobar
Dropping page cache...
Digest after hole punching: fba694ae8664ed0c2e9ff8937e7f1484 /mnt/sdc/foobar
This happens because after reading all the pages of the extent in the
range from 128K to 256K for example, we read the hole at offset 256K
and then when reading the page at offset 260K we don't submit the
existing bio, which is responsible for filling all the page in the
range 128K to 256K only, therefore adding the pages from range 260K
to 384K to the existing bio and submitting it after iterating over the
entire range. Once the bio completes, the uncompressed data fills only
the pages in the range 128K to 256K because there's no more data read
from disk, leaving the pages in the range 260K to 384K unfilled. It is
just a slightly different variant of what was solved by commit
005efedf2c ("Btrfs: fix read corruption of compressed and shared
extents").
Fix this by forcing a bio submit, during readpages(), whenever we find a
compressed extent map for a page that is different from the extent map
for the previous page or has a different starting offset (in case it's
the same compressed extent), instead of the extent map's original start
offset.
A test case for fstests follows soon.
Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Fixes: 808f80b467 ("Btrfs: update fix for read corruption of compressed and shared extents")
Fixes: 005efedf2c ("Btrfs: fix read corruption of compressed and shared extents")
Cc: stable@vger.kernel.org # 4.3+
Tested-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When a cell with a volume location server list is added manually by
echoing the details into /proc/net/afs/cells, a record is added but the
flag saying it has been looked up isn't set.
This causes the VL server rotation code to wait forever, with the top of
/proc/pid/stack looking like:
afs_select_vlserver+0x3a6/0x6f3
afs_vl_lookup_vldb+0x4b/0x92
afs_create_volume+0x25/0x1b9
...
with the thread stuck in afs_start_vl_iteration() waiting for
AFS_CELL_FL_NO_LOOKUP_YET to be cleared.
Fix this by clearing AFS_CELL_FL_NO_LOOKUP_YET when setting up a record
if that record's details were supplied manually.
Fixes: 0a5143f2f8 ("afs: Implement VL server rotation")
Reported-by: Dave Botsch <dwb7@cornell.edu>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix a backwards endian conversion of a constant.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
smatch complained about some uninitialized error returns, so fix those.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Rework the data flow in xfs_file_iomap_begin where we decide if we have
to break shared extents.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
This reverts commit 9da3f2b740.
It was well-intentioned, but wrong. Overriding the exception tables for
instructions for random reasons is just wrong, and that is what the new
code did.
It caused problems for tracing, and it caused problems for strncpy_from_user(),
because the new checks made perfectly valid use cases break, rather than
catch things that did bad things.
Unchecked user space accesses are a problem, but that's not a reason to
add invalid checks that then people have to work around with silly flags
(in this case, that 'kernel_uaccess_faults_ok' flag, which is just an
odd way to say "this commit was wrong" and was sprinked into random
places to hide the wrongness).
The real fix to unchecked user space accesses is to get rid of the
special "let's not check __get_user() and __put_user() at all" logic.
Make __{get|put}_user() be just aliases to the regular {get|put}_user()
functions, and make it impossible to access user space without having
the proper checks in places.
The raison d'être of the special double-underscore versions used to be
that the range check was expensive, and if you did multiple user
accesses, you'd do the range check up front (like the signal frame
handling code, for example). But SMAP (on x86) and PAN (on ARM) have
made that optimization pointless, because the _real_ expense is the "set
CPU flag to allow user space access".
Do let's not break the valid cases to catch invalid cases that shouldn't
even exist.
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Kees Cook <keescook@chromium.org>
Cc: Tobin C. Harding <tobin@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Jann Horn <jannh@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Don't pass raw iomap flags to xfs_reflink_allocate_cow; signal our
intention with a boolean argument.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
There is a messy cast here:
min_t(int, len, (int)sizeof(*item)));
min_t() should normally cast to unsigned. It's not possible for "len"
to be negative, but if it were then we definitely wouldn't want to pass
negatives to read_extent_buffer(). Also there is an extra cast.
This patch shouldn't affect runtime, it's just a clean up.
Reviewed-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At ctree.c:key_search(), the assertion that verifies the first key on a
child extent buffer corresponds to the key at a specific slot in the
parent has a disadvantage: we effectively hit a BUG_ON() which requires
rebooting the machine later. It also does not tell any information about
which extent buffer is affected, from which root, the expected and found
keys, etc.
However as of commit 581c176041 ("btrfs: Validate child tree block's
level and first key"), that assertion is not needed since at the time we
read an extent buffer from disk we validate that its first key matches the
key, at the respective slot, in the parent extent buffer. Therefore just
remove the assertion at key_search().
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function map_private_extent_buffer() can return an -EINVAL error, and
it is called by generic_bin_search() which will return back the error. The
btrfs_bin_search() function in turn calls generic_bin_search() and the
key_search() function calls btrfs_bin_search(), so both can return the
-EINVAL error coming from the map_private_extent_buffer() function. Some
callers of these functions were ignoring that these functions can return
an error, so fix them to deal with error return values.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We should drop the lock on this error path. This has been found by a
static tool.
The lock needs to be released, it's there to protect access to the
dev_replace members and is not supposed to be left locked. The value of
state that's being switched would need to be artifically changed to an
invalid value so the default: branch is taken.
Fixes: d189dd70e2 ("btrfs: fix use-after-free due to race between replace start and cancel")
CC: stable@vger.kernel.org # 5.0+
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We recently had a customer issue with a corrupted filesystem. When
trying to mount this image btrfs panicked with a division by zero in
calc_stripe_length().
The corrupt chunk had a 'num_stripes' value of 1. calc_stripe_length()
takes this value and divides it by the number of copies the RAID profile
is expected to have to calculate the amount of data stripes. As a DUP
profile is expected to have 2 copies this division resulted in 1/2 = 0.
Later then the 'data_stripes' variable is used as a divisor in the
stripe length calculation which results in a division by 0 and thus a
kernel panic.
When encountering a filesystem with a DUP block group and a
'num_stripes' value unequal to 2, refuse mounting as the image is
corrupted and will lead to unexpected behaviour.
Code inspection showed a RAID1 block group has the same issues.
Fixes: e06cd3dd7c ("Btrfs: add validadtion checks for chunk loading")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The scrub_ctx csum_list member must be initialized before scrub_free_ctx
is called. If the csum_list is not initialized beforehand, the
list_empty call in scrub_free_csums will result in a null deref if the
allocation fails in the for loop.
Fixes: a2de733c78 ("btrfs: scrub")
CC: stable@vger.kernel.org # 3.0+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Dan Robertson <dan@dlrobertson.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Comparing the content of the pages in the range to deduplicate is now
done in generic_remap_checks called by the generic helper
generic_remap_file_range_prep(), which takes care of ensuring we do not
compare/deduplicate undefined data beyond a file's EOF (range from EOF
to the next block boundary). So remove these checks which are now
redundant.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
After a succession of renames operations of different files and unlinking
one of them, if we fsync one of the renamed files we can end up with a
log that will either fail to replay at mount time or result in a filesystem
that is in an inconsistent state. One example scenario:
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ mkdir /mnt/testdir
$ touch /mnt/testdir/fname1
$ touch /mnt/testdir/fname2
$ sync
$ mv /mnt/testdir/fname1 /mnt/testdir/fname3
$ rm -f /mnt/testdir/fname2
$ ln /mnt/testdir/fname3 /mnt/testdir/fname2
$ touch /mnt/testdir/fname1
$ xfs_io -c "fsync" /mnt/testdir/fname1
<power failure>
$ mount /dev/sdb /mnt
$ umount /mnt
$ btrfs check /dev/sdb
[1/7] checking root items
[2/7] checking extents
[3/7] checking free space cache
[4/7] checking fs roots
root 5 inode 259 errors 2, no orphan item
ERROR: errors found in fs roots
Opening filesystem to check...
Checking filesystem on /dev/sdc
UUID: 20e4abb8-5a19-4492-8bb4-6084125c2d0d
found 393216 bytes used, error(s) found
total csum bytes: 0
total tree bytes: 131072
total fs tree bytes: 32768
total extent tree bytes: 16384
btree space waste bytes: 122986
file data blocks allocated: 262144
referenced 262144
On a kernel without the first patch in this series, titled
"[PATCH] Btrfs: fix fsync after succession of renames of different files",
we get instead an error when mounting the filesystem due to failure of
replaying the log:
$ mount /dev/sdb /mnt
mount: mount /dev/sdb on /mnt failed: File exists
Fix this by logging the parent directory of an inode whenever we find an
inode that no longer exists (was unlinked in the current transaction),
during the procedure which finds inodes that have old names that collide
with new names of other inodes.
A test case for fstests follows soon.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
After a succession of rename operations of different files and fsyncing
one of them, such that each file gets a new name that corresponds to an
old name of another file, we can end up with a log that will cause a
failure when attempted to replay at mount time (an EEXIST error).
We currently have correct behaviour when such succession of renames
involves only two files, but if there are more files involved, we end up
not logging all the inodes that are needed, therefore resulting in a
failure when attempting to replay the log.
Example:
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ mkdir /mnt/testdir
$ touch /mnt/testdir/fname1
$ touch /mnt/testdir/fname2
$ sync
$ mv /mnt/testdir/fname1 /mnt/testdir/fname3
$ mv /mnt/testdir/fname2 /mnt/testdir/fname4
$ ln /mnt/testdir/fname3 /mnt/testdir/fname2
$ touch /mnt/testdir/fname1
$ xfs_io -c "fsync" /mnt/testdir/fname1
<power failure>
$ mount /dev/sdb /mnt
mount: mount /dev/sdb on /mnt failed: File exists
So fix this by checking all inode dependencies when logging an inode. That
is, if one logged inode A has a new name that matches the old name of some
other inode B, check if inode B has a new name that matches the old name
of some other inode C, and so on. This fix is implemented not by doing any
recursive function calls but by using an iterative method using a linked
list that is used in a first-in-first-out fashion.
A test case for fstests follows soon.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Qgroups will do the old roots lookup at delayed ref time, which could be
while walking down the extent root while running a delayed ref. This
should be fine, except we specifically lock eb's in the backref walking
code irrespective of path->skip_locking, which deadlocks the system.
Fix up the backref code to honor path->skip_locking, nobody will be
modifying the commit_root when we're searching so it's completely safe
to do.
This happens since fb235dc06f ("btrfs: qgroup: Move half of the qgroup
accounting time out of commit trans"), kernel may lockup with quota
enabled.
There is one backref trace triggered by snapshot dropping along with
write operation in the source subvolume. The example can be reliably
reproduced:
btrfs-cleaner D 0 4062 2 0x80000000
Call Trace:
schedule+0x32/0x90
btrfs_tree_read_lock+0x93/0x130 [btrfs]
find_parent_nodes+0x29b/0x1170 [btrfs]
btrfs_find_all_roots_safe+0xa8/0x120 [btrfs]
btrfs_find_all_roots+0x57/0x70 [btrfs]
btrfs_qgroup_trace_extent_post+0x37/0x70 [btrfs]
btrfs_qgroup_trace_leaf_items+0x10b/0x140 [btrfs]
btrfs_qgroup_trace_subtree+0xc8/0xe0 [btrfs]
do_walk_down+0x541/0x5e3 [btrfs]
walk_down_tree+0xab/0xe7 [btrfs]
btrfs_drop_snapshot+0x356/0x71a [btrfs]
btrfs_clean_one_deleted_snapshot+0xb8/0xf0 [btrfs]
cleaner_kthread+0x12b/0x160 [btrfs]
kthread+0x112/0x130
ret_from_fork+0x27/0x50
When dropping snapshots with qgroup enabled, we will trigger backref
walk.
However such backref walk at that timing is pretty dangerous, as if one
of the parent nodes get WRITE locked by other thread, we could cause a
dead lock.
For example:
FS 260 FS 261 (Dropped)
node A node B
/ \ / \
node C node D node E
/ \ / \ / \
leaf F|leaf G|leaf H|leaf I|leaf J|leaf K
The lock sequence would be:
Thread A (cleaner) | Thread B (other writer)
-----------------------------------------------------------------------
write_lock(B) |
write_lock(D) |
^^^ called by walk_down_tree() |
| write_lock(A)
| write_lock(D) << Stall
read_lock(H) << for backref walk |
read_lock(D) << lock owner is |
the same thread A |
so read lock is OK |
read_lock(A) << Stall |
So thread A hold write lock D, and needs read lock A to unlock.
While thread B holds write lock A, while needs lock D to unlock.
This will cause a deadlock.
This is not only limited to snapshot dropping case. As the backref
walk, even only happens on commit trees, is breaking the normal top-down
locking order, makes it deadlock prone.
Fixes: fb235dc06f ("btrfs: qgroup: Move half of the qgroup accounting time out of commit trans")
CC: stable@vger.kernel.org # 4.14+
Reported-and-tested-by: David Sterba <dsterba@suse.com>
Reported-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
[ rebase to latest branch and fix lock assert bug in btrfs/007 ]
Signed-off-by: Qu Wenruo <wqu@suse.com>
[ copy logs and deadlock analysis from Qu's patch ]
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
Btrfs qgroup will still hit EDQUOT under the following case:
$ dev=/dev/test/test
$ mnt=/mnt/btrfs
$ umount $mnt &> /dev/null
$ umount $dev &> /dev/null
$ mkfs.btrfs -f $dev
$ mount $dev $mnt -o nospace_cache
$ btrfs subv create $mnt/subv
$ btrfs quota enable $mnt
$ btrfs quota rescan -w $mnt
$ btrfs qgroup limit -e 1G $mnt/subv
$ fallocate -l 900M $mnt/subv/padding
$ sync
$ rm $mnt/subv/padding
# Hit EDQUOT
$ xfs_io -f -c "pwrite 0 512M" $mnt/subv/real_file
[CAUSE]
Since commit a514d63882 ("btrfs: qgroup: Commit transaction in advance
to reduce early EDQUOT"), btrfs is not forced to commit transaction to
reclaim more quota space.
Instead, we just check pertrans metadata reservation against some
threshold and try to do asynchronously transaction commit.
However in above case, the pertrans metadata reservation is pretty small
thus it will never trigger asynchronous transaction commit.
[FIX]
Instead of only accounting pertrans metadata reservation, we calculate
how much free space we have, and if there isn't much free space left,
commit transaction asynchronously to try to free some space.
This may slow down the fs when we have less than 32M free qgroup space,
but should reduce a lot of false EDQUOT, so the cost should be
acceptable.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
Btrfs/139 will fail with a high probability if the testing machine (VM)
has only 2G RAM.
Resulting the final write success while it should fail due to EDQUOT,
and the fs will have quota exceeding the limit by 16K.
The simplified reproducer will be: (needs a 2G ram VM)
$ mkfs.btrfs -f $dev
$ mount $dev $mnt
$ btrfs subv create $mnt/subv
$ btrfs quota enable $mnt
$ btrfs quota rescan -w $mnt
$ btrfs qgroup limit -e 1G $mnt/subv
$ for i in $(seq -w 1 8); do
xfs_io -f -c "pwrite 0 128M" $mnt/subv/file_$i > /dev/null
echo "file $i written" > /dev/kmsg
done
$ sync
$ btrfs qgroup show -pcre --raw $mnt
The last pwrite will not trigger EDQUOT and final 'qgroup show' will
show something like:
qgroupid rfer excl max_rfer max_excl parent child
-------- ---- ---- -------- -------- ------ -----
0/5 16384 16384 none none --- ---
0/256 1073758208 1073758208 none 1073741824 --- ---
And 1073758208 is larger than
> 1073741824.
[CAUSE]
It's a bug in btrfs qgroup data reserved space management.
For quota limit, we must ensure that:
reserved (data + metadata) + rfer/excl <= limit
Since rfer/excl is only updated at transaction commmit time, reserved
space needs to be taken special care.
One important part of reserved space is data, and for a new data extent
written to disk, we still need to take the reserved space until
rfer/excl numbers get updated.
Originally when an ordered extent finishes, we migrate the reserved
qgroup data space from extent_io tree to delayed ref head of the data
extent, expecting delayed ref will only be cleaned up at commit
transaction time.
However for small RAM machine, due to memory pressure dirty pages can be
flushed back to disk without committing a transaction.
The related events will be something like:
file 1 written
btrfs_finish_ordered_io: ino=258 ordered offset=0 len=54947840
btrfs_finish_ordered_io: ino=258 ordered offset=54947840 len=5636096
btrfs_finish_ordered_io: ino=258 ordered offset=61153280 len=57344
btrfs_finish_ordered_io: ino=258 ordered offset=61210624 len=8192
btrfs_finish_ordered_io: ino=258 ordered offset=60583936 len=569344
cleanup_ref_head: num_bytes=54947840
cleanup_ref_head: num_bytes=5636096
cleanup_ref_head: num_bytes=569344
cleanup_ref_head: num_bytes=57344
cleanup_ref_head: num_bytes=8192
^^^^^^^^^^^^^^^^ This will free qgroup data reserved space
file 2 written
...
file 8 written
cleanup_ref_head: num_bytes=8192
...
btrfs_commit_transaction <<< the only transaction committed during
the test
When file 2 is written, we have already freed 128M reserved qgroup data
space for ino 258. Thus later write won't trigger EDQUOT.
This allows us to write more data beyond qgroup limit.
In my 2G ram VM, it could reach about 1.2G before hitting EDQUOT.
[FIX]
By moving reserved qgroup data space from btrfs_delayed_ref_head to
btrfs_qgroup_extent_record, we can ensure that reserved qgroup data
space won't be freed half way before commit transaction, thus fix the
problem.
Fixes: f64d5ca868 ("btrfs: delayed_ref: Add new function to record reserved space into delayed ref")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The member btrfs_fs_info::scrub_nocow_workers is unused since the nocow
optimization was removed from scrub in 9bebe665c3 ("btrfs: scrub:
Remove unused copy_nocow_pages and its callchain").
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The scrub worker pointers are not NULL iff the scrub is running, so
reset them back once the last reference is dropped. Add assertions to
the initial phase of scrub to verify that.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Use the refcount_t for fs_info::scrub_workers_refcnt instead of int so
we get the extra checks. All reference changes are still done under
scrub_lock.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
scrub_workers_refcnt is protected by scrub_lock, add lockdep_assert_held()
in scrub_workers_get().
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Suggested-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have killed volume mutex (commit: dccdb07bc9
btrfs: kill btrfs_fs_info::volume_mutex). This a trival one seems to have
escaped.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is no need to forward declare flush_write_bio(), as it only
depends on submit_one_bio(). Both of them are pretty small, just move
them to kill the forward declaration.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The variables and function parameters of __etree_search which pertain to
prev/next are grossly misnamed. Namely, prev_ret holds the next state
and not the previous. Similarly, next_ret actually holds the previous
extent state relating to the offset we are interested in. Fix this by
renaming the variables as well as switching the arguments order. No
functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
With the refactoring introduced in 8b62f87bad ("Btrfs: reworki
outstanding_extents") this flag became unused. Remove it and renumber
the following flags accordingly. No functional changes.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is no point in using a construct like 'if (!condition)
WARN_ON(1)'. Use WARN_ON(!condition) directly. No functional changes.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We could generate a lot of delayed refs in evict but never have any left
over space from our block rsv to make up for that fact. So reserve some
extra space and give it to the transaction so it can be used to refill
the delayed refs rsv every loop through the truncate path.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For FLUSH_LIMIT flushers we really can only allocate chunks and flush
delayed inode items, everything else is problematic. I added a bunch of
new states and it lead to weirdness in the FLUSH_LIMIT case because I
forgot about how it worked. So instead explicitly declare the states
that are ok for flushing with FLUSH_LIMIT and use that for our state
machine. Then as we add new things that are safe we can just add them
to this list.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
With severe fragmentation we can end up with our inode rsv size being
huge during writeout, which would cause us to need to make very large
metadata reservations.
However we may not actually need that much once writeout is complete,
because of the over-reservation for the worst case.
So instead try to make our reservation, and if we couldn't make it
re-calculate our new reservation size and try again. If our reservation
size doesn't change between tries then we know we are actually out of
space and can error. Flushing that could have been running in parallel
did not make any space.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
[ rename to calc_refill_bytes, update comment and changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
With the introduction of the per-inode block_rsv it became possible to
have really really large reservation requests made because of data
fragmentation. Since the ticket stuff assumed that we'd always have
relatively small reservation requests it just killed all tickets if we
were unable to satisfy the current request.
However, this is generally not the case anymore. So fix this logic to
instead see if we had a ticket that we were able to give some
reservation to, and if we were continue the flushing loop again.
Likewise we make the tickets use the space_info_add_old_bytes() method
of returning what reservation they did receive in hopes that it could
satisfy reservations down the line.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We've done this forever because of the voodoo around knowing how much
space we have. However, we have better ways of doing this now, and on
normal file systems we'll easily have a global reserve of 512MiB, and
since metadata chunks are usually 1GiB that means we'll allocate
metadata chunks more readily. Instead use the actual used amount when
determining if we need to allocate a chunk or not.
This has a side effect for mixed block group fs'es where we are no
longer allocating enough chunks for the data/metadata requirements. To
deal with this add a ALLOC_CHUNK_FORCE step to the flushing state
machine. This will only get used if we've already made a full loop
through the flushing machinery and tried committing the transaction.
If we have then we can try and force a chunk allocation since we likely
need it to make progress. This resolves issues I was seeing with
the mixed bg tests in xfstests without the new flushing state.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
[ merged with patch "add ALLOC_CHUNK_FORCE to the flushing code" ]
Signed-off-by: David Sterba <dsterba@suse.com>
For enospc_debug having the block rsvs is super helpful to see if we've
done something wrong.
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
may_commit_transaction will skip committing the transaction if we don't
have enough pinned space or if we're trying to find space for a SYSTEM
chunk. However, if we have pending free block groups in this transaction
we still want to commit as we may be able to allocate a chunk to make
our reservation. So instead of just returning ENOSPC, check if we have
free block groups pending, and if so commit the transaction to allow us
to use that free space.
Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Zstd compression requires different amounts of memory for each level of
compression. The prior patches implemented indirection to allow for each
compression type to manage their workspaces independently. This patch
uses this indirection to implement compression level support for zstd.
To manage the additional memory require, each compression level has its
own queue of workspaces. A global LRU is used to help with reclaim.
Reclaim is done via a timer which provides a mechanism to decrease
memory utilization by keeping only workspaces around that are sized
appropriately. Forward progress is guaranteed by a preallocated max
workspace hidden from the LRU.
When getting a workspace, it uses a bitmap to identify the levels that
are populated and scans up. If it finds a workspace that is greater than
it, it uses it, but does not update the last_used time and the
corresponding place in the LRU. If we hit memory pressure, we sleep on
the max level workspace. We continue to rescan in case we can use a
smaller workspace, but eventually should be able to obtain the max level
workspace or allocate one again should memory pressure subside.
The memory requirement for decompression is the same as level 1, and
therefore can use any of available workspace.
The number of workspaces is bound by an upper limit of the workqueue's
limit which currently is 2 (percpu limit). The reclaim timer is used to
free inactive/improperly sized workspaces and is set to 307s to avoid
colliding with transaction commit (every 30s).
Repeating the experiment from v2 [1], the Silesia corpus was copied to a
btrfs filesystem 10 times and then read back after dropping the caches.
The btrfs filesystem was on an SSD.
Level Ratio Compression (MB/s) Decompression (MB/s) Memory (KB)
1 2.658 438.47 910.51 780
2 2.744 364.86 886.55 1004
3 2.801 336.33 828.41 1260
4 2.858 286.71 886.55 1260
5 2.916 212.77 556.84 1388
6 2.363 119.82 990.85 1516
7 3.000 154.06 849.30 1516
8 3.011 159.54 875.03 1772
9 3.025 100.51 940.15 1772
10 3.033 118.97 616.26 1772
11 3.036 94.19 802.11 1772
12 3.037 73.45 931.49 1772
13 3.041 55.17 835.26 2284
14 3.087 44.70 716.78 2547
15 3.126 37.30 878.84 2547
[1] https://lore.kernel.org/linux-btrfs/20181031181108.289340-1-terrelln@fb.com/
Cc: Nick Terrell <terrelln@fb.com>
Cc: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It is possible based on the level configurations that a higher level
workspace uses less memory than a lower level workspace. In order to
reuse workspaces, this must be made a monotonic relationship. This
precomputes the required memory for each level and enforces the
monotonicity between level and memory required. This is also done
in upstream zstd in [1].
[1] a68b76afef
Cc: Nick Terrell <terrelln@fb.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Zstd currently only supports the default level of compression. This
patch switches to using the level passed in for btrfs zstd
configuration.
Zstd workspaces now keep track of the requested level as this can differ
from the size of the workspace.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently, the only user of set_level() is zlib which sets an internal
workspace parameter. As level is now plumbed into get_workspace(), this
can be handled there rather than separately.
This repurposes set_level() to bound the level passed in so it can be
used when setting the mounts compression level and as well as verifying
the level before getting a workspace. The other benefit is this divides
the meaning of compress(0) and get_workspace(0). The former means we
want to use the default compression level of the compression type. The
latter means we can use any workspace available.
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Zlib compression supports multiple levels, but doesn't require changing
in how a workspace itself is created and managed. Zstd introduces a
different memory requirement such that higher levels of compression
require more memory.
This requires changes in how the alloc()/get() methods work for zstd.
This pach plumbs compression level through the interface as a parameter
in preparation for zstd compression levels. This gives the compression
types opportunity to create/manage based on the compression level.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The previous patch added generic helpers for get_workspace() and
put_workspace(). Now, we can migrate ownership of the workspace_manager
to be in the compression type code as the compression code itself
doesn't care beyond being able to get a workspace. The init/cleanup and
get/put methods are abstracted so each compression algorithm can decide
how they want to manage their workspaces.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There are two levels of workspace management. First, alloc()/free()
which are responsible for actually creating and destroy workspaces.
Second, at a higher level, get()/put() which is the compression code
asking for a workspace from a workspace_manager.
The compression code shouldn't really care how it gets a workspace, but
that it got a workspace. This adds get_workspace() and put_workspace()
to be the higher level interface which is responsible for indexing into
the appropriate compression type. It also introduces
btrfs_put_workspace() and btrfs_get_workspace() to be the generic
implementations of the higher interface.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Workspace manager init and cleanup code is open coded inside a for loop
over the compression types. This forces each compression type to rely on
the same workspace manager implementation. This patch creates helper
methods that will be the generic implementation for btrfs workspace
management.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Make the workspace_manager own the interface operations rather than
managing index-paired arrays for the workspace_manager and compression
operations.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
While the heuristic workspaces aren't really compression workspaces,
they use the same interface for managing them. So rather than branching,
let's just handle them once again as the index 0 compression type.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is in preparation for zstd compression levels. As each level will
require different size of workspace, workspaces_list is no longer a
really fitting name.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It is very easy to miss places that rely on a certain bitshifting for
decoding the type_level overloading. Add helpers to do this instead.
Cc: Omar Sandoval <osandov@osandov.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Support for a new command that can be used eg. as a command
$ btrfs device scan --forget [dev]'
(the final name may change though)
to undo the effects of 'btrfs device scan [dev]'. For this purpose
this patch proposes to use ioctl #5 as it was empty and is next to the
SCAN ioctl.
The new ioctl BTRFS_IOC_FORGET_DEV works only on the control device
(/dev/btrfs-control) to unregister one or all devices, devices that are
not mounted.
The argument is struct btrfs_ioctl_vol_args, ::name specifies the device
path. To unregister all device, the path is an empty string.
Again, the devices are removed only if they aren't part of a mounte
filesystem.
This new ioctl provides:
- release of unwanted btrfs_fs_devices and btrfs_devices structures
from memory if the device is not going to be mounted
- ability to mount filesystem in degraded mode, when one devices is
corrupted like in split brain raid1
- running test cases which would require reloading the kernel module
but this is not possible eg. due to mounted filesystem or built-in
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
The throttle path doesn't take cleaner_delayed_iput_mutex, which means
we could think we're done flushing iputs in the data space reservation
path when we could have a throttler doing an iput. There's no real
reason to serialize the delayed iput flushing, so instead of taking the
cleaner_delayed_iput_mutex whenever we flush the delayed iputs just
replace it with an atomic counter and a waitqueue. This removes the
short (or long depending on how big the inode is) window where we think
there are no more pending iputs when there really are some.
The waiting is killable as it could be indirectly called from user
operations like fallocate or zero-range. Such call sites should handle
the error but otherwise it's not necessary. Eg. flush_space just needs
to attempt to make space by waiting on iputs.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
[ add killable comment and changelog parts ]
Signed-off-by: David Sterba <dsterba@suse.com>
Since inc_block_group_ro() would return -ENOSPC, outputting debug info
for enospc_debug mount option would be helpful to debug some balance
false ENOSPC report.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Inside qgroup_rsv_add/release(), we have trace events
trace_qgroup_update_reserve() to catch reserved space update.
However we still have two manual trace_qgroup_update_reserve() calls
just outside these functions. Remove these duplicated calls.
Fixes: 64ee4e751a ("btrfs: qgroup: Update trace events to use new separate rsv types")
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
A compiler warning (in a patch in development) pointed to a variable
that was used only inside and ASSERT:
u64 root_objectid = root->root_key.objectid;
ASSERT(root_objectid == ...);
fs/btrfs/relocation.c: In function ‘insert_dirty_subv’:
fs/btrfs/relocation.c:2138:6: warning: unused variable ‘root_objectid’ [-Wunused-variable]
u64 root_objectid = root->root_key.objectid;
^~~~~~~~~~~~~
When CONFIG_BRTFS_ASSERT isn't enabled, variable root_objectid isn't used.
Rework the assertion helper by adding a runtime check instead of the
'#ifdef CONFIG_BTRFS_ASSERT #else ...", so the compiler sees the
condition being passed into an inline function after preprocessing.
Signed-off-by: Anders Roxell <anders.roxell@linaro.org>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
The last caller that does not have a fixed value of lock is
btrfs_set_path_blocking, that actually does the same conditional swtich
by the lock type so we can merge the branches together and remove the
helper.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently, the number of readers and writers is checked and in case
there are any, wait and redo the locks. There's some duplication
before the branches go back to again label, eg. calling wait_event on
blocking_readers twice.
The sequence is transformed
loop:
* wait for readers
* wait for writers
* write_lock
* check readers, unlock and wait for readers, loop
* check writers, unlock and wait for writers, loop
The new sequence is not exactly the same due to the simplification, for
readers it's slightly faster. For the writers, original code does
* wait for writers
* (loop) wait for readers
* wait for writers -- again
while the new goes directly to the reader check. This should behave the
same on a contended lock with multiple writers and readers, but can
reduce number of times we're waiting on something.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_set_lock_blocking is now only a simple wrapper around
btrfs_set_lock_blocking_write. The name does not bring any semantic
value that could not be inferred from the new function so there's no
point keeping it.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
We can use the right helper where the lock type is a fixed parameter.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
There are many callers that hardcode the desired lock type so we can
avoid the switch and call them directly. Split the current function to
two. There are no remaining users of btrfs_clear_lock_blocking_rw so
it's removed. The call sites will be converted in followup patches.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
There are many callers that hardcode the desired lock type so we can
avoid the switch and call them directly. Split the current function to
two but leave a helper that still takes the variable lock type to make
current code compile. The call sites will be converted in followup
patches.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
Since it's replaced by new delayed subtree swap code, remove the
original code.
The cleanup is small since most of its core function is still used by
delayed subtree swap trace.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Before this patch, qgroup code traces the whole subtree of subvolume and
reloc trees unconditionally.
This makes qgroup numbers consistent, but it could cause tons of
unnecessary extent tracing, which causes a lot of overhead.
However for subtree swap of balance, just swap both subtrees because
they contain the same contents and tree structure, so qgroup numbers
won't change.
It's the race window between subtree swap and transaction commit could
cause qgroup number change.
This patch will delay the qgroup subtree scan until COW happens for the
subtree root.
So if there is no other operations for the fs, balance won't cause extra
qgroup overhead. (best case scenario)
Depending on the workload, most of the subtree scan can still be
avoided.
Only for worst case scenario, it will fall back to old subtree swap
overhead. (scan all swapped subtrees)
[[Benchmark]]
Hardware:
VM 4G vRAM, 8 vCPUs,
disk is using 'unsafe' cache mode,
backing device is SAMSUNG 850 evo SSD.
Host has 16G ram.
Mkfs parameter:
--nodesize 4K (To bump up tree size)
Initial subvolume contents:
4G data copied from /usr and /lib.
(With enough regular small files)
Snapshots:
16 snapshots of the original subvolume.
each snapshot has 3 random files modified.
balance parameter:
-m
So the content should be pretty similar to a real world root fs layout.
And after file system population, there is no other activity, so it
should be the best case scenario.
| v4.20-rc1 | w/ patchset | diff
-----------------------------------------------------------------------
relocated extents | 22615 | 22457 | -0.1%
qgroup dirty extents | 163457 | 121606 | -25.6%
time (sys) | 22.884s | 18.842s | -17.6%
time (real) | 27.724s | 22.884s | -17.5%
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
To allow delayed subtree swap rescan, btrfs needs to record per-root
information about which tree blocks get swapped. This patch introduces
the required infrastructure.
The designed workflow will be:
1) Record the subtree root block that gets swapped.
During subtree swap:
O = Old tree blocks
N = New tree blocks
reloc tree subvolume tree X
Root Root
/ \ / \
NA OB OA OB
/ | | \ / | | \
NC ND OE OF OC OD OE OF
In this case, NA and OA are going to be swapped, record (NA, OA) into
subvolume tree X.
2) After subtree swap.
reloc tree subvolume tree X
Root Root
/ \ / \
OA OB NA OB
/ | | \ / | | \
OC OD OE OF NC ND OE OF
3a) COW happens for OB
If we are going to COW tree block OB, we check OB's bytenr against
tree X's swapped_blocks structure.
If it doesn't fit any, nothing will happen.
3b) COW happens for NA
Check NA's bytenr against tree X's swapped_blocks, and get a hit.
Then we do subtree scan on both subtrees OA and NA.
Resulting 6 tree blocks to be scanned (OA, OC, OD, NA, NC, ND).
Then no matter what we do to subvolume tree X, qgroup numbers will
still be correct.
Then NA's record gets removed from X's swapped_blocks.
4) Transaction commit
Any record in X's swapped_blocks gets removed, since there is no
modification to swapped subtrees, no need to trigger heavy qgroup
subtree rescan for them.
This will introduce 128 bytes overhead for each btrfs_root even qgroup
is not enabled. This is to reduce memory allocations and potential
failures.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Refactor btrfs_qgroup_trace_subtree_swap() into
qgroup_trace_subtree_swap(), which only needs two extent buffer and some
other bool to control the behavior.
This provides the basis for later delayed subtree scan work.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Relocation code will drop btrfs_root::reloc_root as soon as
merge_reloc_root() finishes.
However later qgroup code will need to access btrfs_root::reloc_root
after merge_reloc_root() for delayed subtree rescan.
So alter the timming of resetting btrfs_root:::reloc_root, make it
happens after transaction commit.
With this patch, we will introduce a new btrfs_root::state,
BTRFS_ROOT_DEAD_RELOC_TREE, to info part of btrfs_root::reloc_tree user
that although btrfs_root::reloc_tree is still non-NULL, but still it's
not used any more.
The lifespan of btrfs_root::reloc tree will become:
Old behavior | New
------------------------------------------------------------------------
btrfs_init_reloc_root() --- | btrfs_init_reloc_root() ---
set reloc_root | | set reloc_root |
| | |
| | |
merge_reloc_root() | | merge_reloc_root() |
|- btrfs_update_reloc_root() --- | |- btrfs_update_reloc_root() -+-
clear btrfs_root::reloc_root | set ROOT_DEAD_RELOC_TREE |
| record root into dirty |
| roots rbtree |
| |
| reloc_block_group() Or |
| btrfs_recover_relocation() |
| | After transaction commit |
| |- clean_dirty_subvols() ---
| clear btrfs_root::reloc_root
During ROOT_DEAD_RELOC_TREE set lifespan, the only user of
btrfs_root::reloc_tree should be qgroup.
Since reloc root needs a longer life-span, this patch will also delay
btrfs_drop_snapshot() call.
Now btrfs_drop_snapshot() is called in clean_dirty_subvols().
This patch will increase the size of btrfs_root by 16 bytes.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The first thing we do is loop through the list, this
if (!list_empty())
btrfs_create_pending_block_groups();
thing is just wasted space.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of open coding this stuff use the helper instead.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have this open coded in btrfs_destroy_delayed_refs, use the helper
instead.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The kernel log messages help debugging and audit, add them for scrub
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The workqueue name is constructed from a format string but the prefix
does not need to be set by %s.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Both btrfs_find_device() and find_device() does the same thing except
that the latter does not take the seed device onto account in the device
scanning context. We can merge them.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Preparatory patch to add ioctl that allows to forget a device (ie.
reverse of scan).
Refactors btrfs_free_stale_devices() to obtain return status. As this
function can fail if it can't find the given path (returns -ENOENT) or
trying to delete a mounted device (returns -EBUSY).
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_find_device() accepts fs_info as an argument and retrieves
fs_devices from fs_info.
Instead use fs_devices, so that this function can be used in non-mount
(during device scanning) context as well.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_find_device_by_devspec() finds the device by @devid or by
@device_path. This patch makes code flow easy to read by open coding the
else part and renames devpath to device_path.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_find_device_missing_or_by_path() is relatively small function, and
its only parent btrfs_find_device_by_devspec() is small as well. Besides
there are a number of find_device functions. Merge
btrfs_find_device_missing_or_by_path() into its parent.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In order to avoid duplicating init code for em there is an additional
label, not_found_em, which is used to only set ->block_start. The only
case when it will be used is if the extent we are adding overlaps with
an existing extent. Make that case more obvious by:
1. Adding a comment hinting at what's going on
2. Assigning EXTENT_MAP_HOLE and directly going to insert.
No functional changes.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Core btree functions in btrfs generally return 0 when an item is found,
1 in case the sought item cannot be found and <0 when an error happens.
Consolidate the checks for those conditions in one 'if () {} else if ()
{}' construct rather than 2 separate 'if () {}' statements. This
emphasizes that the handling code pertains to a single function. No
functional changes.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
found_type really holds the type of extent and is guaranteed to to have
a value between [0, 2]. The only time it can contain anything different
is if btrfs_lookup_file_extent returned a positive value and the
previous item is different than an extent. Avoid this situation by
simply checking found_key.type rather than assigning the item type to
found_type intermittently. Also make the variable an u8 to reduce stack
usage. No functional changes.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Move the check that verifies if both inodes have checksums disabled or
both have them enabled, from the clone and deduplication functions into
the new common helper btrfs_remap_file_range_prep().
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We can never have extents marked as EXTENT_MAP_DELALLOC since this
value is only ever used by btrfs_get_extent_fiemap. In this case the
extent map is created by btrfs_get_extent_fiemap and is never really
published, this flag is used to return the corresponding userspace one.
Considering this, it's pointless having a check for EXTENT_MAP_DELALLOC
in mergable_maps. Just remove it.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If the call to btrfs_balance() failed we would overwrite the error
returned to user space with -EFAULT if the call to copy_to_user() failed
as well. Fix that by calling copy_to_user() only if btrfs_balance()
returned success or was canceled.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If the call to btrfs_dev_replace_by_ioctl() failed we would overwrite the
error returned to user space with -EFAULT if the call to copy_to_user()
failed as well. Fix that by calling copy_to_user() only if no error
happened before or a device replace operation was canceled.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Checking if either of the inodes corresponds to a swapfile is already
performed by generic_remap_file_range_prep(), so we do not need to do
it in the btrfs clone and deduplication functions.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add a couple of comments regarding the logic flow in shrink_delalloc.
Then, cease using max_reclaim as a temporary variable when calculating
nr_pages. Finally give max_reclaim a more becoming name, which
uneqivocally shows at what this variable really holds. No functional
changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add a comment explaining when ->inode could be NULL and why we always
perform the ->async_delalloc_pages modification.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It can never trigger since before calling alloc_delalloc_work we have
called igrab in start_delalloc_inodes.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
ihold is supposed to be used when the caller already has a reference to
the inode. In the case of cow_file_range_async this invariants holds,
since the 3 call chains leading to this function all take a reference:
btrfs_writepage <--- does igrab
extent_write_full_page
__extent_writepage
writepage_delalloc
btrfs_run_delalloc_range
cow_file_range_async
extent_write_cache_pages <--- does igrab
__extent_writepage (same callchain as above)
and
submit_compressed_extents <-- already called from async CoW submit path,
which would have done ihold.
extent_write_locked_range
__extent_writepage
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
It's used only once so just inline the call to i_size_read. The
semantics regarding the inode size are not changed, the pages in the
range are locked and i_size cannot change between the time it was set
and used.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We already pass the async_cow struct that holds a reference to the
inode. Exploit this fact and remove the extra inode argument. No
functional changes.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Fixes gcc '-Wunused-but-set-variable' warning:
fs/btrfs/ioctl.c: In function 'btrfs_extent_same':
fs/btrfs/ioctl.c:3260:6: warning:
variable 'num_pages' set but not used [-Wunused-but-set-variable]
It not used any more since commit 9ee8234e6220 ("Btrfs: use
generic_remap_file_range_prep() for cloning and deduplication")
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: David Sterba <dsterba@suse.com>
hole_len is only used if the hole falls within the requested range. Make
that explicitly clear by only assigning in the corresponding branch.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Make btrfs_get_extent_fiemap a bit more friendly. First step is to
rename the closely related, yet arbitrary named
range_start/found_end/found variables. They define the delalloc range
that is found in case a real extent wasn't found. Subsequently remove
an unnecessary check for hole_em since it's guaranteed to be set i.e the
check is always true. Top it off by giving all comments a refresh.
No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ reformatted a few more comments ]
Signed-off-by: David Sterba <dsterba@suse.com>
This function is a simple wrapper over btrfs_get_extent that returns
either:
a) A real extent in the passed range or
b) Adjusted extent based on whether delalloc bytes are found backing up
a hole.
To support these semantics it doesn't need the page/pg_offset/create
arguments which are passed to btrfs_get_extent in case an extent is to
be created. So simplify the function by removing the unused arguments.
No functional changes.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We are holding a transaction handle when setting an acl, therefore we can
not allocate the xattr value buffer using GFP_KERNEL, as we could deadlock
if reclaim is triggered by the allocation, therefore setup a nofs context.
Fixes: 39a27ec100 ("btrfs: use GFP_KERNEL for xattr and acl allocations")
CC: stable@vger.kernel.org # 4.9+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We are holding a transaction handle when creating a tree, therefore we can
not allocate the root using GFP_KERNEL, as we could deadlock if reclaim is
triggered by the allocation, therefore setup a nofs context.
Fixes: 74e4d82757 ("btrfs: let callers of btrfs_alloc_root pass gfp flags")
CC: stable@vger.kernel.org # 4.9+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If the call to btrfs_get_dev_stats() failed we would overwrite the error
returned to user space with -EFAULT if the call to copy_to_user() failed
as well. Fix that by calling copy_to_user() only if btrfs_get_dev_stats()
returned success.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If the call to btrfs_scrub_progress() failed we would overwrite the error
returned to user space with -EFAULT if the call to copy_to_user() failed
as well. Fix that by calling copy_to_user() only if btrfs_scrub_progress()
returned success.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If scrub returned an error and then the copy_to_user() call did not
succeed, we would overwrite the error returned by scrub with -EFAULT.
Fix that by calling copy_to_user() only if btrfs_scrub_dev() returned
success.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since this function is no longer a callback there is no need to have
its first argument obfuscated with a void *. Change it directly to a
pointer to an inode. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Drop LIST_HEAD where the variable it declares is never used.
The uses were removed in 3fd0a5585e ("Btrfs: Metadata ENOSPC
handling for balance"), but not the declaration.
The semantic patch that fixes this problem is as follows:
(http://coccinelle.lip6.fr/)
// <smpl>
@@
identifier x;
@@
- LIST_HEAD(x);
... when != x
// </smpl>
Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Three conflicts, one of which, for marvell10g.c is non-trivial and
requires some follow-up from Heiner or someone else.
The issue is that Heiner converted the marvell10g driver over to
use the generic c45 code as much as possible.
However, in 'net' a bug fix appeared which makes sure that a new
local mask (MDIO_AN_10GBT_CTRL_ADV_NBT_MASK) with value 0x01e0
is cleared.
Signed-off-by: David S. Miller <davem@davemloft.net>
Store the request queue the last bio was submitted to in the iocb
private data in addition to the cookie so that we find the right block
device. Also refactor the common direct I/O bio submission code into a
nice little helper.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Modified to use bio_set_polled().
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
For the upcoming async polled IO, we can't sleep allocating requests.
If we do, then we introduce a deadlock where the submitter already
has async polled IO in-flight, but can't wait for them to complete
since polled requests must be active found and reaped.
Utilize the helper in the blockdev DIRECT_IO code.
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Just call blk_poll on the iocb cookie, we can derive the block device
from the inode trivially.
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Reject unsupported ioctl flags explicitly, so the following command
on a regular ubifs file will fail:
chattr +d ubifs_file
And xfstests generic/424 will pass.
Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
If a bulk layout recall or a metadata server reboot coincides with a
umount, then holding a reference to an inode is unsafe unless we
also hold a reference to the super block.
Fixes: fd9a8d7160 ("NFSv4.1: Fix bulk recall and destroy of layouts")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
__vfs_write() was unexported, and removed from <linux/fs.h>, but
forgotten to be made static.
Fixes: eb031849d5 ("fs: unexport __vfs_read/__vfs_write")
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Fix a soft lockup when NFS client delegation recovery is attempted
but the inode is in the process of being freed. When the
igrab(inode) call fails, and we have to restart the recovery process,
we need to ensure that we won't attempt to recover the same delegation
again.
Fixes: 45870d6909 ("NFSv4.1: Test delegation stateids when server...")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
The check if (bh) in udf_sync_fs() is pointless as we cannot have
sbi->s_lvid_dirty and !sbi->s_lvid_bh (as already asserted by
udf_updated_lvid()). So just drop the pointless check.
Reviewed-by: Steven J. Magnani <steve@digidescorp.com>
Reported-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Jan Kara <jack@suse.cz>
A 'false retry' in NFSv4.1 occurs when the client attempts to transmit a
new RPC call using a slot+sequence number combination that references an
already cached one. Currently, the Linux NFS client will do this if a
user process interrupts an RPC call that is in progress.
The problem with doing so is that we defeat the main mechanism used by
the server to differentiate between a new call and a replayed one. Even
if the server is able to perfectly cache the arguments of the old call,
it cannot know if the client intended to replay or send a new call.
The obvious fix is to bump the sequence number pre-emptively if an
RPC call is interrupted, but in order to deal with the corner cases
where the interrupted call is not actually received and processed by
the server, we need to interpret the error NFS4ERR_SEQ_MISORDERED
as a sign that we need to either wait or locate a correct sequence
number that lies between the value we sent, and the last value that
was acked by a SEQUENCE call on that slot.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Tested-by: Jason Tibbitts <tibbs@math.uh.edu>
-----BEGIN PGP SIGNATURE-----
iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAlxu23wTHGlkcnlvbW92
QGdtYWlsLmNvbQAKCRBKf944AhHzi04sB/9/Lm694io6coIW/8mTVjKs9iF7+AkA
V11Pu3VHnp8okcNs3iSv6rtbPgLrONP5fg7OC9xAzzsEjA8kitnV7kC0RunzuUiw
BnZcE0v5Eh4NB34DW4w4I0JDkEH75zlsw/ar8HDvgxTw9hCgV56n1oic+MkqAuiL
5TwBjuLL5Nl4tA9djcMtHhBy3u+jQXAEDJ9tiKFBhzQOPo73N2IyZyY+Kz3qqygV
XLWvpmW6PA9jJt8pUGRCqpahaYAY7+cyzOUJbqdRUpeESVl/2Epr7zJZo9LHQtsE
GympS5weYMg6xLoZHQXvLbb77E/SlXkeVyNzBw7LnV6us2PS4tbgSt2J
=alOG
-----END PGP SIGNATURE-----
Merge tag 'ceph-for-5.0-rc8' of git://github.com/ceph/ceph-client
Pull ceph fixes from Ilya Dryomov:
"Two bug fixes for old issues, both marked for stable"
* tag 'ceph-for-5.0-rc8' of git://github.com/ceph/ceph-client:
ceph: avoid repeatedly adding inode to mdsc->snap_flush_list
libceph: handle an empty authorize reply
Tetsuo has reported that creating a thousands of processes sharing MM
without SIGHAND (aka alien threads) and setting
/proc/<pid>/oom_score_adj will swamp the kernel log and takes ages [1]
to finish. This is especially worrisome that all that printing is done
under RCU lock and this can potentially trigger RCU stall or softlockup
detector.
The primary reason for the printk was to catch potential users who might
depend on the behavior prior to 44a70adec9 ("mm, oom_adj: make sure
processes sharing mm have same view of oom_score_adj") but after more
than 2 years without a single report I guess it is safe to simply remove
the printk altogether.
The next step should be moving oom_score_adj over to the mm struct and
remove all the tasks crawling as suggested by [2]
[1] http://lkml.kernel.org/r/97fce864-6f75-bca5-14bc-12c9f890e740@i-love.sakura.ne.jp
[2] http://lkml.kernel.org/r/20190117155159.GA4087@dhcp22.suse.cz
Link: http://lkml.kernel.org/r/20190212102129.26288-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Yong-Taek Lee <ytk.lee@samsung.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is useful for moving journal thread into cgroup or
for tracing it with ftrace/perf/blktrace.
For now the only way is `pgrep jbd2/$DISK` but this is not reliable:
name may be longer than "comm" limit and any task could mock it.
Attribute shows pid in current pid-namespace or 0 if task is unreachable.
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
All other configuration options for the ext* family of file systems use
"Ext%u" instead of "EXT%u".
Fixes: 6ba495e925 ("ext4: Add configurable run-time mballoc debugging")
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Fix compile error below when using BUFFER_TRACE.
fs/ext4/inode.c: In function ‘ext4_expand_extra_isize’:
fs/ext4/inode.c:5979:19: error: request for member ‘bh’ in something not a structure or union
BUFFER_TRACE(iloc.bh, "get_write_access");
Fixes: c03b45b853 ("ext4, project: expand inode extra size if possible")
Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
The jh pointer may be used uninitialized in the two cases below and the
compiler complain about it when enabling JBUFFER_TRACE macro, fix them.
In file included from fs/jbd2/transaction.c:19:0:
fs/jbd2/transaction.c: In function ‘jbd2_journal_get_undo_access’:
./include/linux/jbd2.h:1637:38: warning: ‘jh’ is used uninitialized in this function [-Wuninitialized]
#define JBUFFER_TRACE(jh, info) do { printk("%s: %d\n", __func__, jh->b_jcount);} while (0)
^
fs/jbd2/transaction.c:1219:23: note: ‘jh’ was declared here
struct journal_head *jh;
^
In file included from fs/jbd2/transaction.c:19:0:
fs/jbd2/transaction.c: In function ‘jbd2_journal_dirty_metadata’:
./include/linux/jbd2.h:1637:38: warning: ‘jh’ may be used uninitialized in this function [-Wmaybe-uninitialized]
#define JBUFFER_TRACE(jh, info) do { printk("%s: %d\n", __func__, jh->b_jcount);} while (0)
^
fs/jbd2/transaction.c:1332:23: note: ‘jh’ was declared here
struct journal_head *jh;
^
Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
Reviewed-by: Jan Kara <jack@suse.cz>
We can't pass error pointers to brelse().
Fixes: fb265c9cb4 ("ext4: add ext4_sb_bread() to disambiguate ENOMEM cases")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
A previous commit removed the initialization of variable 'error' to zero,
and can cause a bogus error return. This occurs when error contains a
non-zero garbage value and the call to xchk_should_terminate detects a
pending fatal signal and checks for a zero error before setting it
to -EAGAIN. Fix the issue by initializing error to zero.
Fixes: b9454fe056 ("xfs: clean up the inode cluster checking in the inobt scrub")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Add a mode where XFS never overwrites existing blocks in place. This
is to aid debugging our COW code, and also put infatructure in place
for things like possible future support for zoned block devices, which
can't support overwrites.
This mode is enabled globally by doing a:
echo 1 > /sys/fs/xfs/debug/always_cow
Note that the parameter is global to allow running all tests in xfstests
easily in this mode, which would not easily be possible with a per-fs
sysfs file.
In always_cow mode persistent preallocations are disabled, and fallocate
will fail when called with a 0 mode (with our without
FALLOC_FL_KEEP_SIZE), and not create unwritten extent for zeroed space
when called with FALLOC_FL_ZERO_RANGE or FALLOC_FL_UNSHARE_RANGE.
There are a few interesting xfstests failures when run in always_cow
mode:
- generic/392 fails because the bytes used in the file used to test
hole punch recovery are less after the log replay. This is
because the blocks written and then punched out are only freed
with a delay due to the logging mechanism.
- xfs/170 will fail as the already fragile file streams mechanism
doesn't seem to interact well with the COW allocator
- xfs/180 xfs/182 xfs/192 xfs/198 xfs/204 and xfs/208 will claim
the file system is badly fragmented, but there is not much we
can do to avoid that when always writing out of place
- xfs/205 fails because overwriting a file in always_cow mode
will require new space allocation and the assumption in the
test thus don't work anymore.
- xfs/326 fails to modify the file at all in always_cow mode after
injecting the refcount error, leading to an unexpected md5sum
after the remount, but that again is expected
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
No user of it in the iomap code at the moment, but we should not
actively report wrong information if we can trivially get it right.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
If we have racing buffered and direct I/O COW fork extents under
writeback can have been moved to the data fork by the time we call
xfs_reflink_convert_cow from xfs_submit_ioend. This would be mostly
harmless as the block numbers don't change by this move, except for
the fact that xfs_bmapi_write will crash or trigger asserts when
not finding existing extents, even despite trying to paper over this
with the XFS_BMAPI_CONVERT_ONLY flag.
Instead of special casing non-transaction conversions in the already
way too complicated xfs_bmapi_write just add a new helper for the much
simpler non-transactional COW fork case, which simplify ignores not
found extents.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Besides simplifying the code a bit this allows to actually implement
the behavior of using COW preallocation for non-COW data mentioned
in the current comments.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
This only matters if we want to write data through the COW fork that is
not actually an overwrite of existing data. Reasons for that are
speculative COW fork allocations using the cowextsize, or a mode where
we always write through the COW fork. Currently both can't actually
happen, but I plan to enable them.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
While using delalloc for extsize hints is generally a good idea, the
current code that does so only for COW doesn't help us much and creates
a lot of special cases. Switch it to use real allocations like we
do for direct I/O.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
We speculatively allocate extents in the COW fork to reduce
fragmentation. But when we write data into such COW fork blocks that
do now shadow an allocation in the data fork SEEK_DATA will not
correctly report it, as it only looks at the data fork extents.
The only reason why that hasn't been an issue so far is because
we even use these speculative COW fork preallocations over holes in
the data fork at all for buffered writes, and blocks in the COW
fork that are written by direct writes are moved into the data
fork immediately at I/O completion time.
Add a new set of iomap_ops for SEEK_HOLE/SEEK_DATA which looks into
both the COW and data fork, and reports all COW extents as unwritten
to the iomap layer. While this isn't strictly true for COW fork
extents that were already converted to real extents, the practical
semantics that you can't read data from them until they are moved
into the data fork are very similar, and this will force the iomap
layer into probing the extents for actually present data.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Move checking for invalid zero blocks and setting of various iomap flags
into this helper. Also make it deal with "raw" delalloc extents to
avoid clutter in the callers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
There is a plan to build the kernel with -Wimplicit-fallthrough and
these places in the code produced warnings (W=1). Fix them up.
This commit remove the following warnings:
fs/ext4/indirect.c:1182:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
fs/ext4/indirect.c:1188:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
fs/ext4/indirect.c:1432:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
fs/ext4/indirect.c:1440:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
Signed-off-by: Mathieu Malaterre <malat@debian.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
There is a plan to build the kernel with -Wimplicit-fallthrough and
these places in the code produced warnings (W=1). Fix them up.
This commit remove the following warnings:
fs/ext4/hash.c:233:15: warning: this statement may fall through [-Wimplicit-fallthrough=]
fs/ext4/hash.c:246:15: warning: this statement may fall through [-Wimplicit-fallthrough=]
Signed-off-by: Mathieu Malaterre <malat@debian.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
We're unintentionally limiting the number of slots per nfsv4.1 session
to 10. Often more than 10 simultaneous RPCs are needed for the best
performance.
This calculation was meant to prevent any one client from using up more
than a third of the limit we set for total memory use across all clients
and sessions. Instead, it's limiting the client to a third of the
maximum for a single session.
Fix this.
Reported-by: Chris Tracy <ctracy@engr.scu.edu>
Cc: stable@vger.kernel.org
Fixes: de766e5704 "nfsd: give out fewer session slots as limit approaches"
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Making waits for response to fanotify permission events interruptible
can result in EINTR returns from open(2) or other syscalls when there's
e.g. AV software that's monitoring the file. Orion reports that e.g.
bash is complaining like:
bash: /etc/bash_completion.d/itweb-settings.bash: Interrupted system call
So for now convert the wait from interruptible to only killable one.
That is mostly invisible to userspace. Sadly this breaks hibernation
with fanotify permission events pending again but we have to put more
thought into how to fix this without regressing userspace visible
behavior.
Reported-by: Orion Poplawski <orion@nwra.com>
Signed-off-by: Jan Kara <jack@suse.cz>
After setxattr, the nfsv3 cached the acl which set by user.
But at the backend, the shared file system (eg. ext4) will check
the acl, if it can merged with mode, it won't add acl to the file.
So, the nfsv3 cached acl is redundant.
Don't 'set_cached_acl' when setxattr.
Signed-off-by: ZhangXiaoxu <zhangxiaoxu5@huawei.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
As the block and SCSI layouts can only read/write fixed-length
blocks, we must perform read-modify-write when data to be written is
not aligned to a block boundary or smaller than the block size.
(612aa983a0 pnfs: add flag to force read-modify-write in ->write_begin)
The current code tries to see if we have to do read-modify-write
on block-oriented pNFS layouts by just checking !PageUptodate(page),
but the same condition also applies for overwriting of any uncached
potions of existing files, making such operations excessively slow
even it is block-aligned.
The change does not affect the optimization for modify-write-read
cases (38c73044f5 NFS: read-modify-write page updating),
because partial update of !PageUptodate() pages can only happen
in layouts that can do arbitrary length read/write and never
in block-based ones.
Testing results:
We ran fio on one of the pNFS clients running 4.20 kernel
(vanilla and patched) in this configuration to read/write/overwrite
files on the storage array, exported as pnfs share by the server.
pNFS clients ---1G Ethernet--- pNFS server
(HP DL360 G8) (HP DL360 G8)
| |
| |
+------8G Fiber Channel--------+
|
Storage Array
(HP P6350)
Throughput of overwrite (both buffered and O_SYNC) is noticeably
improved.
Ops. |block size| Throughput |
| (KiB) | (MiB/s) |
| | 4.20 | patched|
---------+----------+----------------+
buffered | 4| 21.3 | 232 |
overwrite| 32| 22.2 | 256 |
| 512| 22.4 | 260 |
---------+----------+----------------+
O_SYNC | 4| 3.84| 4.77|
overwrite| 32| 12.2 | 32.0 |
| 512| 18.5 | 152 |
---------+----------+----------------+
Read and write (buffered and O_SYNC) by the same client remain unchanged
by the patch either negatively or positively, as they should do.
Ops. |block size| Throughput |
| (KiB) | (MiB/s) |
| | 4.20 | patched|
---------+----------+----------------+
read | 4| 548 | 550 |
| 32| 547 | 551 |
| 512| 548 | 551 |
---------+----------+----------------+
buffered | 4| 237 | 244 |
write | 32| 261 | 268 |
| 512| 265 | 272 |
---------+----------+----------------+
O_SYNC | 4| 0.46| 0.46|
write | 32| 3.60| 3.57|
| 512| 105 | 106 |
---------+----------+----------------+
Signed-off-by: Kazuo Ito <ito_kazuo_g3@lab.ntt.co.jp>
Tested-by: Hiroyuki Watanabe <watanabe.hiroyuki@lab.ntt.co.jp>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
nfs_want_read_modify_write() didn't check for !PagePrivate when pNFS
block or SCSI layout was in use, therefore we could lose data forever
if the page being written was filled by a read before completion.
Signed-off-by: Kazuo Ito <ito_kazuo_g3@lab.ntt.co.jp>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
This fixes the typo in comments of nfs_readdir_alloc_pages().
Because nfs_readdir_large_page and nfs_readdir_free_pagearray had been
renamed.
Signed-off-by: Liguang Zhang <zhangliguang@linux.alibaba.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
When listing very large directories via NFS, clients may take a long
time to complete. There are about three factors involved:
First of all, ls and practically every other method of listing a
directory including python os.listdir and find rely on libc readdir().
However readdir() only reads 32K of directory entries at a time, which
means that if you have a lot of files in the same directory, it is going
to take an insanely long time to read all the directory entries.
Secondly, libc readdir() reads 32K of directory entries at a time, in
kernel space 32K buffer split into 8 pages. One NFS readdirplus rpc will
be called for one page, which introduces many readdirplus rpc calls.
Lastly, one NFS readdirplus rpc asks for 32K data (filled by nfs_dentry)
to fill one page (filled by dentry), we found that nearly one third of
data was wasted.
To solve above problems, pagecache mechanism was introduced. One NFS
readdirplus rpc will ask for a large data (more than 32k), the data can
fill more than one page, the cached pages can be used for next readdir
call. This can reduce many readdirplus rpc calls and improve readdirplus
performance.
TESTING:
When listing very large directories(include 300 thousand files) via NFS
time ls -l /nfs_mount | wc -l
without the patch:
300001
real 1m53.524s
user 0m2.314s
sys 0m2.599s
with the patch:
300001
real 0m23.487s
user 0m2.305s
sys 0m2.558s
Improved performance: 79.6%
readdirplus rpc calls decrease: 85%
Signed-off-by: Liguang Zhang <zhangliguang@linux.alibaba.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
In the rare and unsupported case of a hostname list nfs_parse_devname
will modify dev_name. There is no need to modify dev_name as the all
that is being computed is the length of the hostname, so the computed
length can just be shorted.
Fixes: dc04589827 ("NFS: Use common device name parsing logic for NFSv4 and NFSv2/v3")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Drop LIST_HEAD where the variable it declares has never
been used.
The semantic patch that fixes this problem is as follows:
(http://coccinelle.lip6.fr/)
// <smpl>
@@
identifier x;
@@
- LIST_HEAD(x);
... when != x
// </smpl>
Fixes: 0e20162ed1 ("NFSv4.1 Use MDS auth flavor for data server connection")
Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Fix up some compiler warnings about function parameters, etc not being
correctly described or formatted.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
All the allocations that we can hit in the NFS layer and sunrpc layers
themselves are already marked as GFP_NOFS, but we need to ensure that
any calls to generic kernel functionality do the right thing as well.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Allow the caller to pass error information when cleaning up a failed
I/O request so that we can conditionally take action to cancel the
request altogether if the error turned out to be fatal.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
In several places we're just moving the struct nfs_page from one list to
another by first removing from the existing list, then adding to the new
one.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
If the I/O completion failed with a fatal error, then we should just
exit nfs_pageio_complete_mirror() rather than try to recoalesce.
Fixes: a7d42ddb30 ("nfs: add mirroring support to pgio layer")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: stable@vger.kernel.org # v4.0+
Whether we need to exit early, or just reprocess the list, we
must not lost track of the request which failed to get recoalesced.
Fixes: 03d5eb65b5 ("NFS: Fix a memory leak in nfs_do_recoalesce")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: stable@vger.kernel.org # v4.0+
When we fail to add the request to the I/O queue, we currently leave it
to the caller to free the failed request. However since some of the
requests that fail are actually created by nfs_pageio_add_request()
itself, and are not passed back the caller, this leads to a leakage
issue, which can again cause page locks to leak.
This commit addresses the leakage by freeing the created requests on
error, using desc->pg_completion_ops->error_cleanup()
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Fixes: a7d42ddb30 ("nfs: add mirroring support to pgio layer")
Cc: stable@vger.kernel.org # v4.0: c18b96a1b8: nfs: clean up rest of reqs
Cc: stable@vger.kernel.org # v4.0: d600ad1f2b: NFS41: pop some layoutget
Cc: stable@vger.kernel.org # v4.0+
Pull keys fixes from James Morris:
- Handle quotas better, allowing full quota to be reached.
- Fix the creation of shortcuts in the assoc_array internal
representation when the index key needs to be an exact multiple of
the machine word size.
- Fix a dependency loop between the request_key contruction record and
the request_key authentication key. The construction record isn't
really necessary and can be dispensed with.
- Set the timestamp on a new key rather than leaving it as 0. This
would ordinarily be fine - provided the system clock is never set to
a time before 1970
* 'fixes-v5.1-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security:
keys: Timestamp new keys
keys: Fix dependency loop between construction record and auth key
assoc_array: Fix shortcut creation
KEYS: allow reaching the keys quotas exactly
Commit 8099b047ec ("exec: load_script: don't blindly truncate
shebang string") was trying to protect against a confused exec of a
truncated interpreter path. However, it was overeager and also refused
to truncate arguments as well, which broke userspace, and it was
reverted. This attempts the protection again, but allows arguments to
remain truncated. In an effort to improve readability, helper functions
and comments have been added.
Co-developed-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Samuel Dionne-Riel <samuel@dionne-riel.com>
Cc: Richard Weinberger <richard.weinberger@gmail.com>
Cc: Graham Christensen <graham@grahamc.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Create a separate magic16 check function so that we don't run afoul of
static checkers.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Since statx, every filesystem should fill the attributes/attributes_mask
in routine getattr. But the generic_fillattr has not fill that, so add
ext2_getattr to do this. This can fix generic/424 while testing ext2.
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Signed-off-by: Jan Kara <jack@suse.cz>
When waiting for response to fanotify permission events, we currently
use uninterruptible waits. That makes code simple however it can cause
lots of processes to end up in uninterruptible sleep with hard reboot
being the only alternative in case fanotify listener process stops
responding (e.g. due to a bug in its implementation). Uninterruptible
sleep also makes system hibernation fail if the listener gets frozen
before the process generating fanotify permission event.
Fix these problems by using interruptible sleep for waiting for response
to fanotify event. This is slightly tricky though - we have to
detect when the event got already reported to userspace as in that
case we must not free the event. Instead we push the responsibility for
freeing the event to the process that will write response to the
event.
Reported-by: Orion Poplawski <orion@nwra.com>
Reported-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Track whether permission event got already reported to userspace and
whether userspace already answered to the permission event. Protect
stores to this field together with updates to ->response field by
group->notification_lock. This will allow aborting wait for reply to
permission event from userspace.
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Simplify iteration cleaning access_list in fanotify_release(). That will
make following changes more obvious.
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Create function to remove event from the notification list. Later it will
be used from more places.
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
get_one_event() has a single caller and that just locks
notification_lock around the call. Move locking inside get_one_event()
as that will make using ->response field for permission event state
easier.
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Fold dequeue_event() into process_access_response(). This will make
changes to use of ->response field easier.
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
While we can only truncate a block under the page lock for the current
page, there is no high-level synchronization for moving extents from the
COW to the data fork. This means that for example we can have another
thread doing a direct I/O completion that moves extents from the COW to
the data fork race with writeback. While this race is very hard to hit
the always_cow seems to reproduce it reasonably well, and it also exists
without that. Because of that there is a chance that a delalloc
conversion for the COW fork might not find any extents to convert. In
that case we should retry the whole block lookup and now find the blocks
in the data fork.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Now that we properly handle the race with truncate in the delalloc
allocator there is no need to short cut this exceptional case earlier
on.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
This function is a small wrapper only used by the writeback code, so
move it together with the writeback code and simplify it down to the
glorified do { } while loop that is now is.
A few bits intentionally got lost here: no need to call xfs_qm_dqattach
because quotas are always attached when we create the delalloc
reservation, and no need for the imap->br_startblock == 0 check given
that xfs_bmapi_convert_delalloc already has a WARN_ON_ONCE for exactly
that condition.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
This way we can actually count how many bytes got converted and how many
calls we need, unlike in the caller which doesn't have the detailed
view.
Note that this includes a slight change in behavior as the
xs_xstrat_quick is now bumped for every allocation instead of just the
one covering the requested writeback offset, which makes a lot more
sense.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
No need to deal with the transaction and the inode locking in the
caller. Note that we also switch to passing whichfork as the second
paramter, matching what most related functions do.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Delalloc conversion has traditionally been part of our function to
allocate blocks on disk (first xfs_bmapi, then xfs_bmapi_write), but
delalloc conversion is a little special as we really do not want
to allocate blocks over holes, for which we don't have reservations.
Split the delalloc conversions into a separate helper to keep the
code simple and structured.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
We want to be able to reuse them for the upcoming dedidcated delalloc
convert routine.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Move boilerplate code from the callers into xfs_bmap_btree_to_extents:
- exit early without failure if we don't need to convert to the
extent format
- assert that we have a btree cursor
- don't reinitialize the passed in logflags argument
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
We already ensure all data fits into s_maxbytes in the write / fault
path. The only reason we have them here is that they were copy and
pasted from xfs_bmapi_read when we stopped using that function.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The io_type field contains what is basically a summary of information
from the inode fork and the imap. But we can just as easily use that
information directly, simplifying a few bits here and there and
improving the trace points.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
that could prevent clients from reclaiming state after a kernel upgrade.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJcZzZHAAoJECebzXlCjuG+EOsQALVuwSJqQh4GUVMSBYzL6Ov4
SfinB8LJ8/1HwngSvRB3xQ4HiOtpFSNkjzfFYE7epy6augY8tRRnHGbnlHbsG5vI
wQqTR6PbSq2mupgpi2WGRlRh521SDOi8V49fplUC+FuV7dJT/wm0hgdKsHCPHPX4
TEYPglsvG6PLu5IcAofNac9PVZH21s3yVIKvqd6yifED5lhopdNw210s5DtzvugI
g2JgHOhTfana+xQS/cJ1U8JHbbpM7jwOXAJ7IWD8k4GXdAW03X6jNOcseudcBTQY
qSL33//6Xdu0r0uI21z4ZWxSWCOtt8YvnbMoG4EBqh3DpKbUpExh8j4eIyNPSuSF
Y/8iAVJ9KWYhWO+IVPqvHVXz4mCIDK+f7iJ/m+lLjOQmWkpp6koeUDjKs4k9zBUC
mbGTOrh0TJzXvKWKEU5Qy7meZVJGUpV+9ca+cDs5XN7Xa3blTp+5VrRVeDgKO5Kx
OF3Y3IBOWhqN7+kEH98RvdZAmtbO0zg02IEIHOMPxH69JU8o0EsEni1LXsqDJrRi
sLVYXvLwdPLfkqSjpI8xNeaoFXeelopx8Re+2oNEFIEvsfeT5XikbQHoqgFJNsyk
hz7PHwuyGjc6NJRRSBUKYouWKPP4rrM7ZiOSyIEDYIIwyhBirpjrECaHzdi3D5j+
xUyFGMF5F3wk1fdQHPfD
=NopI
-----END PGP SIGNATURE-----
Merge tag 'nfsd-5.0-2' of git://linux-nfs.org/~bfields/linux
Pull more nfsd fixes from Bruce Fields:
"Two small fixes, one for crashes using nfs/krb5 with older enctypes,
one that could prevent clients from reclaiming state after a kernel
upgrade"
* tag 'nfsd-5.0-2' of git://linux-nfs.org/~bfields/linux:
sunrpc: fix 4 more call sites that were using stack memory with a scatterlist
Revert "nfsd4: return default lease period"
- Make sure Send CQ is allocated on an existing compvec
- Properly check debugfs dentry before using it
- Don't use page_file_mapping() after removing a page
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAlxnMQ0ACgkQ18tUv7Cl
QOsbhQ//VhgoXX25xHrApLz8wMuYPNOboDFSUf0O1GWoHi3opHnP+9LPf/iZkRQy
YS0ufcO95i1LGjZLb8ac9hBWkko8TBl/dIONsG4ppf2bAbiVuag848wehi8hsGba
zaSsXV6qdibq4qZsyK35hh0cHVHDgB1EMTu7AVORdvXsTHVX3xL86vts2y2VSLKv
w9yKQBg4E4pWwENi7v77icSuGg/WpwfKnYxBzG6JPXuHQLGidyc/HrnVmLwhd6DQ
0Sa6nzOAvgjjgVibB+tJfsitScmMTsaxulvHsm5iLjPJZ8SUjxYvAPl3AZdCYPvU
XaADy8nrvXJUe9APhMINbkoxnF4W/OPnUMG3bWkWp2LeNZvk5l7VOzTW5Sh49Xyk
pBAOd7qr3kfjFdvzypVz9NeXuS6BsTUA6LAudo8rF7nxi8jHPp6L+zZNWVrPIjY0
+bNIj3K1Bji3jU9vTHyTzxDRB/4ZnzJaPF2Gv/5Y2cvkI7mfzHUz5p6cAU1OPIVB
kuhZXkQFEPSS2OV6MUOe/HgmtY0oLM3XU9cEaFkLz59D1kb1fjO/yUu9YBQMq6Ke
o6b7Dwh4WvLVN/AbgegKOnp5G0/ljmz6y7ML0AElYXg1iT4k0zE+qJpMWhOTRJnd
+jf4hSS+l7p7D1ed+uqdMS/jc1s5vcuxwYDQUIutELjA/TCbLNI=
=28v+
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-5.0-4' of git://git.linux-nfs.org/projects/anna/linux-nfs
Pull more NFS client fixes from Anna Schumaker:
"Three fixes this time.
Nicolas's is for xprtrdma completion vector allocation on single-core
systems. Greg's adds an error check when allocating a debugfs dentry.
And Ben's is an additional fix for nfs_page_async_flush() to prevent
pages from accidentally getting truncated.
Summary:
- Make sure Send CQ is allocated on an existing compvec
- Properly check debugfs dentry before using it
- Don't use page_file_mapping() after removing a page"
* tag 'nfs-for-5.0-4' of git://git.linux-nfs.org/projects/anna/linux-nfs:
NFS: Don't use page_file_mapping after removing the page
rpc: properly check debugfs dentry before using it
xprtrdma: Make sure Send CQ is allocated on an existing compvec
The preadv2 and pwritev2 syscalls are supposed to emulate the readv and
writev syscalls when offset == -1. Therefore the compat code should
check for offset before calling do_compat_preadv64 and
do_compat_pwritev64. This is the case for the preadv2 and pwritev2
syscalls, but handling of offset == -1 is missing in their 64-bit
equivalent.
This patch fixes that, calling do_compat_readv and do_compat_writev when
offset == -1. This fixes the following glibc tests on x32:
- misc/tst-preadvwritev2
- misc/tst-preadvwritev64v2
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: H.J. Lu <hjl.tools@gmail.com>
Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Let's use xattr_prefix instead of open code.
No logic changes.
Signed-off-by: Gao Xiang <gaoxiang25@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Some works after roll-forward recovery can get an error which will release
all the data structures. Let's flush them in order to make it clean.
One possible corruption came from:
[ 90.400500] list_del corruption. prev->next should be ffffffed1f566208, but was (null)
[ 90.675349] Call trace:
[ 90.677869] __list_del_entry_valid+0x94/0xb4
[ 90.682351] remove_dirty_inode+0xac/0x114
[ 90.686563] __f2fs_write_data_pages+0x6a8/0x6c8
[ 90.691302] f2fs_write_data_pages+0x40/0x4c
[ 90.695695] do_writepages+0x80/0xf0
[ 90.699372] __writeback_single_inode+0xdc/0x4ac
[ 90.704113] writeback_sb_inodes+0x280/0x440
[ 90.708501] wb_writeback+0x1b8/0x3d0
[ 90.712267] wb_workfn+0x1a8/0x4d4
[ 90.715765] process_one_work+0x1c0/0x3d4
[ 90.719883] worker_thread+0x224/0x344
[ 90.723739] kthread+0x120/0x130
[ 90.727055] ret_from_fork+0x10/0x18
Reported-by: Sahitya Tummala <stummala@codeaurora.org>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
After quota_off, we'll get some dirty blocks. If put_super don't have a chance
to flush them by checkpoint, it causes NULL pointer exception in end_io after
iput(node_inode). (e.g., by checkpoint=disable)
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Otherwise, it wakes up discard thread which will sleep again by busy IOs
in a loop.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If every discard were issued successfully, we can avoid further discard.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This mode returns mount() quickly with EAGAIN. We can trigger this by
shutdown(F2FS_GOING_DOWN_NEED_FSCK).
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In the request_key() upcall mechanism there's a dependency loop by which if
a key type driver overrides the ->request_key hook and the userspace side
manages to lose the authorisation key, the auth key and the internal
construction record (struct key_construction) can keep each other pinned.
Fix this by the following changes:
(1) Killing off the construction record and using the auth key instead.
(2) Including the operation name in the auth key payload and making the
payload available outside of security/keys/.
(3) The ->request_key hook is given the authkey instead of the cons
record and operation name.
Changes (2) and (3) allow the auth key to naturally be cleaned up if the
keyring it is in is destroyed or cleared or the auth key is unlinked.
Fixes: 7ee02a316600 ("keys: Fix dependency loop between construction record and auth key")
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: James Morris <james.morris@microsoft.com>
The netfilter conflicts were rather simple overlapping
changes.
However, the cls_tcindex.c stuff was a bit more complex.
On the 'net' side, Cong is fixing several races and memory
leaks. Whilst on the 'net-next' side we have Vlad adding
the rtnl-ness support.
What I've decided to do, in order to resolve this, is revert the
conversion over to using a workqueue that Cong did, bringing us back
to pure RCU. I did it this way because I believe that either Cong's
races don't apply with have Vlad did things, or Cong will have to
implement the race fix slightly differently.
Signed-off-by: David S. Miller <davem@davemloft.net>
-----BEGIN PGP SIGNATURE-----
iQFRBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAlxgqNUeHHRvcnZhbGRz
QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGwsoH+OVXu0NQofwTvVru
8lgF3BSDG2mhf7mxbBBlBizGVy9jnjRNGCFMC+Jq8IwiFLwprja/G27kaDTkpuF1
PHC3yfjKvjTeUP5aNdHlmxv6j1sSJfZl0y46DQal4UeTG/Giq8TFTi+Tbz7Wb/WV
yCx4Lr8okAwTuNhnL8ojUCVIpd3c8QsyR9v6nEQ14Mj+MvEbokyTkMJV0bzOrM38
JOB+/X1XY4JPZ6o3MoXrBca3bxbAJzMneq+9CWw1U5eiIG3msg4a+Ua3++RQMDNr
8BP0yCZ6wo32S8uu0PI6HrZaBnLYi5g9Wh7Q7yc0mn1Uh1zWFykA6TtqK90agJeR
A6Ktjw==
=scY4
-----END PGP SIGNATURE-----
Merge tag 'v5.0-rc6' into for-5.1/block
Pull in 5.0-rc6 to avoid a dumb merge conflict with fs/iomap.c.
This is needed since io_uring is now based on the block branch,
to avoid a conflict between the multi-page bvecs and the bits
of io_uring that touch the core block parts.
* tag 'v5.0-rc6': (525 commits)
Linux 5.0-rc6
x86/mm: Make set_pmd_at() paravirt aware
MAINTAINERS: Update the ocores i2c bus driver maintainer, etc
blk-mq: remove duplicated definition of blk_mq_freeze_queue
Blk-iolatency: warn on negative inflight IO counter
blk-iolatency: fix IO hang due to negative inflight counter
MAINTAINERS: unify reference to xen-devel list
x86/mm/cpa: Fix set_mce_nospec()
futex: Handle early deadlock return correctly
futex: Fix barrier comment
net: dsa: b53: Fix for failure when irq is not defined in dt
blktrace: Show requests without sector
mips: cm: reprime error cause
mips: loongson64: remove unreachable(), fix loongson_poweroff().
sit: check if IPv6 enabled before calling ip6_err_gen_icmpv6_unreach()
geneve: should not call rt6_lookup() when ipv6 was disabled
KVM: nVMX: unconditionally cancel preemption timer in free_nested (CVE-2019-7221)
KVM: x86: work around leak of uninitialized stack contents (CVE-2019-7222)
kvm: fix kvm_ioctl_create_device() reference counting (CVE-2019-6974)
signal: Better detection of synchronous signals
...
This patch pulls the trigger for multi-page bvecs.
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This patch introduces one extra iterator variable to bio_for_each_segment_all(),
then we can allow bio_for_each_segment_all() to iterate over multi-page bvec.
Given it is just one mechannical & simple change on all bio_for_each_segment_all()
users, this patch does tree-wide change in one single patch, so that we can
avoid to use a temporary helper for this conversion.
Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Once multi-page bvec is enabled, the last bvec may include more than one
page, this patch use mp_bvec_last_segment() to truncate the bio.
Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
bio_readpage_error currently uses bi_vcnt to decide if it is worth
retrying an I/O. But the vector count is mostly an implementation
artifact - it really should figure out if there is more than a
single sector worth retrying. Use bi_size for that and shift by
PAGE_SHIFT. This really should be blocks/sectors, but given that
btrfs doesn't support a sector size different from the PAGE_SIZE
using the page size keeps the changes to a minimum.
Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When XFS creates an O_TMPFILE file, the inode is created with nlink = 1,
put on the unlinked list, and then the VFS sets nlink = 0 in d_tmpfile.
If we crash before anything logs the inode (it's dirty incore but the
vfs doesn't tell us it's dirty so we never log that change), the iunlink
processing part of recovery will then explode with a pile of:
XFS: Assertion failed: VFS_I(ip)->i_nlink == 0, file:
fs/xfs/xfs_log_recover.c, line: 5072
Worse yet, since nlink is nonzero, the inodes also don't get cleaned up
and they just leak until the next xfs_repair run.
Therefore, change xfs_iunlink to require that inodes being put on the
unlinked list have nlink == 0, change the tmpfile callers to instantiate
nodes that way, and set the nlink to 1 just prior to calling d_tmpfile.
Fix the comment for xfs_iunlink while we're at it.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Log recovery frees all the inodes stored in the unlinked list, which can
cause expansion of the free inode btree. The ifree code skips block
reservations if it thinks there's a per-AG space reservation, but we
don't set up the reservation until after log recovery, which means that
a finobt expansion blows up in xfs_trans_mod_sb when we exceed the
transaction's block reservation.
To fix this, we set the "no finobt reservation" flag to true when we
create the xfs_mount and only set it to false if we confirm that every
AG had enough free space to put aside for the finobt.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Rename this flag variable to imply more strongly that it's related to
the free inode btree (finobt) operation. No functional changes.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
This reverts commit 8099b047ec.
It turns out that people do actually depend on the shebang string being
truncated, and on the fact that an interpreter (like perl) will often
just re-interpret it entirely to get the full argument list.
Reported-by: Samuel Dionne-Riel <samuel@dionne-riel.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Don't update the superblock s_rev_level during mount if it isn't
actually necessary, only if superblock features are being set by
the kernel. This was originally added for ext3 since it always
set the INCOMPAT_RECOVER and HAS_JOURNAL features during mount,
but this is not needed since no journal mode was added to ext4.
That will allow Geert to mount his 20-year-old ext2 rev 0.0 m68k
filesystem, as a testament of the backward compatibility of ext4.
Fixes: 0390131ba8 ("ext4: Allow ext4 to run without a journal")
Signed-off-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
The functions jbd2_superblock_csum_verify() and
jbd2_superblock_csum_set() only get called from one location, so to
simplify things, fold them into their callers.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
The jbd2 superblock is lockless now, so there is probably a race
condition between writing it so disk and modifing contents of it, which
may lead to checksum error. The following race is the one case that we
have captured.
jbd2 fsstress
jbd2_journal_commit_transaction
jbd2_journal_update_sb_log_tail
jbd2_write_superblock
jbd2_superblock_csum_set jbd2_journal_revoke
jbd2_journal_set_features(revork)
modify superblock
submit_bh(checksum incorrect)
Fix this by locking the buffer head before modifing it. We always
write the jbd2 superblock after we modify it, so this just means
calling the lock_buffer() a little earlier.
This checksum corruption problem can be reproduced by xfstests
generic/475.
Reported-by: zhangyi (F) <yi.zhang@huawei.com>
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
This reverts commit 2a5f14f279.
This patch causes xfstests generic/311 to fail. Reverting this for
now until we have a proper fix.
Signed-off-by: Abhi Das <adas@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
For VFS listxattr calls, xfs_xattr_put_listent calls
__xfs_xattr_put_listent twice if it sees an attribute
"trusted.SGI_ACL_FILE": once for that name, and again for
"system.posix_acl_access". Unfortunately, if we happen to run out of
buffer space while emitting the first name, we set count to -1 (so that
we can feed ERANGE to the caller). The second invocation doesn't check that
the context parameters make sense and overwrites the byte before the
buffer, triggering a KASAN report:
==================================================================
BUG: KASAN: slab-out-of-bounds in strncpy+0xb3/0xd0
Write of size 1 at addr ffff88807fbd317f by task syz/1113
CPU: 3 PID: 1113 Comm: syz Not tainted 5.0.0-rc6-xfsx #rc6
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-1ubuntu1 04/01/2014
Call Trace:
dump_stack+0xcc/0x180
print_address_description+0x6c/0x23c
kasan_report.cold.3+0x1c/0x35
strncpy+0xb3/0xd0
__xfs_xattr_put_listent+0x1a9/0x2c0 [xfs]
xfs_attr_list_int_ilocked+0x11af/0x1800 [xfs]
xfs_attr_list_int+0x20c/0x2e0 [xfs]
xfs_vn_listxattr+0x225/0x320 [xfs]
listxattr+0x11f/0x1b0
path_listxattr+0xbd/0x130
do_syscall_64+0x139/0x560
While we're at it we add an assert to the other put_listent to avoid
this sort of thing ever happening to the attrlist_by_handle code.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
This reverts commit d6ebf5088f.
I forgot that the kernel's default lease period should never be
decreased!
After a kernel upgrade, the kernel has no way of knowing on its own what
the previous lease time was. Unless userspace tells it otherwise, it
will assume the previous lease period was the same.
So if we decrease this value in a kernel upgrade, we end up enforcing a
grace period that's too short, and clients will fail to reclaim state in
time. Symptoms may include EIO and log messages like "NFS:
nfs4_reclaim_open_state: Lock reclaim failed!"
There was no real justification for the lease period decrease anyway.
Reported-by: Donald Buczek <buczek@molgen.mpg.de>
Fixes: d6ebf5088f "nfsd4: return default lease period"
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Fanotify now uses exportfs_encode_inode_fh() so it needs to select
EXPORTFS.
Fixes: e9e0c89030 "fanotify: encode file identifier for FAN_REPORT_FID"
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Certain NFS results (eg. READLINK) might expect a data payload that
is not an exact multiple of 4 bytes. In this case, XDR encoding
is required to pad that payload so its length on the wire is a
multiple of 4 bytes. The constants that define the maximum size of
each NFS result do not appear to account for this extra word.
In each case where the data payload is to be received into pages:
- 1 word is added to the size of the receive buffer allocated by
call_allocate
- rpc_inline_rcv_pages subtracts 1 word from @hdrsize so that the
extra buffer space falls into the rcv_buf's tail iovec
- If buf->pagelen is word-aligned, an XDR pad is not needed and
is thus removed from the tail
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
prepare_reply_buffer() and its NFSv4 equivalents expose the details
of the RPC header and the auth slack values to upper layer
consumers, creating a layering violation, and duplicating code.
Remedy these issues by adding a new RPC client API that hides those
details from upper layers in a common helper function.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
These can help field troubleshooting without needing the overhead
of a full network capture (ie, tcpdump).
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
This issue is now captured by a trace point in the RPC client.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Having access to the controlling rpc_rqst means a trace point in the
XDR code can report:
- the XID
- the task ID and client ID
- the p_name of RPC being processed
Subsequent patches will introduce such trace points.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
If a filesystem returns ENOSYS from opendir and thus opts out of
opendir and releasedir requests, it almost certainly would also like
readdir results cached. Default open_flags to FOPEN_KEEP_CACHE and
FOPEN_CACHE_DIR in that case.
With this patch, I've measured recursive directory enumeration across
large FUSE mounts to be faster than native mounts.
Signed-off-by: Chad Austin <chadaustin@fb.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Allow filesystems to return ENOSYS from opendir, preventing the kernel from
sending opendir and releasedir messages in the future. This avoids
userspace transitions when filesystems don't need to keep track of state
per directory handle.
A new capability flag, FUSE_NO_OPENDIR_SUPPORT, parallels
FUSE_NO_OPEN_SUPPORT, indicating the new semantics for returning ENOSYS
from opendir.
Signed-off-by: Chad Austin <chadaustin@fb.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Bad inode checks were done done in various places, and move them into
fuse_file_{read|write}_iter().
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
This is cleanup, as well as allowing switching between I/O modes while the
file is open in the future.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Nothing preventing copy_file_range to work on files opened with
FOPEN_DIRECT_IO.
Fixes: 88bc7d5097 ("fuse: add support for copy_file_range()")
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
The default splice implementation is grossly inefficient and the iter based
ones work just fine, so use those instead. I've measured an 8x speedup for
splice write (with len = 128k).
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Switch to using the async directo IO code path in fuse_direct_read_iter()
and fuse_direct_write_iter(). This is especially important in connection
with loop devices with direct IO enabled as loop assumes async direct io is
actually async.
Signed-off-by: Martin Raiber <martin@urbackup.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
The only caller that needs fc->aborted set is fuse_conn_abort_write().
Setting fc->aborted is now racy (fuse_abort_conn() may already be in
progress or finished) but there's no reason to care.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
This is rather natural action after previous patches, and it just decreases
load of fc->lock.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
This continues previous patch and introduces the same protection for
nlookup field.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
To minimize contention of fc->lock, this patch introduces a new spinlock
for protection fuse_inode metadata:
fuse_inode:
writectr
writepages
write_files
queued_writes
attr_version
inode:
i_size
i_nlink
i_mtime
i_ctime
Also, it protects the fields changed in fuse_change_attributes_common()
(too many to list).
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
This patch makes fc->attr_version of atomic64_t type, so fc->lock won't be
needed to read or modify it anymore.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Here is preparation for next patches, which introduce new fi->lock for
protection of ff->write_entry linked into fi->write_files.
This patch just passes new argument to the function.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
When queue_interrupt() is called from fuse_dev_do_write(), it came from
userspace directly. Userspace may pass any request id, even the request's
we have not interrupted (or even background's request). This patch adds
sanity check to make kernel safe against that.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Currently, we wait on req->waitq in request_wait_answer() function only,
and it's never used for background requests. Since wake_up() is not a
light-weight macros, instead of this, it unfolds in really called function,
which makes locking operations taking some cpu cycles, let's avoid its call
for the case we definitely know it's completely useless.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
We take global fiq->waitq.lock every time, when we are in this function,
but interrupted requests are just small subset of all requests. This patch
optimizes request_end() and makes it to take the lock when it's really
needed.
queue_interrupt() needs small change for that. After req is linked to
interrupt list, we do smp_mb() and check for FR_FINISHED again. In case of
FR_FINISHED bit has appeared, we remove req and leave the function:
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
We should sent signal only in case of interrupt is really queued. Not a
real problem, but this makes the code clearer and intuitive.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
It looks like we can optimize page replacement and avoid copying by simple
updating the request's page.
[SzM: swap with new request's tmp page to avoid use after free.]
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Auxiliary requests chained on req->misc.write.next may be leaked on
truncate. Free these as well if the parent request was truncated off.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Don't reuse the queued request, even if it only contains a single page.
This is needed because previous locking changes (spliting out
fiq->waitq.lock from fc->lock) broke the assumption that request will
remain in FR_PENDING at least until the new page contents are copied.
This fix removes a slight optimization for a rare corner case, so we really
shoudln't care.
Reported-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Fixes: fd22d62ed0 ("fuse: no fc->lock for iqueue parts")
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Restructure the function to better separate the locked and the unlocked
parts. Use the "old_req" local variable to mean only the queued request,
and not any auxiliary requests added onto its misc.write.next list. These
changes are in preparation for the following patch.
Also turn BUG_ON instances into WARN_ON and add a header comment explaining
what the function does.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Call this from fuse_range_is_writeback() and fuse_writepage_in_flight().
Turn a BUG_ON() into a WARN_ON() in the process.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
If a file has been copied up metadata only, and later data is copied up,
upper loses any security.capability xattr it has (underlying filesystem
clears it as upon file write).
From a user's point of view, this is just a file copy-up and that should
not result in losing security.capability xattr. Hence, before data copy
up, save security.capability xattr (if any) and restore it on upper after
data copy up is complete.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Fixes: 0c28887493 ("ovl: A new xattr OVL_XATTR_METACOPY for file on upper")
Cc: <stable@vger.kernel.org> # v4.19+
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: ac46d4f3c4 ("mm/mmu_notifier: use structure for invalidate_range_start/end calls v2")
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
This code is converted to use vmf_error().
Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Merge fixes from Andrew Morton:
"6 fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
mm: proc: smaps_rollup: fix pss_locked calculation
Rename include/{uapi => }/asm-generic/shmparam.h really
Revert "mm: use early_pfn_to_nid in page_ext_init"
mm/gup: fix gup_pmd_range() for dax
Revert "mm: slowly shrink slabs with a relatively small number of objects"
Revert "mm: don't reclaim inodes with many attached pages"
The 'pss_locked' field of smaps_rollup was being calculated incorrectly.
It accumulated the current pss everytime a locked VMA was found. Fix
that by adding to 'pss_locked' the same time as that of 'pss' if the vma
being walked is locked.
Link: http://lkml.kernel.org/r/20190203065425.14650-1-sspatil@android.com
Fixes: 493b0e9d94 ("mm: add /proc/pid/smaps_rollup")
Signed-off-by: Sandeep Patil <sspatil@android.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Daniel Colascione <dancol@google.com>
Cc: <stable@vger.kernel.org> [4.14.x, 4.19.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This reverts commit a76cf1a474 ("mm: don't reclaim inodes with many
attached pages").
This change causes serious changes to page cache and inode cache
behaviour and balance, resulting in major performance regressions when
combining worklaods such as large file copies and kernel compiles.
https://bugzilla.kernel.org/show_bug.cgi?id=202441
This change is a hack to work around the problems introduced by changing
how agressive shrinkers are on small caches in commit 172b06c32b ("mm:
slowly shrink slabs with a relatively small number of objects"). It
creates more problems than it solves, wasn't adequately reviewed or
tested, so it needs to be reverted.
Link: http://lkml.kernel.org/r/20190130041707.27750-2-david@fromorbit.com
Fixes: a76cf1a474 ("mm: don't reclaim inodes with many attached pages")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Cc: Wolfgang Walter <linux@stwm.de>
Cc: Roman Gushchin <guro@fb.com>
Cc: Spock <dairinin@gmail.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since the header is a fixed small maximum size, just use a stack variable
to avoid memory allocation in the write path.
Signed-off-by: Kees Cook <keescook@chromium.org>
If nfs_page_async_flush() removes the page from the mapping, then we can't
use page_file_mapping() on it as nfs_updatepate() is wont to do when
receiving an error. Instead, push the mapping to the stack before the page
is possibly truncated.
Fixes: 8fc75bed96 ("NFS: Fix up return value on fatal errors in nfs_page_async_flush()")
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
If zero-length header happened in ramoops_write_kmsg_hdr(), that means
we will not be able to read back dmesg record later, since it will be
treated as invalid header in ramoops_pstore_read(). So we should not
execute the following code but return the error.
Signed-off-by: Yue Hu <huyue2@yulong.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Since only one single ramoops area allowed at a time, other probes
(like device tree) are meaningless, as it will waste CPU resources.
So let's check for being already initialized first.
Signed-off-by: Yue Hu <huyue2@yulong.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Sometimes pstore_console_write() will write records with zero size
to persistent ram zone, which is unnecessary. It will only increase
resource consumption. Also adjust ramoops_write_kmsg_hdr() to have
same logic if memory allocation fails.
Signed-off-by: Yue Hu <huyue2@yulong.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
nojournal mode unsafe.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAlxiSNUACgkQ8vlZVpUN
gaPcCgf+LTKVur85OUKqeEDRfiCYjNemzgSOve8EtiGnPzG5s8vVeVQ5n880lTzY
c/Q2/CotN1/0slWxF8WmZ6xTzdrdadtdJ2iv6O5R58uj3lz3UK4cXJf+XmvQRj3L
5UqMygRN1omKFCp+dB38SeZuE1amV/tV61Q3ATnVMLi+xZcH8lV+efulNi19p2Gd
EC7DYs2seAiNEJAh9nVSL1LWo3i0OH5JJQCv08+c71CPPqixAYuuEy59q55k6OvW
Jp8mhYd4/5ueoV+rxH8Gft52MaOPvPHaxWvxrIbpKb5u8EZ6WUXGZ74qdK3KG6hG
k29IX8XoXWlTHkfSiOOZLUJuW7n+XQ==
=w43x
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 fix from Ted Ts'o:
"Revert a commit which landed in v5.0-rc1 since it makes fsync in ext4
nojournal mode unsafe"
* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
Revert "ext4: use ext4_write_inode() when fsyncing w/o a journal"
The v5 superblock format added various metadata fields (such as crc,
metadata lsn, owner uuid, etc.) to v4 metadata headers or created
new v5 headers for blocks where no such headers existed on v4. Where
v4 headers did exist, the v5 structures are careful to place v4
metadata at the original location. For example, the magic value is
expected to be at the same location in certain blocks to facilitate
version detection.
While failure of this invariant is likely to cause severe and
obvious problems at runtime, we can detect this condition at compile
time via the more recently added on-disk format check
infrastructure. Since there is no runtime cost, add some offset
checks that start with v5 structure definitions, traverse down to
the first bit of common metadata with v4 and ensure that common
metadata is at the expected offset. Note that we don't care about
blocks which had no v4 header because there is no common metadata in
those cases. No functional changes.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Now that we encode block magic numbers in all the buffer ops, use that
for block type detection in the ag header repair code instead of
encoding magics directly in the repair code.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Add dquot magic numbers to the buffer ops type, in case we ever want to
use them.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Use xfs_verify_magic to check the magic numbers of inodes.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
With the verifier magic value helper in place, we've left a bit more
duplicate code across the verifiers that involve struct
xfs_da3_blkinfo. This includes the da node, xattr leaf and dir leaf
verifiers, all of which perform similar checks for v4 and v5
filesystems.
Create a common helper to verify an xfs_da3_blkinfo structure,
taking care to only access v5 fields where appropriate, and refactor
the aforementioned verifiers to use the helper. No functional
changes.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Most buffer verifiers have hardcoded magic value checks
conditionalized on the version of the filesystem. The magic value
field of the verifier structure facilitates abstraction of some of
this code. Populate the ->magic field of various verifiers to take
advantage of this abstraction. No functional changes.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The dir2 leaf verifiers share the same underlying structure
verification code, but implement six accessor functions to multiplex
the code across the two verifiers. Further, the magic value isn't
sufficiently abstracted such that the common helper has to manually
fix up the magic from the caller on v5 filesystems.
Use the magic field in the verifier structure to eliminate the
duplicate code and clean this all up. No functional change.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The allocation btree verifiers share code that is unable to detect
cross-tree magic value corruptions such as a bnobt block with a
cntbt magic value. Populate the b_ops->magic field of the associated
verifier structures such that the structure verifier can check the
magic value against the expected value based on tree type.
The btree level check requires knowledge of the tree type to
determine the appropriate maximum value. This was previously part of
the hardcoded magic value checks. With that code removed, peek at
the first magic value in the verifier to determine the expected tree
type of the current block.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Similar to the inode btree verifier, the same allocation btree
verifier structure is shared between the by-bno (bnobt) and by-size
(cntbt) btrees. This prevents the ability to distinguish magic
values between them. Separate the verifier into two, one for each
tree, and assign them appropriately. No functional changes.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The inode btree verifier code is shared between the inode btree and
free inode btree because the underlying metadata formats are
essentially equivalent. A side effect of this is that the verifier
cannot determine whether a particular btree block should have an
inobt or finobt magic value.
This logic allows an unfortunate xfs_repair bug to escape detection
where certain level > 0 nodes of the finobt are stamped with inobt
magic by xfs_repair finobt reconstruction. This is fortunately not a
severe problem since the inode btree magic values do not contribute
to any changes in kernel behavior, but we do need a means to detect
and prevent this problem in the future.
Add a field to xfs_buf_ops to store the v4 and v5 superblock magic
values expected by a particular verifier. Add a helper to check an
on-disk magic value against the value expected by the verifier. Call
the helper from the shared [f]inobt verifier code for magic value
verification. This ensures that the inode btree blocks each have the
appropriate magic value based on specific tree type and superblock
version.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The inobt verifier is reused for the inobt and finobt, which
prevents the ability to distinguish between magic values on a
per-tree basis. Create a separate finobt structure in preparation
for changes to enforce the appropriate magic value for the
associated tree. This patch has no functional change.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Most verifiers that check on-disk magic values convert the CPU
endian magic value constant to disk endian to facilitate compile
time optimization of the byte swap and reduce the need for runtime
byte swaps in buffer verifiers. Several buffer verifiers do not
follow this pattern. Update those verifiers for consistency.
Also fix up a random typo in the inode readahead verifier name.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Improve the documentation around xfs_buf_ensure_ops, which is the
function that is responsible for cleaning up the b_ops state of buffers
that go through xrep_findroot_block but don't match anything. Rename
the function to xfs_buf_reverify.
[darrick: this started off as bfoster mods of a previous patch of mine,
but the renaming part is now this separate patch.]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Use a rhashtable to cache the unlinked list incore. This should speed
up unlinked processing considerably when there are a lot of inodes on
the unlinked list because iunlink_remove no longer has to traverse an
entire bucket list to find which inode points to the one being removed.
The incore list structure records "X.next_unlinked = Y" relations, with
the rhashtable using Y to index the records. This makes finding the
inode X that points to a inode Y very quick. If our cache fails to find
anything we can always fall back on the old method.
FWIW this drastically reduces the amount of time it takes to remove
inodes from the unlinked list. I wrote a program to open a lot of
O_TMPFILE files and then close them in the same order, which takes
a very long time if we have to traverse the unlinked lists. With the
ptach, I see:
+ /d/t/tmpfile/tmpfile
Opened 193531 files in 6.33s.
Closed 193531 files in 5.86s
real 0m12.192s
user 0m0.064s
sys 0m11.619s
+ cd /
+ umount /mnt
real 0m0.050s
user 0m0.004s
sys 0m0.030s
And without the patch:
+ /d/t/tmpfile/tmpfile
Opened 193588 files in 6.35s.
Closed 193588 files in 751.61s
real 12m38.853s
user 0m0.084s
sys 12m34.470s
+ cd /
+ umount /mnt
real 0m0.086s
user 0m0.000s
sys 0m0.060s
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Add tracepoints so we can associate high level operations with low level
updates. No functional changes.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
In xfs_iunlink_remove we have two identical calls to
xfs_iunlink_update_inode, so move it out of the if statement to simplify
the code some more.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
There's a loop that searches an unlinked bucket list to find the inode
that points to a given inode. Hoist this into a separate function;
later we'll use our iunlink backref cache to bypass the slow list
operation. No functional changes.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Hoist the functions that update an inode's unlinked pointer updates into
a helper. No functional changes.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Strengthen our checking of the AGI unlinked pointers when we start to
use them for updating the metadata.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Split the AGI unlinked bucket updates into a separate function. No
functional changes.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Add a new helper to check that a per-AG inode pointer is either null or
points somewhere valid within that AG.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Fix some indentation issues with the iunlink functions and reorganize
the tops of the functions to be identical. No functional changes.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
The writeback delalloc conversion code is racy with respect to
changes in the currently cached file mapping outside of the current
page. This is because the ilock is cycled between the time the
caller originally looked up the mapping and across each real
allocation of the provided file range. This code has collected
various hacks over the years to help combat the symptoms of these
races (i.e., truncate race detection, allocation into hole
detection, etc.), but none address the fundamental problem that the
imap may not be valid at allocation time.
Rather than continue to use race detection hacks, update writeback
delalloc conversion to a model that explicitly converts the delalloc
extent backing the current file offset being processed. The current
file offset is the only block we can trust to remain once the ilock
is dropped because any operation that can remove the block
(truncate, hole punch, etc.) must flush and discard pagecache pages
first.
Modify xfs_iomap_write_allocate() to use the xfs_bmapi_delalloc()
mechanism to request allocation of the entire delalloc extent
backing the current offset instead of assuming the extent passed by
the caller is unchanged. Record the range specified by the caller
and apply it to the resulting allocated extent so previous checks by
the caller for COW fork overlap are not lost. Finally, overload the
bmapi delalloc flag with the range reval flag behavior since this is
the only use case for both.
This ensures that writeback always picks up the correct
and current extent associated with the page, regardless of races
with other extent modifying operations. If operating on a data fork
and the COW overlap state has changed since the ilock was cycled,
the caller revalidates against the COW fork sequence number before
using the imap for the next block.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The writeback delalloc conversion code is racy with respect to
changes in the currently cached file mapping. This stems from the
fact that the bmapi allocation code requires a file range to
allocate and the writeback conversion code assumes the range of the
currently cached mapping is still valid with respect to the fork. It
may not be valid, however, because the ilock is cycled (potentially
multiple times) between the time the cached mapping was populated
and the delalloc conversion occurs.
To facilitate a solution to this problem, create a new
xfs_bmapi_delalloc() wrapper to xfs_bmapi_write() that takes a file
(FSB) offset and attempts to allocate whatever delalloc extent backs
the offset. Use a new bmapi flag to cause xfs_bmapi_write() to set
the range based on the extent backing the bno parameter unless bno
lands in a hole. If bno does land in a hole, fall back to the
current behavior (which may result in an error or quietly skipping
holes in the specified range depending on other parameters). This
patch does not change behavior.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Now that the cached writeback mapping is explicitly invalidated on
data fork changes, the EOF trimming band-aid is no longer necessary.
Remove xfs_trim_extent_eof() as well since it has no other users.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The writeback code caches the current extent mapping across multiple
xfs_do_writepage() calls to avoid repeated lookups for sequential
pages backed by the same extent. This is known to be slightly racy
with extent fork changes in certain difficult to reproduce
scenarios. The cached extent is trimmed to within EOF to help avoid
the most common vector for this problem via speculative
preallocation management, but this is a band-aid that does not
address the fundamental problem.
Now that we have an xfs_ifork sequence counter mechanism used to
facilitate COW writeback, we can use the same mechanism to validate
consistency between the data fork and cached writeback mappings. On
its face, this is somewhat of a big hammer approach because any
change to the data fork invalidates any mapping currently cached by
a writeback in progress regardless of whether the data fork change
overlaps with the range under writeback. In practice, however, the
impact of this approach is minimal in most cases.
First, data fork changes (delayed allocations) caused by sustained
sequential buffered writes are amortized across speculative
preallocations. This means that a cached mapping won't be
invalidated by each buffered write of a common file copy workload,
but rather only on less frequent allocation events. Second, the
extent tree is always entirely in-core so an additional lookup of a
usable extent mostly costs a shared ilock cycle and in-memory tree
lookup. This means that a cached mapping reval is relatively cheap
compared to the I/O itself. Third, spurious invalidations don't
impact ioend construction. This means that even if the same extent
is revalidated multiple times across multiple writepage instances,
we still construct and submit the same size ioend (and bio) if the
blocks are physically contiguous.
Update struct xfs_writepage_ctx with a new field to hold the
sequence number of the data fork associated with the currently
cached mapping. Check the wpc seqno against the data fork when the
mapping is validated and reestablish the mapping whenever the fork
has changed since the mapping was cached. This ensures that
writeback always uses a valid extent mapping and thus prevents lost
writebacks and stale delalloc block problems.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The sequence counter in the xfs_ifork structure is only updated on
COW forks. This is because the counter is currently only used to
optimize out repetitive COW fork checks at writeback time.
Tweak the extent code to update the seq counter regardless of the
fork type in preparation for using this counter on data forks as
well.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Currently we have a few PTAGs in place allowing us to transform a filesystem
error in a BUG() call. However, we don't have a panic tag for corrupt
metadata, so introduce XFS_PTAG_VERIFIER_ERROR so that the administrator can
use the fs.xfs.panic_mask sysctl knob to convert any error detected by buffer
verifiers into a kernel panic.
Signed-off-by: Marco Benatto <mbenatto@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
[darrick: light editing of commit message]
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Remove duplicated include.
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Check extended attribute entry names for invalid characters.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Check directory entry names for invalid characters.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Fix an off-by-one error in the realtime bitmap "is used" cross-reference
helper function if the realtime extent size is a single block.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Teach scrub to flag extent maps that exceed the range that can be mapped
with a xfs_dablk_t.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
The extended attribute scrubber should abort the "read all attrs" loop
if there's a fatal signal pending on the process.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Move all the confusing dinode mapping code that's split between
xchk_iallocbt_check_cluster and xchk_iallocbt_check_cluster_ifree into
the first function so that it's clearer how we find the dinode for a
given inode.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Teach scrub how to handle the case that there are one or more inobt
records covering a given inode cluster. This fixes the operation on big
block filesystems (e.g. 64k blocks, 512 byte inodes).
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
The code to check inobt records against inode clusters is a mess of
poorly named variables and unnecessary parameters. Clean the
unnecessary inode number parameters out of _check_cluster_freemask in
favor of computing them inside the function instead of making the caller
do it. In xchk_iallocbt_check_cluster, rename the variables to make it
more obvious just what chunk_ino and cluster_ino represent.
Add a tracepoint to make it easier to track each inode cluster as we
scrub it.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Hoist the inode cluster checks out of the inobt record check loop into
a separate function in preparation for refactoring of that loop. No
functional changes here; that's in the next patch.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
On a big block filesystem, there may be multiple inobt records covering
a single inode cluster. These records obviously won't be aligned to
cluster alignment rules, and they must cover the entire cluster. Teach
scrub to check for these things.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
In xchk_iallocbt_rec, check the alignment of ir_startino by converting
the inode cluster block alignment into units of inodes instead of the
other way around (converting ir_startino to blocks). This prevents us
from tripping over off-by-one errors in ir_startino which are obscured
by the inode -> block conversion.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Make sure we never check more than XFS_INODES_PER_CHUNK inodes for any
given inobt record since there can be more than one inobt record mapped
to an inode cluster.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
When computing maximum size of filesystem possible with given number of
group descriptor blocks, we forget to include s_first_data_block into
the number of blocks. Thus for filesystems with non-zero
s_first_data_block it can happen that computed maximum filesystem size
is actually lower than current filesystem size which confuses the code
and eventually leads to a BUG_ON in ext4_alloc_group_tables() hitting on
flex_gd->count == 0. The problem can be reproduced like:
truncate -s 100g /tmp/image
mkfs.ext4 -b 1024 -E resize=262144 /tmp/image 32768
mount -t ext4 -o loop /tmp/image /mnt
resize2fs /dev/loop0 262145
resize2fs /dev/loop0 300000
Fix the problem by properly including s_first_data_block into the
computed number of filesystem blocks.
Fixes: 1c6bd7173d "ext4: convert file system to meta_bg if needed..."
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
Refuse to mount a volume read-write without a coherent Logical Volume
Integrity Descriptor, because we can't generate truly unique IDs without
one.
This fixes a bug where all inodes created on a UDF filesystem following
mount without a coherent LVID are assigned unique ID 0 which can then
confuse other UDF implementations.
Signed-off-by: Steven J. Magnani <steve@digidescorp.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Make sure the CRC and tag checksum of the Logical Volume Integrity
Descriptor are valid before the structure is written out to disk.
Otherwise, unless the filesystem is unmounted gracefully, the on-disk
LVID will be invalid - which is unnecessary filesystem damage.
Signed-off-by: Steven J. Magnani <steve@digidescorp.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Centralize timestamping and CRC/checksum updating of the in-core
Logical Volume Integrity Descriptor, in preparation for adding
a third site where this functionality is needed.
Signed-off-by: Steven J. Magnani <steve@digidescorp.com>
Signed-off-by: Jan Kara <jack@suse.cz>
A malicious/clueless root user can use EXT4_IOC_SWAP_BOOT to force a
corner casew which can lead to the file system getting corrupted.
There's no usefulness to allowing this, so just prohibit this case.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
The reason is that while swapping two inode, we swap the flags too.
Some flags such as EXT4_JOURNAL_DATA_FL can really confuse the things
since we're not resetting the address operations structure. The
simplest way to keep things sane is to restrict the flags that can be
swapped.
Signed-off-by: yangerkun <yangerkun@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
While do swap between two inode, they swap i_data without update
quota information. Also, swap_inode_boot_loader can do "revert"
somtimes, so update the quota while all operations has been finished.
Signed-off-by: yangerkun <yangerkun@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
While do swap, we should make sure there has no new dirty page since we
should swap i_data between two inode:
1.We should lock i_mmap_sem with write to avoid new pagecache from mmap
read/write;
2.Change filemap_flush to filemap_write_and_wait and move them to the
space protected by inode lock to avoid new pagecache from buffer read/write.
Signed-off-by: yangerkun <yangerkun@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Before really do swap between inode and boot inode, something need to
check to avoid invalid or not permitted operation, like does this inode
has inline data. But the condition check should be protected by inode
lock to avoid change while swapping. Also some other condition will not
change between swapping, but there has no problem to do this under inode
lock.
Signed-off-by: yangerkun <yangerkun@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
In mpage_add_bh_to_extent(), when accumulated extents length is greater
than MAX_WRITEPAGES_EXTENT_LEN or buffer head's b_stat is not equal, we
will not continue to search unmapped area for this page, but note this
page is locked, and will only be unlocked in mpage_release_unused_pages()
after ext4_io_submit, if io also is throttled by blk-throttle or similar
io qos, we will hold this page locked for a while, it's unnecessary.
I think the best fix is to refactor mpage_add_bh_to_extent() to let it
return some hints whether to unlock this page, but given that we will
improve dioread_nolock later, we can let it done later, so currently
the simple fix would just call mpage_release_unused_pages() before
ext4_io_submit().
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Now, we have already handle all cases of forgetting buffer in
jbd2_journal_forget(), the buffer should not be mapped to blockdevice
when reallocating it. So this patch remove all clean_bdev_aliases() and
clean_bdev_bh_alias() calls which were invoked by ext4 explicitly.
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
We do not unmap and clear dirty flag when forgetting a buffer without
journal or does not belongs to any transaction, so the invalid dirty
data may still be written to the disk later. It's fine if the
corresponding block is never used before the next mount, and it's also
fine that we invoke clean_bdev_aliases() related functions to unmap
the block device mapping when re-allocating such freed block as data
block. But this logic is somewhat fragile and risky that may lead to
data corruption if we forget to clean bdev aliases. So, It's better to
discard dirty data during forget time.
We have been already handled all the cases of forgetting journalled
buffer, this patch deal with the remaining two cases.
- buffer is not journalled yet,
- buffer is journalled but doesn't belongs to any transaction.
We invoke __bforget() instead of __brelese() when forgetting an
un-journalled buffer in jbd2_journal_forget(). After this patch we can
remove all clean_bdev_aliases() related calls in ext4.
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Now, we capture a data corruption problem on ext4 while we're truncating
an extent index block. Imaging that if we are revoking a buffer which
has been journaled by the committing transaction, the buffer's jbddirty
flag will not be cleared in jbd2_journal_forget(), so the commit code
will set the buffer dirty flag again after refile the buffer.
fsx kjournald2
jbd2_journal_commit_transaction
jbd2_journal_revoke commit phase 1~5...
jbd2_journal_forget
belongs to older transaction commit phase 6
jbddirty not clear __jbd2_journal_refile_buffer
__jbd2_journal_unfile_buffer
test_clear_buffer_jbddirty
mark_buffer_dirty
Finally, if the freed extent index block was allocated again as data
block by some other files, it may corrupt the file data after writing
cached pages later, such as during unmount time. (In general,
clean_bdev_aliases() related helpers should be invoked after
re-allocation to prevent the above corruption, but unfortunately we
missed it when zeroout the head of extra extent blocks in
ext4_ext_handle_unwritten_extents()).
This patch mark buffer as freed and set j_next_transaction to the new
transaction when it already belongs to the committing transaction in
jbd2_journal_forget(), so that commit code knows it should clear dirty
bits when it is done with the buffer.
This problem can be reproduced by xfstests generic/455 easily with
seeds (3246 3247 3248 3249).
Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: stable@vger.kernel.org
There is a function which clearly conveys the objective of checking
i_writecount. Additionally the usage in ext4_mb_initialize_context was
wrong, since a node would have wrongfully been reported as writable if
i_writecount had a negative value (MMAP_DENY_WRITE).
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Waiman reported that on large systems with a large amount of interrupts the
readout of /proc/stat takes a long time to sum up the interrupt
statistics. In principle this is not a problem. but for unknown reasons
some enterprise quality software reads /proc/stat with a high frequency.
The reason for this is that interrupt statistics are accounted per cpu. So
the /proc/stat logic has to sum up the interrupt stats for each interrupt.
The interrupt core provides now a per interrupt summary counter which can
be used to avoid the summation loops completely except for interrupts
marked PER_CPU which are only a small fraction of the interrupt space if at
all.
Another simplification is to iterate only over the active interrupts and
skip the potentially large gaps in the interrupt number space and just
print zeros for the gaps without going into the interrupt core in the first
place.
Waiman provided test results from a 4-socket IvyBridge-EX system (60-core
120-thread, 3016 irqs) excuting a test program which reads /proc/stat
50,000 times:
Before: 18.436s (sys 18.380s)
After: 3.769s (sys 3.742s)
Reported-by: Waiman Long <longman@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
Reviewed-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: linux-fsdevel@vger.kernel.org
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Daniel Colascione <dancol@google.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Link: https://lkml.kernel.org/r/20190208135021.013828701@linutronix.de
This series finally gets us to the point of having system calls with
64-bit time_t on all architectures, after a long time of incremental
preparation patches.
There was actually one conversion that I missed during the summer,
i.e. Deepa's timex series, which I now updated based the 5.0-rc1 changes
and review comments.
The following system calls are now added on all 32-bit architectures
using the same system call numbers:
403 clock_gettime64
404 clock_settime64
405 clock_adjtime64
406 clock_getres_time64
407 clock_nanosleep_time64
408 timer_gettime64
409 timer_settime64
410 timerfd_gettime64
411 timerfd_settime64
412 utimensat_time64
413 pselect6_time64
414 ppoll_time64
416 io_pgetevents_time64
417 recvmmsg_time64
418 mq_timedsend_time64
419 mq_timedreceiv_time64
420 semtimedop_time64
421 rt_sigtimedwait_time64
422 futex_time64
423 sched_rr_get_interval_time64
Each one of these corresponds directly to an existing system call
that includes a 'struct timespec' argument, or a structure containing
a timespec or (in case of clock_adjtime) timeval. Not included here
are new versions of getitimer/setitimer and getrusage/waitid, which
are planned for the future but only needed to make a consistent API
rather than for correct operation beyond y2038. These four system
calls are based on 'timeval', and it has not been finally decided
what the replacement kernel interface will use instead.
So far, I have done a lot of build testing across most architectures,
which has found a number of bugs. Runtime testing so far included
testing LTP on 32-bit ARM with the existing system calls, to ensure
we do not regress for existing binaries, and a test with a 32-bit
x86 build of LTP against a modified version of the musl C library
that has been adapted to the new system call interface [3].
This library can be used for testing on all architectures supported
by musl-1.1.21, but it is not how the support is getting integrated
into the official musl release. Official musl support is planned
but will require more invasive changes to the library.
Link: https://lore.kernel.org/lkml/20190110162435.309262-1-arnd@arndb.de/T/
Link: https://lore.kernel.org/lkml/20190118161835.2259170-1-arnd@arndb.de/
Link: https://git.linaro.org/people/arnd/musl-y2038.git/ [2]
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJcXf7/AAoJEGCrR//JCVInPSUP/RhsQSCKMGtONB/vVICQhwep
PybhzBSpHWFxszzTi6BEPN1zS9B069G9mDollRBYZCckyPqL/Bv6sI/vzQZdNk01
Q6Nw92OnNE1QP8owZ5TjrZhpbtopWdqIXjsbGZlloUemvuJP2JwvKovQUcn5CPTQ
jbnqU04CVyFFJYVxAnGJ+VSeWNrjW/cm/m+rhLFjUcwW7Y3aodxsPqPP6+K9hY9P
yIWfcH42WBeEWGm1RSBOZOScQl4SGCPUAhFydl/TqyEQagyegJMIyMOv9wZ5AuTT
xK644bDVmNsrtJDZDpx+J8hytXCk1LrnKzkHR/uK80iUIraF/8D7PlaPgTmEEjko
XcrywEkvkXTVU3owCm2/sbV+8fyFKzSPipnNfN1JNxEX71A98kvMRtPjDueQq/GA
Yh81rr2YLF2sUiArkc2fNpENT7EGhrh1q6gviK3FB8YDgj1kSgPK5wC/X0uolC35
E7iC2kg4NaNEIjhKP/WKluCaTvjRbvV+0IrlJLlhLTnsqbA57ZKCCteiBrlm7wQN
4csUtCyxchR9Ac2o/lj+Mf53z68Zv74haIROp18K2dL7ZpVcOPnA3XHeauSAdoyp
wy2Ek6ilNvlNB+4x+mRntPoOsyuOUGv7JXzB9JvweLWUd9G7tvYeDJQp/0YpDppb
K4UWcKnhtEom0DgK08vY
=IZVb
-----END PGP SIGNATURE-----
Merge tag 'y2038-new-syscalls' of git://git.kernel.org:/pub/scm/linux/kernel/git/arnd/playground into timers/2038
Pull y2038 - time64 system calls from Arnd Bergmann:
This series finally gets us to the point of having system calls with 64-bit
time_t on all architectures, after a long time of incremental preparation
patches.
There was actually one conversion that I missed during the summer,
i.e. Deepa's timex series, which I now updated based the 5.0-rc1 changes
and review comments.
The following system calls are now added on all 32-bit architectures using
the same system call numbers:
403 clock_gettime64
404 clock_settime64
405 clock_adjtime64
406 clock_getres_time64
407 clock_nanosleep_time64
408 timer_gettime64
409 timer_settime64
410 timerfd_gettime64
411 timerfd_settime64
412 utimensat_time64
413 pselect6_time64
414 ppoll_time64
416 io_pgetevents_time64
417 recvmmsg_time64
418 mq_timedsend_time64
419 mq_timedreceiv_time64
420 semtimedop_time64
421 rt_sigtimedwait_time64
422 futex_time64
423 sched_rr_get_interval_time64
Each one of these corresponds directly to an existing system call that
includes a 'struct timespec' argument, or a structure containing a timespec
or (in case of clock_adjtime) timeval. Not included here are new versions
of getitimer/setitimer and getrusage/waitid, which are planned for the
future but only needed to make a consistent API rather than for correct
operation beyond y2038. These four system calls are based on 'timeval', and
it has not been finally decided what the replacement kernel interface will
use instead.
So far, I have done a lot of build testing across most architectures, which
has found a number of bugs. Runtime testing so far included testing LTP on
32-bit ARM with the existing system calls, to ensure we do not regress for
existing binaries, and a test with a 32-bit x86 build of LTP against a
modified version of the musl C library that has been adapted to the new
system call interface [3]. This library can be used for testing on all
architectures supported by musl-1.1.21, but it is not how the support is
getting integrated into the official musl release. Official musl support is
planned but will require more invasive changes to the library.
Link: https://lore.kernel.org/lkml/20190110162435.309262-1-arnd@arndb.de/T/
Link: https://lore.kernel.org/lkml/20190118161835.2259170-1-arnd@arndb.de/
Link: https://git.linaro.org/people/arnd/musl-y2038.git/ [2]
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAlxfARoQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpjsgEACP8vQzbvsOZOxHKi9Vcd8ziwyjyBebNh4F
cKOx2Blgv0ReVAqLOVp9VJOJQoVQumV1btaA2YrmevxnCMpNUBpbP6G02tAqe9Z+
D75FSpZXy4UvcMSlhfc/iB/RMI06benI9LnuL7zbzIQtrbtu+OFRnO6fpQOVGLxT
Qa1wt/Rgahc48L4aHnIgPn0nyBRsEvuhC6FjI2D8akDaNiaHzwtGbpx7yDdmLNml
fCzC2uSRJ31bXsO/5/fJorinaJ56r5N8aHaINYwXDv8zd8i94nQZhITAasXub1Km
0nyuAg/fSzIdkrGmPINTKFaGYsOfRwpS4C4vagreBhzjfolPY0z9sQEQ63gZzDrd
mAjHPxLTd165OLlR/RxoMC8AjZCZ0/YQaucxUOPkaIHfth5/dy5BFaCkWyA/I7/Z
VnAyq0SqeL4hgIOGxZM0HeehKx+palNdJNZTcY7vF/7MVPuh5WM6z/FWsFa8k+ss
B9YN4wchh7I8EVbLmfz9s/eqabRWF3Agh1dE+yAKwt1KIWHaMXWZTRQnj/69fs2e
s3pwVMiiSz6K/Xnoe12nmQ4K0XeyKNROO78IIGY/Oa0Pe/hzCAaJMRMDsLp5EcJj
dxpoi1OfGHMGoqYhL6tx6Atq5f6CMDrS28k/D44DHfO7T1qQGVy1A9SY7ZCfM5+c
HKxTuRh8mg==
=tuL6
-----END PGP SIGNATURE-----
Merge tag 'for-linus-20190209' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
- NVMe pull request from Christoph, fixing namespace locking when
dealing with the effects log, and a rapid add/remove issue (Keith)
- blktrace tweak, ensuring requests with -1 sectors are shown (Jan)
- link power management quirk for a Smasung SSD (Hans)
- m68k nfblock dynamic major number fix (Chengguang)
- series fixing blk-iolatency inflight counter issue (Liu)
- ensure that we clear ->private when setting up the aio kiocb (Mike)
- __find_get_block_slow() rate limit print (Tetsuo)
* tag 'for-linus-20190209' of git://git.kernel.dk/linux-block:
blk-mq: remove duplicated definition of blk_mq_freeze_queue
Blk-iolatency: warn on negative inflight IO counter
blk-iolatency: fix IO hang due to negative inflight counter
blktrace: Show requests without sector
fs: ratelimit __find_get_block_slow() failure message.
m68k: set proper major_num when specifying module param major_num
libata: Add NOLPM quirk for SAMSUNG MZ7TE512HMHP-000L1 SSD
nvme-pci: fix rapid add remove sequence
nvme: lock NS list changes while handling command effects
aio: initialize kiocb private in case any filesystems expect it.
An ipvlan bug fix in 'net' conflicted with the abstraction away
of the IPV6 specific support in 'net-next'.
Similarly, a bug fix for mlx5 in 'net' conflicted with the flow
action conversion in 'net-next'.
Signed-off-by: David S. Miller <davem@davemloft.net>
Here are some driver core fixes for 5.0-rc6.
Well, not so much "driver core" as "debugfs". There's a lot of
outstanding debugfs cleanup patches coming in through different
subsystem trees, and in that process the debugfs core was found that it
really should return errors when something bad happens, to prevent
random files from showing up in the root of debugfs afterward. So
debugfs was fixed up to handle this properly, and then two fixes for
the relay and blk-mq code was needed as it was making invalid
assumptions about debugfs return values.
There's also a cacheinfo fix in here that resolves a tiny issue.
All of these have been in linux-next for over a week with no reported
problems.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCXF069g8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+yk0+gCgy9PTVAJR5ZbYtWTJOTdBnd7pfqMAoMuGxc+6
LLEbfSykLRxEf5SeOJun
=KP8e
-----END PGP SIGNATURE-----
Merge tag 'driver-core-5.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core fixes from Greg KH:
"Here are some driver core fixes for 5.0-rc6.
Well, not so much "driver core" as "debugfs". There's a lot of
outstanding debugfs cleanup patches coming in through different
subsystem trees, and in that process the debugfs core was found that
it really should return errors when something bad happens, to prevent
random files from showing up in the root of debugfs afterward. So
debugfs was fixed up to handle this properly, and then two fixes for
the relay and blk-mq code was needed as it was making invalid
assumptions about debugfs return values.
There's also a cacheinfo fix in here that resolves a tiny issue.
All of these have been in linux-next for over a week with no reported
problems"
* tag 'driver-core-5.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
blk-mq: protect debugfs_create_files() from failures
relay: check return of create_buf_file() properly
debugfs: debugfs_lookup() should return NULL if not found
debugfs: return error values, not NULL
debugfs: fix debugfs_rename parameter checking
cacheinfo: Keep the old value if of_property_read_u32 fails
- Fix cache coherency problem with writeback mappings
- Fix buffer deadlock when shutting fs down
- Fix a null pointer dereference when running online repair
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAlxYbyEACgkQ+H93GTRK
tOtLNg//VIiU6w1EgiQBgk8XxN3i9YQLBIzfbtgZGtLJWX8zYnGx9H4iVfZ9UDr8
eKFoPVLvqZo5mutuot+3+ps5H4z6g9BotS048FJjFoarQwtYqnG3tcFkmLDInKW1
jTPWBV/P7w+ODyPO082SyQ+Zn9pooyXkPoBbgA+vbQoqIsY5IF7VeFasrFMtRu21
PEm/CpMxK5VMly+5ceOoqtdlWvRPDfczLfzW/iDZ4Qs2itUqFA6TJo5TD7kd4A/f
yrjV6H5tWtp0uvBCBDq4W225uVUFVWC+wTrrII6qbvuDBNWfBsQ65GczYubAAu9X
kdJdY3xj/Br1dk6jLTciCjihbjJ49xaxXfLAokNkh1pjHqHyinB5ALXC0dG4o+eo
d5y5qo10zt3HZ8Kzr8753SzxRBjGbQhok+ytrBSpX8GckhAXmH5S6WZDFDh6PbJj
5PSwvL7FNbS4M/Myjl+dwk3kWLVrGV2SglOJxCCsqCZPxNzopIrNf1uazLTZV+/2
d+G7LQPXSvjK/iLfDQH/6sBIREx0nd3H/6mnmWBg/1xMD6z/Hgn8GJvAE8luRtOi
usXYcjlkSEOSwxbUC4fCo0CrPp8DOHbrEEO4pavTN+GVIYsIen0ghq/x0HfcOCEu
XguyRTYdQXnLZNo3zCqmnU4/C1W2L5Oce4IDznH8PEIgqAqXXrQ=
=XsPH
-----END PGP SIGNATURE-----
Merge tag 'xfs-5.0-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs fixes from Darrick Wong:
"Here are a handful of XFS fixes to fix a data corruption problem, a
crasher bug, and a deadlock.
Summary:
- Fix cache coherency problem with writeback mappings
- Fix buffer deadlock when shutting fs down
- Fix a null pointer dereference when running online repair"
* tag 'xfs-5.0-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: set buffer ops when repair probes for btree type
xfs: end sync buffer I/O properly on shutdown error
xfs: eof trim writeback mapping as soon as it is cached
Creating a new cache for kernfs_iattrs.
Currently, memory is allocated with kzalloc() which
always gives aligned memory. On ARM, this is 64 byte aligned.
To avoid the wastage of memory in aligning the size requested,
a new cache for kernfs_iattrs is created.
Size of struct kernfs_iattrs is 80 Bytes.
On ARM, it will come in kmalloc-128 slab.
and it will come in kmalloc-192 slab if debug info is enabled.
Extra bytes taken 48 bytes.
Total number of objects created : 4096
Total saving = 48*4096 = 192 KB
After creating new slab(When debug info is enabled) :
sh-3.2# cat /proc/slabinfo
...
kernfs_iattrs_cache 4069 4096 128 32 1 : tunables 0 0 0 : slabdata 128 128 0
...
All testing has been done on ARM target.
Signed-off-by: Ayush Mittal <ayush.m@samsung.com>
Signed-off-by: Vaneet Narang <v.narang@samsung.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This include is not needed (fs/sysfs/file.c builds just fine without
it). Remove it.
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Ondrej Mosnacek <omosnace@redhat.com>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJcXJY6AAoJECebzXlCjuG+scAP/jm9ETUxC3E9ZcVetSLs7vf1
AxHVsr3E3qf9uViq/+1NBRJ/BKE1Porzi1Uz01ie5MdY6SHWFMxvIvZrTRuEAhbc
VA9xeRcnX+7RkPWik3sepSUieVy8KgJDMxdE07HwwyzST14I1s5Ev79wo6XiYlTw
3+ZdZe19Y4owmTkbDiLsxJVsI2Y8b+9BIhZ9/ICRyFzZclnyLdO15HTDr4q7E9cw
ZEZMOoljX4cjY8cD7tqf68QECxZYm4a8Ba+L6P+oqKajq/6yUrocXA2UG65EMtIQ
LtMIdpkC+zHFagRQBC3ymqEWTpX9ED6TA4H4ZSdh6UH8NwYXHsxQUYfI4Oetju/B
iqWBbXpwg2jMNRDhXS/KdNezYcGjYGJZ03dezeSP/GwojoiKALD0iXxWU2zyq0Qs
crnhmc2j1wZZl5CFXLCYwjDjHbeH6gWGfLuzAv1Q9/jQUitQ2CpC4t1MCRdOlWBt
cqjCleF6Rd3oVk51BdYPm5OyCHyQjrrXmOsx8aHkY774p7TqsaHBjSreTBjQWhO8
wAfnr6yS4/21Hfrji52Nf4Q6UJ1FFEWB0wXJhYAHzem1RsPGSbnEDam6Bpd6Bh0d
ZxAU4spoVhezQPjF4JCmHfiAJ7rbegfltJa669rE5L1kbUYYE6TAAcOm8RTSWoPZ
iGxkg/en5XiGsOGH9AyB
=++rV
-----END PGP SIGNATURE-----
Merge tag 'nfsd-5.0-1' of git://linux-nfs.org/~bfields/linux
Pull nfsd fixes from Bruce Fields:
"Two small nfsd bugfixes for 5.0, for an RDMA bug and a file clone bug"
* tag 'nfsd-5.0-1' of git://linux-nfs.org/~bfields/linux:
svcrdma: Remove max_sge check at connect time
nfsd: Fix error return values for nfsd4_clone_file_range()
Taking a sleeping lock to _only_ increment a variable is quite the
overkill, and pretty much all users do this. Furthermore, some drivers
(ie: infiniband and scif) that need pinned semantics can go to quite
some trouble to actually delay via workqueue (un)accounting for pinned
pages when not possible to acquire it.
By making the counter atomic we no longer need to hold the mmap_sem and
can simply some code around it for pinned_vm users. The counter is 64-bit
such that we need not worry about overflows such as rdma user input
controlled from userspace.
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Christoph Lameter <cl@linux.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
dirent modification events (create/delete/move) do not carry the
child entry name/inode information. Instead, we report FAN_ONDIR
for mkdir/rmdir so user can differentiate them from creat/unlink.
This is consistent with inotify reporting IN_ISDIR with dirent events
and is useful for implementing recursive directory tree watcher.
We avoid merging dirent events referring to subdirs with dirent events
referring to non subdirs, otherwise, user won't be able to tell from a
mask FAN_CREATE|FAN_DELETE|FAN_ONDIR if it describes mkdir+unlink pair
or rmdir+create pair of events.
For backward compatibility and consistency, do not report FAN_ONDIR
to user in legacy fanotify mode (reporting fd) and report FAN_ONDIR
to user in FAN_REPORT_FID mode for all event types.
Cc: <linux-api@vger.kernel.org>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Add support for events with data type FSNOTIFY_EVENT_INODE
(e.g. create/attrib/move/delete) for inode and filesystem mark types.
The "inode" events do not carry enough information (i.e. path) to
report event->fd, so we do not allow setting a mask for those events
unless group supports reporting fid.
The "inode" events are not supported on a mount mark, because they do
not carry enough information (i.e. path) to be filtered by mount point.
The "dirent" events (create/move/delete) report the fid of the parent
directory where events took place without specifying the filename of the
child. In the future, fanotify may get support for reporting filename
information for those events.
Cc: <linux-api@vger.kernel.org>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
When event data type is FSNOTIFY_EVENT_INODE, we don't have a refernece
to the mount, so we will not be able to open a file descriptor when user
reads the event. However, if the listener has enabled reporting file
identifier with the FAN_REPORT_FID init flag, we allow reporting those
events and we use an identifier inode to encode fid.
The inode to use as identifier when reporting fid depends on the event.
For dirent modification events, we report the modified directory inode
and we report the "victim" inode otherwise.
For example:
FS_ATTRIB reports the child inode even if reported on a watched parent.
FS_CREATE reports the modified dir inode and not the created inode.
[JK: Fixup condition in fanotify_group_event_mask()]
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
All fsnotify hooks set the FS_ISDIR flag for events that happen
on directory victim inodes except for fsnotify_perm().
Add the missing FS_ISDIR flag in fsnotify_perm() hook and let
fanotify_group_event_mask() check the FS_ISDIR flag instead of
checking if path argument is a directory.
This is needed for fanotify support for event types that do not
carry path information.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
We need to report FS_ISDIR flag with MOVE_SELF and DELETE_SELF events
for fanotify, because fanotify API requires the user to explicitly
request events on directories by FAN_ONDIR flag.
inotify never reported IN_ISDIR with those events. It looks like an
oversight, but to avoid the risk of breaking existing inotify programs,
mask the FS_ISDIR flag out when reprting those events to inotify backend.
We also add the FS_ISDIR flag with FS_ATTRIB event in the case of rename
over an empty target directory. inotify did not report IN_ISDIR in this
case, but it normally does report IN_ISDIR along with IN_ATTRIB event,
so in this case, we do not mask out the FS_ISDIR flag.
[JK: Simplify the checks in fsnotify_move()]
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Wrapper around statfs() interface.
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
For FAN_REPORT_FID, we need to encode fid with fsid of the filesystem on
every event. To avoid having to call vfs_statfs() on every event to get
fsid, we store the fsid in fsnotify_mark_connector on the first time we
add a mark and on handle event we use the cached fsid.
Subsequent calls to add mark on the same object are expected to pass the
same fsid, so the call will fail on cached fsid mismatch.
If an event is reported on several mark types (inode, mount, filesystem),
all connectors should already have the same fsid, so we use the cached
fsid from the first connector.
[JK: Simplify code flow around fanotify_get_fid()
make fsid argument of fsnotify_add_mark_locked() unconditional]
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
When setting up an fanotify listener, user may request to get fid
information in event instead of an open file descriptor.
The fid obtained with event on a watched object contains the file
handle returned by name_to_handle_at(2) and fsid returned by statfs(2).
Restrict FAN_REPORT_FID to class FAN_CLASS_NOTIF, because we have have
no good reason to support reporting fid on permission events.
When setting a mark, we need to make sure that the filesystem
supports encoding file handles with name_to_handle_at(2) and that
statfs(2) encodes a non-zero fsid.
Cc: <linux-api@vger.kernel.org>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
If group requested FAN_REPORT_FID and event has file identifier,
copy that information to user reading the event after event metadata.
fid information is formatted as struct fanotify_event_info_fid
that includes a generic header struct fanotify_event_info_header,
so that other info types could be defined in the future using the
same header.
metadata->event_len includes the length of the fid information.
The fid information includes the filesystem's fsid (see statfs(2))
followed by an NFS file handle of the file that could be passed as
an argument to open_by_handle_at(2).
Cc: <linux-api@vger.kernel.org>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
When user requests the flag FAN_REPORT_FID in fanotify_init(),
a unique file identifier of the event target object will be reported
with the event.
The file identifier includes the filesystem's fsid (i.e. from statfs(2))
and an NFS file handle of the file (i.e. from name_to_handle_at(2)).
The file identifier makes holding the path reference and passing a file
descriptor to user redundant, so those are disabled in a group with
FAN_REPORT_FID.
Encode fid and store it in event for a group with FAN_REPORT_FID.
Up to 12 bytes of file handle on 32bit arch (16 bytes on 64bit arch)
are stored inline in fanotify_event struct. Larger file handles are
stored in an external allocated buffer.
On failure to encode fid, we print a warning and queue the event
without the fid information.
[JK: Fold part of later patched into this one to use
exportfs_encode_inode_fh() right away]
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
The helper is quite trivial and open coding it will make it easier
to implement copying event fid info to user.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCXFls+wAKCRDh3BK/laaZ
PC+hAQDRkyJeAmMzpHwvv/IASqpJgc6HrSzH0p201lDyARcKIAD+MWxZHYP4ltAn
WVTLIvYT1xsoqGG3plfZ/d1iNbAWcwU=
=cL/O
-----END PGP SIGNATURE-----
Merge tag 'fuse-fixes-5.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
Pull fuse fixes from Miklos Szeredi:
"A fix for a CUSE regression introduced in v4.20, as well as fixes for
a couple of old bugs"
* tag 'fuse-fixes-5.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
fuse: decrement NR_WRITEBACK_TEMP on the right page
fuse: call pipe_buf_release() under pipe lock
cuse: fix ioctl
fuse: handle zero sized retrieve correctly
A lot of system calls that pass a time_t somewhere have an implementation
using a COMPAT_SYSCALL_DEFINEx() on 64-bit architectures, and have
been reworked so that this implementation can now be used on 32-bit
architectures as well.
The missing step is to redefine them using the regular SYSCALL_DEFINEx()
to get them out of the compat namespace and make it possible to build them
on 32-bit architectures.
Any system call that ends in 'time' gets a '32' suffix on its name for
that version, while the others get a '_time32' suffix, to distinguish
them from the normal version, which takes a 64-bit time argument in the
future.
In this step, only 64-bit architectures are changed, doing this rename
first lets us avoid touching the 32-bit architectures twice.
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
The get_backchannel_cred() used to return error pointers on error but
now it returns NULL pointers.
Fixes: 97f68c6b02 ("SUNRPC: add 'struct cred *' to auth_cred and rpc_cre")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
If the parameter 'count' is non-zero, nfsd4_clone_file_range() will
currently clobber all errors returned by vfs_clone_file_range() and
replace them with EINVAL.
Fixes: 42ec3d4c02 ("vfs: make remap_file_range functions take and...")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: stable@vger.kernel.org # v4.20+
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
When something let __find_get_block_slow() hit all_mapped path, it calls
printk() for 100+ times per a second. But there is no need to print same
message with such high frequency; it is just asking for stall warning, or
at least bloating log files.
[ 399.866302][T15342] __find_get_block_slow() failed. block=1, b_blocknr=8
[ 399.873324][T15342] b_state=0x00000029, b_size=512
[ 399.878403][T15342] device loop0 blocksize: 4096
[ 399.883296][T15342] __find_get_block_slow() failed. block=1, b_blocknr=8
[ 399.890400][T15342] b_state=0x00000029, b_size=512
[ 399.895595][T15342] device loop0 blocksize: 4096
[ 399.900556][T15342] __find_get_block_slow() failed. block=1, b_blocknr=8
[ 399.907471][T15342] b_state=0x00000029, b_size=512
[ 399.912506][T15342] device loop0 blocksize: 4096
This patch reduces frequency to up to once per a second, in addition to
concatenating three lines into one.
[ 399.866302][T15342] __find_get_block_slow() failed. block=1, b_blocknr=8, b_state=0x00000029, b_size=512, device loop0 blocksize: 4096
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Userspace translates EEXIST to "File exists" which isn't a very good
error message for the problem. "Device or resource busy" is a better
indication of what went wrong.
Signed-off-by: Matthew Wilcox <willy@infradead.org>
A recent optimization had left private uninitialized.
Fixes: 2bc4ca9bb6 ("aio: don't zero entire aio_kiocb aio_get_req()")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
struct fanotify_event_info "inherits" from struct fsnotify_event and
therefore a more appropriate (and short) name for it is fanotify_event.
Same for struct fanotify_perm_event_info, which now "inherits" from
struct fanotify_event.
We plan to reuse the name struct fanotify_event_info for user visible
event info record format.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Common fsnotify_event helpers have no need for the mask field.
It is only used by backend code, so move the field out of the
abstract fsnotify_event struct and into the concrete backend
event structs.
This change packs struct inotify_event_info better on 64bit
machine and will allow us to cram some more fields into
struct fanotify_event_info.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
So far, existence of super block marks was checked only on events with
data type FSNOTIFY_EVENT_PATH. Use the super block of the "to_tell" inode
to report the events of all event types to super block marks.
This change has no effect on current backends. Soon, this will allow
fanotify backend to receive all event types on a super block mark.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
This was an example for using the SCSI OSD protocol, which we're trying
to remove.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
If tmpfiles can be made persistent, then newly created tmpfiles need to
be treated like any other new files in policy.
This patch indicates which newly created tmpfiles are in policy, causing
the file hash to be calculated on __fput().
Reported-by: Ignaz Forster <ignaz.forster@gmx.de>
[rgoldwyn@suse.com: Call ima_post_create_tmpfile() in vfs_tmpfile() as
opposed to do_tmpfile(). This will help the case for overlayfs where
copy_up is denied while overwriting a file.]
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Mimi Zohar <zohar@linux.ibm.com>
When we umount f2fs, we need to avoid long delay due to discard commands, which
is actually taking tens of seconds, if storage is very slow on UNMAP. So, this
patch introduces timeout-based work on it.
By default, let me give 5 seconds for discard.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If a file with capability set (and hence security.capability xattr) is
written kernel clears security.capability xattr. For overlay, during file
copy up if xattrs are copied up first and then data is, copied up. This
means data copy up will result in clearing of security.capability xattr
file on lower has. And this can result into surprises. If a lower file has
CAP_SETUID, then it should not be cleared over copy up (if nothing was
actually written to file).
This also creates problems with chown logic where it first copies up file
and then tries to clear setuid bit. But by that time security.capability
xattr is already gone (due to data copy up), and caller gets -ENODATA.
This has been reported by Giuseppe here.
https://github.com/containers/libpod/issues/2015#issuecomment-447824842
Fix this by copying up data first and then metadta. This is a regression
which has been introduced by my commit as part of metadata only copy up
patches.
TODO: There will be some corner cases where a file is copied up metadata
only and later data copy up happens and that will clear security.capability
xattr. Something needs to be done about that too.
Fixes: bd64e57586 ("ovl: During copy up, first copy up metadata and then data")
Cc: <stable@vger.kernel.org> # v4.19+
Reported-by: Giuseppe Scrivano <gscrivan@redhat.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
atomic_t variables are currently used to implement reference
counters with the following properties:
- counter is initialized to 1 using atomic_set()
- a resource is freed upon counter reaching zero
- once counter reaches zero, its further
increments aren't allowed
- counter schema uses basic atomic operations
(set, inc, inc_not_zero, dec_and_test, etc.)
Such atomic variables should be converted to a newly provided
refcount_t type and API that prevents accidental counter overflows
and underflows. This is important since overflows and underflows
can lead to use-after-free situation and be exploitable.
The variable sighand_struct.count is used as pure reference counter.
Convert it to refcount_t and fix up the operations.
** Important note for maintainers:
Some functions from refcount_t API defined in lib/refcount.c
have different memory ordering guarantees than their atomic
counterparts.
The full comparison can be seen in
https://lkml.org/lkml/2017/11/15/57 and it is hopefully soon
in state to be merged to the documentation tree.
Normally the differences should not matter since refcount_t provides
enough guarantees to satisfy the refcounting use cases, but in
some rare cases it might matter.
Please double check that you don't have some undocumented
memory guarantees for this variable usage.
For the sighand_struct.count it might make a difference
in following places:
- __cleanup_sighand: decrement in refcount_dec_and_test() only
provides RELEASE ordering and control dependency on success
vs. fully ordered atomic counterpart
Suggested-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: David Windsor <dwindsor@gmail.com>
Reviewed-by: Hans Liljestrand <ishkamiel@gmail.com>
Reviewed-by: Andrea Parri <andrea.parri@amarulasolutions.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: akpm@linux-foundation.org
Cc: viro@zeniv.linux.org.uk
Link: https://lkml.kernel.org/r/1547814450-18902-2-git-send-email-elena.reshetova@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In xrep_findroot_block, we work out the btree type and correctness of a
given block by calling different btree verifiers on root block
candidates. However, we leave the NULL b_ops while ->verify_read
validates the block, which means that if the verifier calls
xfs_buf_verifier_error it'll crash on the null b_ops. Fix it to set
b_ops before calling the verifier and unsetting it if the verifier
fails.
Furthermore, improve the documentation around xfs_buf_ensure_ops, which
is the function that is responsible for cleaning up the b_ops state of
buffers that go through xrep_findroot_block but don't match anything.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
As of commit e339dd8d8b ("xfs: use sync buffer I/O for sync delwri
queue submission"), the delwri submission code uses sync buffer I/O
for sync delwri I/O. Instead of waiting on async I/O to unlock the
buffer, it uses the underlying sync I/O completion mechanism.
If delwri buffer submission fails due to a shutdown scenario, an
error is set on the buffer and buffer completion never occurs. This
can cause xfs_buf_delwri_submit() to deadlock waiting on a
completion event.
We could check the error state before waiting on such buffers, but
that doesn't serialize against the case of an error set via a racing
I/O completion. Instead, invoke I/O completion in the shutdown case
regardless of buffer I/O type.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The cached writeback mapping is EOF trimmed to try and avoid races
between post-eof block management and writeback that result in
sending cached data to a stale location. The cached mapping is
currently trimmed on the validation check, which leaves a race
window between the time the mapping is cached and when it is trimmed
against the current inode size.
For example, if a new mapping is cached by delalloc conversion on a
blocksize == page size fs, we could cycle various locks, perform
memory allocations, etc. in the writeback codepath before the
associated mapping is eventually trimmed to i_size. This leaves
enough time for a post-eof truncate and file append before the
cached mapping is trimmed. The former event essentially invalidates
a range of the cached mapping and the latter bumps the inode size
such the trim on the next writepage event won't trim all of the
invalid blocks. fstest generic/464 reproduces this scenario
occasionally and causes a lost writeback and stale delalloc blocks
warning on inode inactivation.
To work around this problem, trim the cached writeback mapping as
soon as it is cached in addition to on subsequent validation checks.
This is a minor tweak to tighten the race window as much as possible
until a proper invalidation mechanism is available.
Fixes: 40214d128e ("xfs: trim writepage mapping to within eof")
Cc: <stable@vger.kernel.org> # v4.14+
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
SO_RCVTIMEO and SO_SNDTIMEO socket options use struct timeval
as the time format. struct timeval is not y2038 safe.
The subsequent patches in the series add support for new socket
timeout options with _NEW suffix that will use y2038 safe
data structures. Although the existing struct timeval layout
is sufficiently wide to represent timeouts, because of the way
libc will interpret time_t based on user defined flag, these
new flags provide a way of having a structure that is the same
for all architectures consistently.
Rename the existing options with _OLD suffix forms so that the
right option is enabled for userspace applications according
to the architecture and time_t definition of libc.
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Cc: ccaulfie@redhat.com
Cc: deller@gmx.de
Cc: paulus@samba.org
Cc: ralf@linux-mips.org
Cc: rth@twiddle.net
Cc: cluster-devel@redhat.com
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-alpha@vger.kernel.org
Cc: linux-arch@vger.kernel.org
Cc: linux-mips@vger.kernel.org
Cc: linux-parisc@vger.kernel.org
Cc: sparclinux@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAlxWtJgACgkQxWXV+ddt
WDsDow//ZpnyDwQWvSIfF2UUQOPlcBjbHKuBA7rU0wdybW635QYGR0mqnI+1VnMj
7ssUkeN6N0a2gQzrUG4Y+zpdzOWv2xQ4jKZ9GMOp9gwyzEFyPkcFXOnmM8UfYtVu
e3fK65e8BZHmTeu0kGKah4Dt1g0t4fUmhsKR4Pfp5YNJC+zuuGTwUW1K/ZQHXJ+3
kTHc7WP1lsF7wgaZ+Gl+Kvp8fVrHVdygMVTdRBW8QaBgPLa/KExvK62jW+NmCYhj
7OIkWdew7e8IXc3Ie5IbOomHAv7IgqqgiO9VO9+n0EpyV4UobUgxrgBKJ+0yc1Ya
eidbKhMslwUE50y00JVm+vw0gwQHkR+hZDn/mRB6xiIeI8tu/yQIJZ6AhYJXoByR
cu8+SNO5Z5dOZ1f146ZH8lnkr24tuSnkDUhbRDR5pdb4tAHHej2ALzhbfbwbPEpF
IverYLw5fOMGeRU/mBsjkVadpSZ4S0HVNU85ERdhLtVLK1PSaY2UkUaA+Ii5y7au
EYDjaGMflmJ8cAVqgtgedEff2n8OKDnzRZlz4IPLI73MVSITGZkM7PmYmYsLm2Bs
mDPnmyqR8kzcd1RRtSeZTvqOpAIZG+QUOmD2jiKrchmp54Sz0V/HJWRs3aQybD6Q
ph0yAcbkvgp/ewe5IFgaI0kcyH7zYdL6GtiI2WUE3/8DObgrgsA=
=E2PP
-----END PGP SIGNATURE-----
Merge tag 'for-5.0-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- regression fix: transaction commit can run away due to delayed ref
waiting heuristic, this is not necessary now because of the proper
reservation mechanism introduced in 5.0
- regression fix: potential crash due to use-before-check of an ERR_PTR
return value
- fix for transaction abort during transaction commit that needs to
properly clean up pending block groups
- fix deadlock during b-tree node/leaf splitting, when this happens on
some of the fundamental trees, we must prevent new tree block
allocation to re-enter indirectly via the block group flushing path
- potential memory leak after errors during mount
* tag 'for-5.0-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: On error always free subvol_name in btrfs_mount
btrfs: clean up pending block groups when transaction commit aborts
btrfs: fix potential oops in device_list_add
btrfs: don't end the transaction for delayed refs in throttle
Btrfs: fix deadlock when allocating tree block during leaf/node split
Merge misc fixes from Andrew Morton:
"24 fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (24 commits)
autofs: fix error return in autofs_fill_super()
autofs: drop dentry reference only when it is never used
fs/drop_caches.c: avoid softlockups in drop_pagecache_sb()
mm: migrate: don't rely on __PageMovable() of newpage after unlocking it
psi: clarify the Kconfig text for the default-disable option
mm, memory_hotplug: __offline_pages fix wrong locking
mm: hwpoison: use do_send_sig_info() instead of force_sig()
kasan: mark file common so ftrace doesn't trace it
init/Kconfig: fix grammar by moving a closing parenthesis
lib/test_kmod.c: potential double free in error handling
mm, oom: fix use-after-free in oom_kill_process
mm/hotplug: invalid PFNs from pfn_to_online_page()
mm,memory_hotplug: fix scan_movable_pages() for gigantic hugepages
psi: fix aggregation idle shut-off
mm, memory_hotplug: test_pages_in_a_zone do not pass the end of zone
mm, memory_hotplug: is_mem_section_removable do not pass the end of a zone
oom, oom_reaper: do not enqueue same task twice
mm: migrate: make buffer_migrate_page_norefs() actually succeed
kernel/exit.c: release ptraced tasks before zap_pid_ns_processes
x86_64: increase stack size for KASAN_EXTRA
...
-----BEGIN PGP SIGNATURE-----
iQGyBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAlxUf8kACgkQiiy9cAdy
T1Fy8Qv4rlgVRBPcSJpvZKhFjMd5KskVIDnP0tOc5od+Mg+547eAg7UKKnJVDgKS
OUtiP8uC/UErWvQ214hoXF1sMwoG+rDdTRdIOVDLD2a2gIIJzaIPADhYt4w5kiAX
nO9NwPwk03KTzNUQmpdFPofbLK4csmJUR4kBNE+XhZTulmw9BkplGXnNkQGjsEdf
nY6YdfqofE//DmR/KB1BylD0lxDtnfuB5zoELPmM6X4iWtD9W6pKVg23huEv+Dmh
80LdYt77vI4RKEzOAmsYEpROdAmlCksVQC1nlh5EHOuMN/9TCdpdWRahJWWhbbKr
/b8TuD+0QUUKEQdGvwM8DM/OCjnwoJphzI8CK1VzRpM4XkZxpQTzng2zVFPNzs4Y
wwKCsxxTO7ePGg6hE/DTnRn9XJtKbR1nQsIjHcw6rE0ydl+P6YRZDUEehqyRU+jd
Z09XCPS1+zeUPE6PRE9wlrIt7QdxOynufhaWRKf18b2b4cfKbNGcd+rMMZPagUQe
x2CMkEU=
=qX/z
-----END PGP SIGNATURE-----
Merge tag '5.0-rc4-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull smb3 fixes from Steve French:
"SMB3 fixes, some from this week's SMB3 test evemt, 5 for stable and a
particularly important one for queryxattr (see xfstests 70 and 117)"
* tag '5.0-rc4-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
cifs: update internal module version number
CIFS: fix use-after-free of the lease keys
CIFS: Do not consider -ENODATA as stat failure for reads
CIFS: Do not count -ENODATA as failure for query directory
CIFS: Fix trace command logging for SMB2 reads and writes
CIFS: Fix possible oops and memory leaks in async IO
cifs: limit amount of data we request for xattrs to CIFSMaxBufSize
cifs: fix computation for MAX_SMB2_HDR_SIZE
In autofs_fill_super() on error of get inode/make root dentry the return
should be ENOMEM as this is the only failure case of the called
functions.
Link: http://lkml.kernel.org/r/154725123240.11260.796773942606871359.stgit@pluto-themaw-net
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
autofs_expire_run() calls dput(dentry) to drop the reference count of
dentry. However, dentry is read via autofs_dentry_ino(dentry) after
that. This may result in a use-free-bug. The patch drops the reference
count of dentry only when it is never used.
Link: http://lkml.kernel.org/r/154725122396.11260.16053424107144453867.stgit@pluto-themaw-net
Signed-off-by: Pan Bian <bianpan2016@163.com>
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When superblock has lots of inodes without any pagecache (like is the
case for /proc), drop_pagecache_sb() will iterate through all of them
without dropping sb->s_inode_list_lock which can lead to softlockups
(one of our customers hit this).
Fix the problem by going to the slow path and doing cond_resched() in
case the process needs rescheduling.
Link: http://lkml.kernel.org/r/20190114085343.15011-1-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- fix page migration when using iomap for pagecache management
- fix a use-after-free bug in the directio code
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAlxPK1IACgkQ+H93GTRK
tOvzeA/+I4bWVmovfV+EGFzHSV6zRsy17v6c4ncMrdia41rhmvxl+sAJgj2+uFCb
J37cCMpPrmAOz+JGIW7PbCt8uzmwaXOfB9p9N58wM+hSSxtlN+wZFsIaoepOUTDK
t3e2L7QQxQjN9HXZU0RNUi/zgS3poDfzap7cZ71spBxX5hVd1zVQa0q/o5OXr7OI
sBlZLsIOhmS8WU2TmfkwzUVi+/FR4dCgyP8eDAGho/KbwvO9sfWzLNxf0U/ORWfA
JG+2LX42eKKjX6wo39zW1mAXwOBhnLnCqOOgKrDy8XRXPARiNAzLNn0AwhxJSAqD
z3qE298Oag6gZo0lNJjDXIko5D9y2koWjQe7z4fJ2JYpV5mSyq8/F4XDNO2FKIap
07p0OBGa1yfwQYmS5TrhJvYwvsHqTNs122jpowqeD3o0Xh64y2TZLEMdhhIsRjga
+S9OSVQ15JDf/QeI5LPGK6Oc6B3JnRYgrYf7g7DYu4eqEsJ3V3pqtbXzjxGkzUjx
5xf85ujuRUQoKCPZQ00ewmsfZMfOcaYqfhosx6LvR6ZPvPH3Ex3nLHks5ZV/eXfR
Auusq6XMiHb5ljukfVCa0WStntUl5gMaRha5QJy1Vg5Zd1ikvTkB5CkJFlODJ8hS
GaEIa58Gf75dkHOvm4bFEOoy2EcE3I+qdwR+0xgg+6gDlNeZZNQ=
=KYgc
-----END PGP SIGNATURE-----
Merge tag 'iomap-5.0-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull iomap fixes from Darrick Wong:
"A couple of iomap fixes to eliminate some memory corruption and hang
problems that were reported:
- fix page migration when using iomap for pagecache management
- fix a use-after-free bug in the directio code"
* tag 'iomap-5.0-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
iomap: fix a use after free in iomap_dio_rw
iomap: get/put the page in iomap_page_create/release()
Al Viro pointed out that since there is only one pipe buffer type to which
new data can be appended, it isn't necessary to have a ->can_merge field in
struct pipe_buf_operations, we can just check for a magic type.
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Before this patch, it was possible for two pipes to affect each other after
data had been transferred between them with tee():
============
$ cat tee_test.c
int main(void) {
int pipe_a[2];
if (pipe(pipe_a)) err(1, "pipe");
int pipe_b[2];
if (pipe(pipe_b)) err(1, "pipe");
if (write(pipe_a[1], "abcd", 4) != 4) err(1, "write");
if (tee(pipe_a[0], pipe_b[1], 2, 0) != 2) err(1, "tee");
if (write(pipe_b[1], "xx", 2) != 2) err(1, "write");
char buf[5];
if (read(pipe_a[0], buf, 4) != 4) err(1, "read");
buf[4] = 0;
printf("got back: '%s'\n", buf);
}
$ gcc -o tee_test tee_test.c
$ ./tee_test
got back: 'abxx'
$
============
As suggested by Al Viro, fix it by creating a separate type for
non-mergeable pipe buffers, then changing the types of buffers in
splice_pipe_to_pipe() and link_pipe().
Cc: <stable@vger.kernel.org>
Fixes: 7c77f0b3f9 ("splice: implement pipe to pipe splicing")
Fixes: 70524490ee ("[PATCH] splice: add support for sys_tee()")
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
On ppc64le, When a string with PAGE_SIZE - 1 (i.e. 64k-1) length is
passed as a "filesystem type" argument to the mount(2) syscall,
copy_mount_string() ends up allocating 64k (the PAGE_SIZE on ppc64le)
worth of space for holding the string in kernel's address space.
Later, in set_precision() (invoked by get_fs_type() ->
__request_module() -> vsnprintf()), we end up assigning
strlen(fs-type-string) i.e. 65535 as the
value to 'struct printf_spec'->precision member. This field has a width
of 16 bits and it is a signed data type. Hence an invalid value ends
up getting assigned. This causes the "WARN_ONCE(spec->precision != prec,
"precision %d too large", prec)" statement inside set_precision() to be
executed.
This commit fixes the bug by limiting the length of the string passed by
copy_mount_string() to strndup_user() to PATH_MAX.
Signed-off-by: Chandan Rajendra <chandan@linux.ibm.com>
Reported-by: Abdul Haleem <abdhalee@linux.ibm.com>
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
generic_fillattr is an optional helper that isn't used by all file
systems, move handling purely based on inode flags to vfs_getattr_nosec,
which is common code.
This fixes setting this flag for file systems not using generic_fillattr
like xfs.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The caller already initializes it to the basic stats. Just
clear not supported default bits where needed.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This issue was found when I tried to put checkpoint work in a separate thread,
the deadlock below happened:
Thread1 | Thread2
__jbd2_log_wait_for_space |
jbd2_log_do_checkpoint (hold j_checkpoint_mutex)|
if (jh->b_transaction != NULL) |
... |
jbd2_log_start_commit(journal, tid); |jbd2_update_log_tail
| will lock j_checkpoint_mutex,
| but will be blocked here.
|
jbd2_log_wait_commit(journal, tid); |
wait_event(journal->j_wait_done_commit, |
!tid_gt(tid, journal->j_commit_sequence)); |
... |wake_up(j_wait_done_commit)
} |
then deadlock occurs, Thread1 will never be waken up.
To fix this issue, drop j_checkpoint_mutex in jbd2_log_do_checkpoint()
when we are going to wait for transaction commit.
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
This reverts commit ad211f3e94.
As Jan Kara pointed out, this change was unsafe since it means we lose
the call to sync_mapping_buffers() in the nojournal case. The
original point of the commit was avoid taking the inode mutex (since
it causes a lockdep warning in generic/113); but we need the mutex in
order to call sync_mapping_buffers().
The real fix to this problem was discussed here:
https://lore.kernel.org/lkml/20181025150540.259281-4-bvanassche@acm.org
The proposed patch was to fix a syzbot complaint, but the problem can
also demonstrated via "kvm-xfstests -c nojournal generic/113".
Multiple solutions were discused in the e-mail thread, but none have
landed in the kernel as of this writing. Anyway, commit
ad211f3e94 is absolutely the wrong way to suppress the lockdep, so
revert it.
Fixes: ad211f3e94 ("ext4: use ext4_write_inode() when fsyncing w/o a journal")
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reported: Jan Kara <jack@suse.cz>
This reverts commit 2d29f6b96d.
It turns out that the fix can lead to a ~20 percent performance regression
in initial writes to the page cache according to iozone. Let's revert this
for now to have more time for a proper fix.
Cc: stable@vger.kernel.org # v3.13+
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Stable bugfix:
- Fix up return value on fatal errors in nfs_page_async_flush()
Other bugfix:
- Fix NULL pointer dereference of dev_name
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAlxTOEsACgkQ18tUv7Cl
QOuS2A//U2J1xz2N8R/k9I4puMXss+DpUAfryNRrDul0qL4tsr7UhHzHezJVl17X
coPGA/YD+voybyT+eYeACCHUhDMNN8gj2KoCMlE1ueWAbiCOxrS4NgFM2djO3lka
dlfqgSbVS1Z7+KtEEiFGq/HiF6y0WxanMBHnfhllNbXBDE6W0/+EPdgjX7fZF3FF
AS6QQmruXL/b1/hJasfTsF3wcHs3y+Y23RP85j4F8aYrcWLOyPUhhuzv/o6Zoh37
fqltMxueWy+2qpn8dBE+9ILuKnUxnIsIwpF4YFhI7XrQlqMIWYMrShiqSDqYeVUP
3qdX8LtRR2VsNCTDR9HamVtCkbi9DkJRXQA/fChVPiLA+P0W2Q2uiKsNKEijuZdl
9fvl9aIL/+glczHrZeJTKellFSEocaZ/L5gVmpM6Fk8zyFitP0+nkO40g/qou+A0
O77A+EK9v4XPe8z87kwrZhphT12QZK2oIPMAZDnjitktbuObip0Wva4w92KnIqK0
QPIN081oxNF7BnWEUESCTeqXl670lV83Xek1eVHSCTnFOI68riP1YoUQlIhujV/R
82J+y6HJYtLDj87NuJrAAXtUrtzAPDr39TJr3V2aH0kdpPajUAhkC3gLix13ORyM
cmP3K1M3U5f3HAElrywqQrGcxaYKN/Hpfb2427vEnbxieTKElVo=
=ZOXa
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-5.0-3' of git://git.linux-nfs.org/projects/anna/linux-nfs
Pull NFS client fixes from Anna Schumaker:
"This addresses two bugs, one in the error code handling of
nfs_page_async_flush() and one to fix a potential NULL pointer
dereference in nfs_parse_devname().
Stable bugfix:
- Fix up return value on fatal errors in nfs_page_async_flush()
Other bugfix:
- Fix NULL pointer dereference of dev_name"
* tag 'nfs-for-5.0-3' of git://git.linux-nfs.org/projects/anna/linux-nfs:
NFS: Fix up return value on fatal errors in nfs_page_async_flush()
nfs: Fix NULL pointer dereference of dev_name
The request buffers are freed right before copying the pointers.
Use the func args instead which are identical and still valid.
Simple reproducer (requires KASAN enabled) on a cifs mount:
echo foo > foo ; tail -f foo & rm foo
Cc: <stable@vger.kernel.org> # 4.20
Fixes: 179e44d49c ("smb3: add tracepoint for sending lease break responses to server")
Signed-off-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Paulo Alcantara <palcantara@suse.de>
When ext2 filesystem is created with 64k block size, ext2_max_size()
will return value less than 0. Also, we cannot write any file in this fs
since the sb->maxbytes is less than 0. The core of the problem is that
the size of block index tree for such large block size is more than
i_blocks can carry. So fix the computation to count with this
possibility.
File size limits computed with the new function for the full range of
possible block sizes look like:
bits file_size
10 17247252480
11 275415851008
12 2196873666560
13 2197948973056
14 2198486220800
15 2198754754560
16 2198888906752
CC: stable@vger.kernel.org
Reported-by: yangerkun <yangerkun@huawei.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Don't fetch fcaps when umount2 is called to avoid a process hang while
it waits for the missing resource to (possibly never) re-appear.
Note the comment above user_path_mountpoint_at():
* A umount is a special case for path walking. We're not actually interested
* in the inode in this situation, and ESTALE errors can be a problem. We
* simply want track down the dentry and vfsmount attached at the mountpoint
* and avoid revalidating the last component.
This can happen on ceph, cifs, 9p, lustre, fuse (gluster) or NFS.
Please see the github issue tracker
https://github.com/linux-audit/audit-kernel/issues/100
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
[PM: merge fuzz in audit_log_fcaps()]
Signed-off-by: Paul Moore <paul@paul-moore.com>
This is an eventual replacement for vfs_submount() uses. Unlike the
"mount" and "remount" cases, the users of that thing are not in VFS -
they are buried in various ->d_automount() instances and rather than
converting them all at once we introduce the (thankfully small and
simple) infrastructure here and deal with the prospective users in
afs, nfs, etc. parts of the series.
Here we just introduce a new constructor (fs_context_for_submount())
along with the corresponding enum constant to be put into fc->purpose
for those.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Replace do_remount_sb() with a function, reconfigure_super(), that's
fs_context aware. The fs_context is expected to be parameterised already
and have ->root pointing to the superblock to be reconfigured.
A legacy wrapper is provided that is intended to be called from the
fs_context ops when those appear, but for now is called directly from
reconfigure_super(). This wrapper invokes the ->remount_fs() superblock op
for the moment. It is intended that the remount_fs() op will be phased
out.
The fs_context->purpose is set to FS_CONTEXT_FOR_RECONFIGURE to indicate
that the context is being used for reconfiguration.
do_umount_root() is provided to consolidate remount-to-R/O for umount and
emergency remount by creating a context and invoking reconfiguration.
do_remount(), do_umount() and do_emergency_remount_callback() are switched
to use the new process.
[AV -- fold UMOUNT and EMERGENCY_REMOUNT in; fixes the
umount / bug, gets rid of pointless complexity]
[AV -- set ->net_ns in all cases; nfs remount will need that]
[AV -- shift security_sb_remount() call into reconfigure_super(); the callers
that didn't do security_sb_remount() have NULL fc->security anyway, so it's
a no-op for them]
Signed-off-by: David Howells <dhowells@redhat.com>
Co-developed-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Right now vfs_get_tree() calls security_sb_kern_mount() (i.e.
mount MAC) unless it gets MS_KERNMOUNT or MS_SUBMOUNT in flags.
Doing it that way is both clumsy and imprecise.
Consider the callers' tree of vfs_get_tree():
vfs_get_tree()
<- do_new_mount()
<- vfs_kern_mount()
<- simple_pin_fs()
<- vfs_submount()
<- kern_mount_data()
<- init_mount_tree()
<- btrfs_mount()
<- vfs_get_tree()
<- nfs_do_root_mount()
<- nfs4_try_mount()
<- nfs_fs_mount()
<- vfs_get_tree()
<- nfs4_referral_mount()
do_new_mount() always does need MAC (we are guaranteed that neither
MS_KERNMOUNT nor MS_SUBMOUNT will be passed there).
simple_pin_fs(), vfs_submount() and kern_mount_data() pass explicit
flags inhibiting that check. So does nfs4_referral_mount() (the
flags there are ulimately coming from vfs_submount()).
init_mount_tree() is called too early for anything LSM-related; it
doesn't matter whether we attempt those checks, they'll do nothing.
Finally, in case of btrfs_mount() and nfs_fs_mount(), doing MAC
is pointless - either the caller will do it, or the flags are
such that we wouldn't have done it either.
In other words, the one and only case when we want that check
done is when we are called from do_new_mount(), and there we
want it unconditionally.
So let's simply move it there. The superblock is still locked,
so nobody is going to get access to it (via ustat(2), etc.)
until we get a chance to apply the checks - we are free to
move them to any point up to where we drop ->s_umount (in
do_new_mount_fc()).
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Create an fs_context-aware version of do_new_mount(). This takes an
fs_context with a superblock already attached to it.
Make do_new_mount() use do_new_mount_fc() rather than do_new_mount(); this
allows the consolidation of the mount creation, check and add steps.
To make this work, mount_too_revealing() is changed to take a superblock
rather than a mount (which the fs_context doesn't have available), allowing
this check to be done before the mount object is created.
Signed-off-by: David Howells <dhowells@redhat.com>
Co-developed-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Roll the handling of subtypes into do_new_mount() and vfs_get_tree(). The
former determines any subtype string and hangs it off the fs_context; the
latter applies it.
Make do_new_mount() create, parameterise and commit an fs_context and
create a mount for itself rather than calling vfs_kern_mount().
[AV -- missing kstrdup()]
[AV -- ... and no kstrdup() if we get to setting ->s_submount - we
simply transfer it from fc, leaving NULL behind]
[AV -- constify ->s_submount, while we are at it]
Reviewed-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Create a new helper, vfs_create_mount(), that creates a detached vfsmount
object from an fs_context that has a superblock attached to it.
Almost all uses will be paired with immediately preceding vfs_get_tree();
add a helper for such combination.
Switch vfs_kern_mount() to use this.
NOTE: mild behaviour change; passing NULL as 'device name' to
something like procfs will change /proc/*/mountstats - "device none"
instead on "no device". That is consistent with /proc/mounts et.al.
[do'h - EXPORT_SYMBOL_GPL slipped in by mistake; removed]
[AV -- remove confused comment from vfs_create_mount()]
[AV -- removed the second argument]
Reviewed-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Introduce a filesystem context concept to be used during superblock
creation for mount and superblock reconfiguration for remount. This is
allocated at the beginning of the mount procedure and into it is placed:
(1) Filesystem type.
(2) Namespaces.
(3) Source/Device names (there may be multiple).
(4) Superblock flags (SB_*).
(5) Security details.
(6) Filesystem-specific data, as set by the mount options.
Accessor functions are then provided to set up a context, parameterise it
from monolithic mount data (the data page passed to mount(2)) and tear it
down again.
A legacy wrapper is provided that implements what will be the basic
operations, wrapping access to filesystems that aren't yet aware of the
fs_context.
Finally, vfs_kern_mount() is changed to make use of the fs_context and
mount_fs() is replaced by vfs_get_tree(), called from vfs_kern_mount().
[AV -- add missing kstrdup()]
[AV -- put_cred() can be unconditional - fc->cred can't be NULL]
[AV -- take legacy_validate() contents into legacy_parse_monolithic()]
[AV -- merge KERNEL_MOUNT and USER_MOUNT]
[AV -- don't unlock superblock on success return from vfs_get_tree()]
[AV -- kill 'reference' argument of init_fs_context()]
Signed-off-by: David Howells <dhowells@redhat.com>
Co-developed-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
mount_subtree() creates (and soon destroys) a temporary namespace,
so that automounts could function normally. These beasts should
never become anyone's current namespaces; they don't, but it would
be better to make prevention of that more straightforward. And
since they don't become anyone's current namespace, we don't need
to bother with reserving procfs inums for those.
Teach alloc_mnt_ns() to skip inum allocation if told so, adjust
put_mnt_ns() accordingly, make mount_subtree() use temporary
(anon) namespace. is_anon_ns() checks if a namespace is such.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The current dentry number tracking code doesn't distinguish between
positive & negative dentries. It just reports the total number of
dentries in the LRU lists.
As excessive number of negative dentries can have an impact on system
performance, it will be wise to track the number of positive and
negative dentries separately.
This patch adds tracking for the total number of negative dentries in
the system LRU lists and reports it in the 5th field in the
/proc/sys/fs/dentry-state file. The number, however, does not include
negative dentries that are in flight but not in the LRU yet as well as
those in the shrinker lists which are on the way out anyway.
The number of positive dentries in the LRU lists can be roughly found by
subtracting the number of negative dentries from the unused count.
Matthew Wilcox had confirmed that since the introduction of the
dentry_stat structure in 2.1.60, the dummy array was there, probably for
future extension. They were not replacements of pre-existing fields.
So no sane applications that read the value of /proc/sys/fs/dentry-state
will do dummy thing if the last 2 fields of the sysctl parameter are not
zero. IOW, it will be safe to use one of the dummy array entry for
negative dentry count.
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The nr_dentry_unused per-cpu counter tracks dentries in both the LRU
lists and the shrink lists where the DCACHE_LRU_LIST bit is set.
The shrink_dcache_sb() function moves dentries from the LRU list to a
shrink list and subtracts the dentry count from nr_dentry_unused. This
is incorrect as the nr_dentry_unused count will also be decremented in
shrink_dentry_list() via d_shrink_del().
To fix this double decrement, the decrement in the shrink_dcache_sb()
function is taken out.
Fixes: 4e717f5c10 ("list_lru: remove special case function list_lru_dispose_all."
Cc: stable@kernel.org
Signed-off-by: Waiman Long <longman@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The subvol_name is allocated in btrfs_parse_subvol_options and is
consumed and freed in mount_subvol. Add a free to the error paths that
don't call mount_subvol so that it is guaranteed that subvol_name is
freed when an error happens.
Fixes: 312c89fbca ("btrfs: cleanup btrfs_mount() using btrfs_mount_root()")
Cc: stable@vger.kernel.org # v4.19+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
alloc_fs_devices() can return ERR_PTR(-ENOMEM), so dereferencing its
result before the check for IS_ERR() is a bad idea.
Fixes: d1a6300282 ("btrfs: add members to fs_devices to track fsid changes")
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Lots of callers of debugfs_lookup() were just checking NULL to see if
the file/directory was found or not. By changing this in ff9fb72bc0
("debugfs: return error values, not NULL") we caused some subsystems to
easily crash.
Fixes: ff9fb72bc0 ("debugfs: return error values, not NULL")
Reported-by: syzbot+b382ba6a802a3d242790@syzkaller.appspotmail.com
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Jens Axboe <axboe@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When doing reads beyound the end of a file the server returns
error STATUS_END_OF_FILE error which is mapped to -ENODATA.
Currently we report it as a failure which confuses read stats.
Change it to not consider -ENODATA as failure for stat purposes.
Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
CC: Stable <stable@vger.kernel.org>
Currently we log success once we send an async IO request to
the server. Instead we need to analyse a response and then log
success or failure for a particular command. Also fix argument
list for read logging.
Cc: <stable@vger.kernel.org> # 4.18
Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Allocation of a page array for non-cached IO was separated from
allocation of rdata and wdata structures and this introduced memory
leaks and a possible null pointer dereference. This patch fixes
these problems.
Cc: <stable@vger.kernel.org>
Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
minus the various headers and blobs that will be part of the reply.
or else we might trigger a session reconnect.
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
The size of the fixed part of the create response is 88 bytes not 56.
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Ensure that we return the fatal error value that caused us to exit
nfs_page_async_flush().
Fixes: c373fff7bd ("NFSv4: Don't special case "launder"")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: stable@vger.kernel.org # v4.12+
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
When an error happens, debugfs should return an error pointer value, not
NULL. This will prevent the totally theoretical error where a debugfs
call fails due to lack of memory, returning NULL, and that dentry value
is then passed to another debugfs call, which would end up succeeding,
creating a file at the root of the debugfs tree, but would then be
impossible to remove (because you can not remove the directory NULL).
So, to make everyone happy, always return errors, this makes the users
of debugfs much simpler (they do not have to ever check the return
value), and everyone can rest easy.
Reported-by: Gary R Hook <ghook@amd.com>
Reported-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Reported-by: Masami Hiramatsu <mhiramat@kernel.org>
Reported-by: Michal Hocko <mhocko@kernel.org>
Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reported-by: Ulf Hansson <ulf.hansson@linaro.org>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
There is a NULL pointer dereference of dev_name in nfs_parse_devname()
The oops looks something like:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
...
RIP: 0010:nfs_fs_mount+0x3b6/0xc20 [nfs]
...
Call Trace:
? ida_alloc_range+0x34b/0x3d0
? nfs_clone_super+0x80/0x80 [nfs]
? nfs_free_parsed_mount_data+0x60/0x60 [nfs]
mount_fs+0x52/0x170
? __init_waitqueue_head+0x3b/0x50
vfs_kern_mount+0x6b/0x170
do_mount+0x216/0xdc0
ksys_mount+0x83/0xd0
__x64_sys_mount+0x25/0x30
do_syscall_64+0x65/0x220
entry_SYSCALL_64_after_hwframe+0x49/0xbe
Fix this by adding a NULL check on dev_name
Signed-off-by: Yao Liu <yotta.liu@ucloud.cn>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
When best_desc keeps NULL, best_group keeps -1, too. So we can
return best_group directly.
Signed-off-by: Liu Xiang <liu.xiang6@zte.com.cn>
Signed-off-by: Jan Kara <jack@suse.cz>
There is a plan to build the kernel with -Wimplicit-fallthrough and
these places in the code produced warnings (W=1).
This commit removes the following warnings:
fs/ext2/inode.c:1237:7: warning: this statement may fall through [-Wimplicit-fallthrough=]
fs/ext2/inode.c:1244:7: warning: this statement may fall through [-Wimplicit-fallthrough=]
Signed-off-by: Mathieu Malaterre <malat@debian.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Previously callers to btrfs_end_transaction_throttle() would commit the
transaction if there wasn't enough delayed refs space. This happens in
relocation, and if the fs is relatively empty we'll run out of delayed
refs space basically immediately, so we'll just be stuck in this loop of
committing the transaction over and over again.
This code existed because we didn't have a good feedback mechanism for
running delayed refs, but with the delayed refs rsv we do now. Delete
this throttling code and let the btrfs_start_transaction() in relocation
deal with putting pressure on the delayed refs infrastructure. With
this patch we no longer take 5 minutes to balance a metadata only fs.
Qu has submitted a fstest to catch slow balance or excessive transaction
commits. Steps to reproduce:
* create subvolume
* create many (eg. 16000) inlined files, of size 2KiB
* iteratively snapshot and touch several files to trigger metadata
updates
* start balance -m
Reported-by: Qu Wenruo <wqu@suse.com>
Fixes: 64403612b7 ("btrfs: rework btrfs_check_space_for_delayed_refs")
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
[ add tags and steps to reproduce ]
Signed-off-by: David Sterba <dsterba@suse.com>
When splitting a leaf or node from one of the trees that are modified when
flushing pending block groups (extent, chunk, device and free space trees),
we need to allocate a new tree block, which in turn can result in the need
to allocate a new block group. After allocating the new block group we may
need to flush new block groups that were previously allocated during the
course of the current transaction, which is what may cause a deadlock due
to attempts to write lock twice the same leaf or node, as when splitting
a leaf or node we are holding a write lock on it and its parent node.
The same type of deadlock can also happen when increasing the tree's
height, since we are holding a lock on the existing root while allocating
the tree block to use as the new root node.
An example trace when the deadlock happens during the leaf split path is:
[27175.293054] CPU: 0 PID: 3005 Comm: kworker/u17:6 Tainted: G W 4.19.16 #1
[27175.293942] Hardware name: Penguin Computing Relion 1900/MD90-FS0-ZB-XX, BIOS R15 06/25/2018
[27175.294846] Workqueue: btrfs-extent-refs btrfs_extent_refs_helper [btrfs]
(...)
[27175.298384] RSP: 0018:ffffab2087107758 EFLAGS: 00010246
[27175.299269] RAX: 0000000000000bbd RBX: ffff9fadc7141c48 RCX: 0000000000000001
[27175.300155] RDX: 0000000000000001 RSI: 0000000000000002 RDI: ffff9fadc7141c48
[27175.301023] RBP: 0000000000000001 R08: ffff9faeb6ac1040 R09: ffff9fa9c0000000
[27175.301887] R10: 0000000000000000 R11: 0000000000000040 R12: ffff9fb21aac8000
[27175.302743] R13: ffff9fb1a64d6a20 R14: 0000000000000001 R15: ffff9fb1a64d6a18
[27175.303601] FS: 0000000000000000(0000) GS:ffff9fb21fa00000(0000) knlGS:0000000000000000
[27175.304468] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[27175.305339] CR2: 00007fdc8743ead8 CR3: 0000000763e0a006 CR4: 00000000003606f0
[27175.306220] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[27175.307087] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[27175.307940] Call Trace:
[27175.308802] btrfs_search_slot+0x779/0x9a0 [btrfs]
[27175.309669] ? update_space_info+0xba/0xe0 [btrfs]
[27175.310534] btrfs_insert_empty_items+0x67/0xc0 [btrfs]
[27175.311397] btrfs_insert_item+0x60/0xd0 [btrfs]
[27175.312253] btrfs_create_pending_block_groups+0xee/0x210 [btrfs]
[27175.313116] do_chunk_alloc+0x25f/0x300 [btrfs]
[27175.313984] find_free_extent+0x706/0x10d0 [btrfs]
[27175.314855] btrfs_reserve_extent+0x9b/0x1d0 [btrfs]
[27175.315707] btrfs_alloc_tree_block+0x100/0x5b0 [btrfs]
[27175.316548] split_leaf+0x130/0x610 [btrfs]
[27175.317390] btrfs_search_slot+0x94d/0x9a0 [btrfs]
[27175.318235] btrfs_insert_empty_items+0x67/0xc0 [btrfs]
[27175.319087] alloc_reserved_file_extent+0x84/0x2c0 [btrfs]
[27175.319938] __btrfs_run_delayed_refs+0x596/0x1150 [btrfs]
[27175.320792] btrfs_run_delayed_refs+0xed/0x1b0 [btrfs]
[27175.321643] delayed_ref_async_start+0x81/0x90 [btrfs]
[27175.322491] normal_work_helper+0xd0/0x320 [btrfs]
[27175.323328] ? move_linked_works+0x6e/0xa0
[27175.324160] process_one_work+0x191/0x370
[27175.324976] worker_thread+0x4f/0x3b0
[27175.325763] kthread+0xf8/0x130
[27175.326531] ? rescuer_thread+0x320/0x320
[27175.327284] ? kthread_create_worker_on_cpu+0x50/0x50
[27175.328027] ret_from_fork+0x35/0x40
[27175.328741] ---[ end trace 300a1b9f0ac30e26 ]---
Fix this by preventing the flushing of new blocks groups when splitting a
leaf/node and when inserting a new root node for one of the trees modified
by the flushing operation, similar to what is done when COWing a node/leaf
from on of these trees.
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=202383
Reported-by: Eli V <eliventer@gmail.com>
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Introduce a local wait_for_completion variable to avoid an access to the
potentially freed dio struture after dropping the last reference count.
Also use the chance to document the completion behavior to make the
refcounting clear to the reader of the code.
Fixes: ff6a9292e6 ("iomap: implement direct I/O")
Reported-by: Chandan Rajendra <chandan@linux.ibm.com>
Reported-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Chandan Rajendra <chandan@linux.ibm.com>
Tested-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
migrate_page_move_mapping() expects pages with private data set to have
a page_count elevated by 1. This is what used to happen for xfs through
the buffer_heads code before the switch to iomap in commit 82cb14175e
("xfs: add support for sub-pagesize writeback without buffer_heads").
Not having the count elevated causes move_pages() to fail on memory
mapped files coming from xfs.
Make iomap compatible with the migrate_page_move_mapping() assumption by
elevating the page count as part of iomap_page_create() and lowering it
in iomap_page_release().
It causes the move_pages() syscall to misbehave on memory mapped files
from xfs. It does not not move any pages, which I suppose is "just" a
perf issue, but it also ends up returning a positive number which is out
of spec for the syscall. Talking to Michal Hocko, it sounds like
returning positive numbers might be a necessary update to move_pages()
anyway though.
Fixes: 82cb14175e ("xfs: add support for sub-pagesize writeback without buffer_heads")
Signed-off-by: Piotr Jaroszynski <pjaroszynski@nvidia.com>
[hch: actually get/put the page iomap_migrate_page() to make it work
properly]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAlxLV8IACgkQiiy9cAdy
T1HRIgv/fieDI6HBXAoK9mXzVLJXCZ1IxVZTygC3Xn/bCaxHerS+QSe88w0r1QTZ
KLCoT7pBBy06uTowX4/cAF0IYTStsYWngBsH3+HL4EEchkZAVe02wAJ8QOCXs9YU
UjoMvNLmcyrzwNt8EduNM2jUUpOL8EHiZOQKUvLt1Vdo3xBtyB3Nkttji92YLOQb
Cxs0RJdKNaalMphOQwAtbdbZwlL79KGSA0gMm9W2VD744YKMjQ3uGXaTuoe6+w19
voYKVTSelgyZg51yzF20x1Pl3OV/8L6tWgN1BIcKZbfGBshJhhcD93Pef7bmhxNS
Y2yF3SPr/zcdgEkwigVlNs1uBCBBeU04/yQr+YvJwNcjsRTcTKOMuquxq4gXM1P8
7f8gA1u0n1H70YKK29sd3sl7Iu8rHVacwlxjfoAcCjMe0VQvOJtR+V/zwP2uWr8T
IjX05unBe/B1v+NXqNdkOpE6ZaiJtt2ny5NDZhfeubOouBLsey7/g/gSx48ICrq+
U20kKsb6
=CCH2
-----END PGP SIGNATURE-----
Merge tag '5.0-rc3-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull smb3 fixes from Steve French:
"A set of small smb3 fixes, some fixing various crediting issues
discovered during xfstest runs, five for stable"
* tag '5.0-rc3-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
cifs: print CIFSMaxBufSize as part of /proc/fs/cifs/DebugData
smb3: add credits we receive from oplock/break PDUs
CIFS: Fix mounts if the client is low on credits
CIFS: Do not assume one credit for async responses
CIFS: Fix credit calculations in compound mid callback
CIFS: Fix credit calculation for encrypted reads with errors
CIFS: Fix credits calculations for reads with errors
CIFS: Do not reconnect TCP session in add_credits()
smb3: Cleanup license mess
CIFS: Fix possible hang during async MTU reads and writes
cifs: fix memory leak of an allocated cifs_ntsd structure
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAlxLdgsQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpoVGD/4sGYQqfiXogQIJYbPH2RRPrJuLIIITjiAv
lPXX1wx/tvz/ktwKJE2OiWTES0JjdH1HlC+7/0L/fLb8DXiKBmFuUHlwhureoL9Y
o//BIQKuaje35kyHITTy2UAJOOqnNtJTaAP2AfkL+eOcj/V/G5rJIfLGs9QtuAR7
sRJ+uhg1EbW/+uO0bULDmG6WUjxFu8mqcw3i6g0VVLVOnXB2EKcZTl3KPrdAXrUp
XtmouERga6OfAUSJyZmTSV136mL+opRB2WFFVeIzjQfLmyItDGbSX/YPS8oJ2pow
v7630F+CMrd4aKpqqtnAhfWpGqd0Xw7cYfZ9MKTJmZPmGzf9a1fQFpmgZosD4Dh3
7MrhboU4TUt9PdXESA7CmE7LkTp99ghfj5/ysKrSV5h3HsH2RbLxJk91Rx3vmAWD
u1xWRYL+GYLH6ZwOLvM1iqBrrLN3mUyrx98SaMgoXuqNzmQmgz9LPeA0Gt09FJbo
uj+ebg4dRwuThjni4xQhl3zL2RQy7nlTDFDdKOz/XoiYk2NUVksss+sxGjNarHj0
b5pCD4HOp57OreGExaOARpBRah5HSNdQtBRsIOnbyEq6f/e1LsIY23Z9nNF0deGO
sZzgsbnsn+zg8bC6T/Gk4UY6XdCcgaS3SL04SVKAE3lO6A4C/Awo8DgD9Bl1zpC1
HQlNkl5fBg==
=iucY
-----END PGP SIGNATURE-----
Merge tag 'for-linus-20190125' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
"A collection of fixes for this release. This contains:
- Silence sparse rightfully complaining about non-static wbt
functions (Bart)
- Fixes for the zoned comments/ioctl documentation (Damien)
- direct-io fix that's been lingering for a while (Ernesto)
- cgroup writeback fix (Tejun)
- Set of NVMe patches for nvme-rdma/tcp (Sagi, Hannes, Raju)
- Block recursion tracking fix (Ming)
- Fix debugfs command flag naming for a few flags (Jianchao)"
* tag 'for-linus-20190125' of git://git.kernel.dk/linux-block:
block: Fix comment typo
uapi: fix ioctl documentation
blk-wbt: Declare local functions static
blk-mq: fix the cmd_flag_name array
nvme-multipath: drop optimization for static ANA group IDs
nvmet-rdma: fix null dereference under heavy load
nvme-rdma: rework queue maps handling
nvme-tcp: fix timeout handler
nvme-rdma: fix timeout handler
writeback: synchronize sync(2) against cgroup writeback membership switches
block: cover another queue enter recursion via BIO_QUEUE_ENTERED
direct-io: allow direct writes to empty inodes
loginuid and sessionid (and audit_log_session_info) should be part of
CONFIG_AUDIT scope and not CONFIG_AUDITSYSCALL since it is used in
CONFIG_CHANGE, ANOM_LINK, FEATURE_CHANGE (and INTEGRITY_RULE), none of
which are otherwise dependent on AUDITSYSCALL.
Please see github issue
https://github.com/linux-audit/audit-kernel/issues/104
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
[PM: tweaked subject line for better grep'ing]
Signed-off-by: Paul Moore <paul@paul-moore.com>
debugfs_rename() needs to check that the dentries passed into it really
are valid, as sometimes they are not (i.e. if the return value of
another debugfs call is passed into this one.) So fix this up by
properly checking if the two parent directories are errors (they are
allowed to be NULL), and if the dentry to rename is not NULL or an
error.
Cc: stable <stable@vger.kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
CRYPTO_TFM_REQ_WEAK_KEY confuses newcomers to the crypto API because it
sounds like it is requesting a weak key. Actually, it is requesting
that weak keys be forbidden (for algorithms that have the notion of
"weak keys"; currently only DES and XTS do).
Also it is only one letter away from CRYPTO_TFM_RES_WEAK_KEY, with which
it can be easily confused. (This in fact happened in the UX500 driver,
though just in some debugging messages.)
Therefore, make the intent clear by renaming it to
CRYPTO_TFM_REQ_FORBID_WEAK_KEYS.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Was helpful in debug for some recent problems.
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Otherwise we gradually leak credits leading to potential
hung session.
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
CC: Stable <stable@vger.kernel.org>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
If the server doesn't grant us at least 3 credits during the mount
we won't be able to complete it because query path info operation
requires 3 credits. Use the cached file handle if possible to allow
the mount to succeed.
Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
If we don't receive a response we can't assume that the server
granted one credit. Assume zero credits in such cases.
Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
The current code doesn't do proper accounting for credits
in SMB1 case: it adds one credit per response only if we get
a complete response while it needs to return it unconditionally.
Fix this and also include malformed responses for SMB2+ into
accounting for credits because such responses have Credit
Granted field, thus nothing prevents to get a proper credit
value from them.
Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
We do need to account for credits received in error responses
to read requests on encrypted sessions.
Cc: <stable@vger.kernel.org>
Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Currently we mark MID as malformed if we get an error from server
in a read response. This leads to not properly processing credits
in the readv callback. Fix this by marking such a response as
normal received response and process it appropriately.
Cc: <stable@vger.kernel.org>
Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
When executing add_credits() we currently call cifs_reconnect()
if the number of credits is zero and there are no requests in
flight. In this case we may call cifs_reconnect() recursively
twice and cause memory corruption given the following sequence
of functions:
mid1.callback() -> add_credits() -> cifs_reconnect() ->
-> mid2.callback() -> add_credits() -> cifs_reconnect().
Fix this by avoiding to call cifs_reconnect() in add_credits()
and checking for zero credits in the demultiplex thread.
Cc: <stable@vger.kernel.org>
Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
d_delete only unhashes an entry if it is reached with
dentry->d_lockref.count != 1. Prior to commit 8ead9dd547 ("devpts:
more pty driver interface cleanups"), d_delete was called on a dentry
from devpts_pty_kill with two references held, which would trigger the
unhashing, and the subsequent dputs would release it.
Commit 8ead9dd547 reworked devpts_pty_kill to stop acquiring the second
reference from d_find_alias, and the d_delete call left the dentries
still on the hashed list without actually ever being dropped from dcache
before explicit cleanup. This causes the number of negative dentries for
devpts to pile up, and an `ls /dev/pts` invocation can take seconds to
return.
Provide always_delete_dentry() from simple_dentry_operations
as .d_delete for devpts, to make the dentry be dropped from dcache.
Without this cleanup, the number of dentries in /dev/pts/ can be grown
arbitrarily as:
`python -c 'import pty; pty.spawn(["ls", "/dev/pts"])'`
A systemtap probe on dcache_readdir to count d_subdirs shows this count
to increase with each pty spawn invocation above:
probe kernel.function("dcache_readdir") {
subdirs = &@cast($file->f_path->dentry, "dentry")->d_subdirs;
p = subdirs;
p = @cast(p, "list_head")->next;
i = 0
while (p != subdirs) {
p = @cast(p, "list_head")->next;
i = i+1;
}
printf("number of dentries: %d\n", i);
}
Fixes: 8ead9dd547 ("devpts: more pty driver interface cleanups")
Signed-off-by: Varad Gautam <vrd@amazon.de>
Reported-by: Zheng Wang <wanz@amazon.de>
Reported-by: Brandon Schwartz <bsschwar@amazon.de>
Root-caused-by: Maximilian Heyne <mheyne@amazon.de>
Root-caused-by: Nicolas Pernas Maradei <npernas@amazon.de>
CC: David Woodhouse <dwmw@amazon.co.uk>
CC: Maximilian Heyne <mheyne@amazon.de>
CC: Stefan Nuernberger <snu@amazon.de>
CC: Amit Shah <aams@amazon.de>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
CC: Al Viro <viro@ZenIV.linux.org.uk>
CC: Christian Brauner <christian.brauner@ubuntu.com>
CC: Eric W. Biederman <ebiederm@xmission.com>
CC: Matthew Wilcox <willy@infradead.org>
CC: Eric Biggers <ebiggers@google.com>
CC: <stable@vger.kernel.org> # 4.9+
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAlxJ6LUACgkQnJ2qBz9k
QNn75Qf/U60RPju54Kdyepo4iMFoQXS/VJg87bpRetUdN9vKU3CbnyYAx43ftPOk
GH+UU2u+24DX2AWbzy4yx2CJP1YUUF1oa2QkSW2P9WWn1NYU3goWha1Il8bc96Kq
TQU/ypl83RoaFtw4JwpL27ot39UH5WWafYiackqzOYIuAdpgei2aaZFNvVeb9Ije
9udt7ygcTn94DVdYAFI6NeH5KAA+kzpIyWWRCWlmESsNYnCUQntSLm3tQG8sTHZt
sXk2Kp1r7XFl/zkg9iFSOhSIIyVizSC4cHrE0iJ/0kLgoJjfjbOa1jZ5Eeq8yTc3
UsXI5ICNS3RuZWwSWkh759JnhQ1R/A==
=cB63
-----END PGP SIGNATURE-----
Merge tag 'fsnotify_for_v5.0-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull inotify fix from Jan Kara:
"Fix a file refcount leak in an inotify error path"
* tag 'fsnotify_for_v5.0-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
inotify: Fix fd refcount leak in inotify_add_watch().
Precise and non-ambiguous license information is important. The recently
added aegis header file has a SPDX license identifier, which is nice, but
at the same time it has a contradictionary license boiler plate text.
SPDX-License-Identifier: GPL-2.0
versus
* This program is free software; you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation; either version 2 of the License, or
* (at your option) any later version.
Oh well.
Assuming that the SPDX identifier is correct and according to x86/hyper-v
contributions from Microsoft GPL V2 only is the usual license.
Remove the boiler plate as it is wrong and even if correct it is redundant.
Fixes: eccb4422cf ("smb3: Add ftrace tracepoints for improved SMB3 debugging")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Steve French <sfrench@samba.org>
Cc: linux-cifs@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
When doing MTU i/o we need to leave some credits for
possible reopen requests and other operations happening
in parallel. Currently we leave 1 credit which is not
enough even for reopen only: we need at least 2 credits
if durable handle reconnect fails. Also there may be
other operations at the same time including compounding
ones which require 3 credits at a time each. Fix this
by leaving 8 credits which is big enough to cover most
scenarios.
Was able to reproduce this when server was configured
to give out fewer credits than usual.
The proper fix would be to reconnect a file handle first
and then obtain credits for an MTU request but this leads
to bigger code changes and should happen in other patches.
Cc: <stable@vger.kernel.org>
Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
The call to SMB2_queary_acl can allocate memory to pntsd and also
return a failure via a call to SMB2_query_acl (and then query_info).
This occurs when query_info allocates the structure and then in
query_info the call to smb2_validate_and_copy_iov fails. Currently the
failure just returns without kfree'ing pntsd hence causing a memory
leak.
Currently, *data is allocated if it's not already pointing to a buffer,
so it needs to be kfree'd only if was allocated in query_info, so the
fix adds an allocated flag to track this. Also set *dlen to zero on
an error just to be safe since *data is kfree'd.
Also set errno to -ENOMEM if the allocation of *data fails.
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Dan Carpener <dan.carpenter@oracle.com>
Currently, trying to rename or link a regular file, directory, or
symlink into an encrypted directory fails with EPERM when the source
file is unencrypted or is encrypted with a different encryption policy,
and is on the same mountpoint. It is correct for the operation to fail,
but the choice of EPERM breaks tools like 'mv' that know to copy rather
than rename if they see EXDEV, but don't know what to do with EPERM.
Our original motivation for EPERM was to encourage users to securely
handle their data. Encrypting files by "moving" them into an encrypted
directory can be insecure because the unencrypted data may remain in
free space on disk, where it can later be recovered by an attacker.
It's much better to encrypt the data from the start, or at least try to
securely delete the source data e.g. using the 'shred' program.
However, the current behavior hasn't been effective at achieving its
goal because users tend to be confused, hack around it, and complain;
see e.g. https://github.com/google/fscrypt/issues/76. And in some cases
it's actually inconsistent or unnecessary. For example, 'mv'-ing files
between differently encrypted directories doesn't work even in cases
where it can be secure, such as when in userspace the same passphrase
protects both directories. Yet, you *can* already 'mv' unencrypted
files into an encrypted directory if the source files are on a different
mountpoint, even though doing so is often insecure.
There are probably better ways to teach users to securely handle their
files. For example, the 'fscrypt' userspace tool could provide a
command that migrates unencrypted files into an encrypted directory,
acting like 'shred' on the source files and providing appropriate
warnings depending on the type of the source filesystem and disk.
Receiving errors on unimportant files might also force some users to
disable encryption, thus making the behavior counterproductive. It's
desirable to make encryption as unobtrusive as possible.
Therefore, change the error code from EPERM to EXDEV so that tools
looking for EXDEV will fall back to a copy.
This, of course, doesn't prevent users from still doing the right things
to securely manage their files. Note that this also matches the
behavior when a file is renamed between two project quota hierarchies;
so there's precedent for using EXDEV for things other than mountpoints.
xfstests generic/398 will require an update with this change.
[Rewritten from an earlier patch series by Michael Halcrow.]
Cc: Michael Halcrow <mhalcrow@google.com>
Cc: Joe Richey <joerichey@google.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
In order to have a common code base for fscrypt "post read" processing
for all filesystems which support encryption, this commit removes
filesystem specific build config option (e.g. CONFIG_EXT4_FS_ENCRYPTION)
and replaces it with a build option (i.e. CONFIG_FS_ENCRYPTION) whose
value affects all the filesystems making use of fscrypt.
Reviewed-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
This commit removes the f2fs specific f2fs_encrypted_inode() and makes
use of the generic IS_ENCRYPTED() macro to check for the encryption
status of an inode.
Acked-by: Chao Yu <yuchao0@huawei.com>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
This commit removes the ext4 specific ext4_encrypted_inode() and makes
use of the generic IS_ENCRYPTED() macro to check for the encryption
status of an inode.
Reviewed-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
fscrypt doesn't use the CTR mode of operation for anything, so there's
no need to select CRYPTO_CTR. It was added by commit 71dea01ea2
("ext4 crypto: require CONFIG_CRYPTO_CTR if ext4 encryption is
enabled"). But, I've been unable to identify the arm64 crypto bug it
was supposedly working around.
I suspect the issue was seen only on some old Android device kernel
(circa 3.10?). So if the fix wasn't mistaken, the real bug is probably
already fixed. Or maybe it was actually a bug in a non-upstream crypto
driver.
So, remove the dependency. If it turns out there's actually still a
bug, we'll fix it properly.
Signed-off-by: Eric Biggers <ebiggers@google.com>
When calling debugfs functions, there is no need to ever check the
return value. The function can work or not, but the code logic should
never do something different based on this.
There is no need to save the dentries for the debugfs files, so drop
those variables to save a bit of space and make the code simpler.
Cc: Bob Peterson <rpeterso@redhat.com>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Cc: cluster-devel@redhat.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
In order to record direct IO count, we add two additional type in
enum count_type: F2FS_DIO_{WRITE,READ}, but those IO won't dirty
filesystem metadata, so we don't need to set filesystem dirty in
inc_page_count(), fix it.
Fixes: 02b16d0a34 ("f2fs: add to account direct IO")
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
As Dan Carpenter as below:
The patch df634f444ee9: "f2fs: use rb_*_cached friends" from Oct 4,
2018, leads to the following static checker warning:
fs/f2fs/extent_cache.c:606 f2fs_update_extent_tree_range()
error: uninitialized symbol 'leftmost'.
And also Eric Biggers, and Kyungtae Kim reported, there is an UBSAN
warning described as below:
We report a bug in linux-4.20.2: "UBSAN: Undefined behaviour in
fs/f2fs/extent_cache.c"
kernel config: https://kt0755.github.io/etc/config_v4.20_stable
repro: https://kt0755.github.io/etc/repro.4a3e7.c (f2fs is mounted on
/mnt/f2fs/)
This arose in f2fs_update_extent_tree_range (fs/f2fs/extent_cache.c:605).
It seems that, for some reason, its last argument became "24"
although that was supposed to be bool type.
=========================================
UBSAN: Undefined behaviour in fs/f2fs/extent_cache.c:605:4
load of value 24 is not a valid value for type '_Bool'
CPU: 0 PID: 6774 Comm: syz-executor5 Not tainted 4.20.2 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0xb1/0x118 lib/dump_stack.c:113
ubsan_epilogue+0x12/0x94 lib/ubsan.c:159
__ubsan_handle_load_invalid_value+0x17a/0x1be lib/ubsan.c:457
f2fs_update_extent_tree_range+0x1d4a/0x1d50 fs/f2fs/extent_cache.c:605
f2fs_update_extent_cache+0x2b6/0x350 fs/f2fs/extent_cache.c:804
f2fs_update_data_blkaddr+0x61/0x70 fs/f2fs/data.c:656
f2fs_outplace_write_data+0x1d6/0x4b0 fs/f2fs/segment.c:3140
f2fs_convert_inline_page+0x86d/0x2060 fs/f2fs/inline.c:163
f2fs_convert_inline_inode+0x6b5/0xad0 fs/f2fs/inline.c:208
f2fs_preallocate_blocks+0x78b/0xb00 fs/f2fs/data.c:982
f2fs_file_write_iter+0x31b/0xf40 fs/f2fs/file.c:3062
call_write_iter include/linux/fs.h:1857 [inline]
new_sync_write fs/read_write.c:474 [inline]
__vfs_write+0x538/0x6e0 fs/read_write.c:487
vfs_write+0x1b3/0x520 fs/read_write.c:549
ksys_write+0xde/0x1c0 fs/read_write.c:598
__do_sys_write fs/read_write.c:610 [inline]
__se_sys_write fs/read_write.c:607 [inline]
__x64_sys_write+0x7e/0xc0 fs/read_write.c:607
do_syscall_64+0xbe/0x4f0 arch/x86/entry/common.c:290
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x4497b9
Code: e8 8c 9f 02 00 48 83 c4 18 c3 0f 1f 80 00 00 00 00 48 89 f8 48
89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d
01 f0 ff ff 0f 83 9b 6b fc ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007f1ea15edc68 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 00007f1ea15ee6cc RCX: 00000000004497b9
RDX: 0000000000001000 RSI: 0000000020000140 RDI: 0000000000000013
RBP: 000000000071bea0 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00000000ffffffff
R13: 000000000000bb50 R14: 00000000006f4bf0 R15: 00007f1ea15ee700
=========================================
As I checked, this uninitialized variable won't cause extent cache
corruption, but in order to avoid such kind of warning of both UBSAN
and smatch, fix to initialize related variable.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Reported-by: Eric Biggers <ebiggers@google.com>
Reported-by: Kyungtae Kim <kt0755@gmail.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Dentry bitmap is not enough to detect incorrect dentries. So this patch
also checks the namelen value of a dentry.
Signed-off-by: Gong Chen <gongchen4@huawei.com>
Signed-off-by: Sheng Yong <shengyong1@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
While traversing dirents in f2fs_fill_dentries(), if bitmap is valid,
filename length should not be zero, otherwise, directory structure
consistency could be corrupted, in this case, let's print related
info and set SBI_NEED_FSCK to trigger fsck for repairing.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAlxFDv0eHHRvcnZhbGRz
QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGBPsH/3Ij47fut8kwxGSX
Tmx7Y+VYftRiKSwK3+HxsCvde3scqfkxAukb3HeJDzZdpnouT0k4nqUYQabAANi/
MdaO+NSBRp/NjzZcpFG9QAroIQ2G2sRQ4E8ldFcNmdsjZWlUfKIHPfYHzvvc06L4
MhvdkpMa/p51Jz9egQs0kfSvrb6fh4OEDTI19/aaGR0oJBhoGhLrqTI+vdYhMiyO
wWtUXgZfsmlCBdAQLRh04CxGTc/32VApoB/SwP9sF+xD3gcL0mPFNKUociio6K2Y
a7u7yuzUKvVwuafVgX9QT+f+je5/5u+WFsG/26cfXzizZoNWW5oDl3sBD3hRNkvt
J13lB1w=
=ch+/
-----END PGP SIGNATURE-----
Merge tag 'v5.0-rc3' into next-general
Sync to Linux 5.0-rc3 to pull in the VFS changes which impacted a lot
of the LSM code.
sync_inodes_sb() can race against cgwb (cgroup writeback) membership
switches and fail to writeback some inodes. For example, if an inode
switches to another wb while sync_inodes_sb() is in progress, the new
wb might not be visible to bdi_split_work_to_wbs() at all or the inode
might jump from a wb which hasn't issued writebacks yet to one which
already has.
This patch adds backing_dev_info->wb_switch_rwsem to synchronize cgwb
switch path against sync_inodes_sb() so that sync_inodes_sb() is
guaranteed to see all the target wbs and inodes can't jump wbs to
escape syncing.
v2: Fixed misplaced rwsem init. Spotted by Jiufei.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Jiufei Xue <xuejiufei@gmail.com>
Link: http://lkml.kernel.org/r/dc694ae2-f07f-61e1-7097-7c8411cee12d@gmail.com
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
On a DIO_SKIP_HOLES filesystem, the ->get_block() method is currently
not allowed to create blocks for an empty inode. This confusion comes
from trying to bit shift a negative number, so check the size of the
inode first.
The problem is most visible for hfsplus, because the fallback to
buffered I/O doesn't happen and the write fails with EIO. This is in
part the fault of the module, because it gives a wrong return value on
->get_block(); that will be fixed in a separate patch.
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When calling debugfs functions, there is no need to ever check the
return value. The function can work or not, but the code logic should
never do something different based on this.
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Cc: linux-f2fs-devel@lists.sourceforge.net
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When setting the first xattr, we automatically enable
EXT2_FEATURE_COMPAT_EXT_ATTR. However we forget to call
ext2_update_dynamic_rev() so in theory if the filesystem was created as
ancient one without features support, this could be missed. The
consequences are minor anyway - since the feature is compat one, only
old e2fsck which does not understand xattrs could do something bad.
Reported-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Jan Kara <jack@suse.cz>
The case of (EXT2_INODE_SIZE(sb) == 0) is included in
(sbi->s_inode_size < EXT2_GOOD_OLD_INODE_SIZE).
So there is no need to check again.
Signed-off-by: Liu Xiang <liu.xiang6@zte.com.cn>
Signed-off-by: Jan Kara <jack@suse.cz>
Set proper return code when failing from allocating
memory in ext2_fill_super().
Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Signed-off-by: Jan Kara <jack@suse.cz>
debugfs_use_file_start() and debugfs_use_file_finish() do not exist
since commit c9afbec270 ("debugfs: purge obsolete SRCU based removal
protection"); tweak debugfs_create_file_unsafe() comment.
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
In ramoops_register_dummy() dummy_data is allocated via kzalloc()
then it will always occupy the heap space after register platform
device via platform_device_register_data(), but it will not be
used any more. So let's free it for system usage, replace it with
stack memory is better due to small size.
Signed-off-by: Yue Hu <huyue2@yulong.com>
[kees: add required memset and adjust sizeof() argument]
Signed-off-by: Kees Cook <keescook@chromium.org>
Deduplicate the ext2 file type conversion implementation and remove
EXT2_FT_* definitions - file systems that use the same file types as
defined by POSIX do not need to define their own versions and can
use the common helper functions decared in fs_types.h and implemented
in fs_types.c
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Phillip Potter <phil@philpotter.co.uk>
Signed-off-by: Jan Kara <jack@suse.cz>
Many file systems use a copy&paste implementation
of dirent to on-disk file type conversions.
Create a common implementation to be used by file systems
with some useful conversion helpers to reduce open coded
file type conversions in file system code.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Phillip Potter <phil@philpotter.co.uk>
Signed-off-by: Jan Kara <jack@suse.cz>
Precise and non-ambiguous license information is important. The recently
added quota.c file has a SPDX license identifier, which is nice, but
at the same time it has a contradictionary license boiler plate text.
SPDX-License-Identifier: GPL-2.0
versus
* This program is free software; you can redistribute it and/or
* modify it under the terms of the GNU General Public License
* as published by the Free Software Foundation; either version 2
* of the License, or (at your option) any later version.
Oh well.
As the other ceph related files are licensed under the GPL v2 only, it's
assumed that the SPDX id is correct and the boiler plate was randomly
copied into that patch.
Remove the boiler plate as it is wrong and even if correct it is redundant.
Fixes: fb18a57568 ("ceph: quota: add initial infrastructure to support cephfs quotas")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Luis Henriques <lhenriques@suse.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: "Yan, Zheng" <zyan@redhat.com>
Cc: Sage Weil <sage@redhat.com>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: ceph-devel@vger.kernel.org
Acked-by: Luis Henriques <lhenriques@suse.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
snap realm and corresponding inode have pointers to each other.
The two pointer should get clear at the same time. Otherwise,
snap realm's pointer may reference freed inode.
Cc: stable@vger.kernel.org # 4.17+
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Luis Henriques <lhenriques@suse.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
- Fix console ramoops to show the previous boot logs (Sai Prakash Ranjan)
- Avoid allocation and leak of platform data
-----BEGIN PGP SIGNATURE-----
Comment: Kees Cook <kees@outflux.net>
iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAlxE/QYWHGtlZXNjb29r
QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJsVXD/94pJsNHogOle3XnZJst/1cV4+R
KTyQ6QQ1184lI5FPGJliV3veCuKnTY4Q7e5vCl2mzEvds+VRxu2yUm+LFxKVj04W
yiRNv8Yn3+3eKxNufqLa9ySbsSZyIpoCKQ/20nbC5cNM3r+Gq12GUnDBixtGdl9V
fB316emIaKR70Qvi9KlfwDTcjh+SWBR2u8L0YsmTaoJkjtaCJhekN3hfya6Wnjf5
1sj1fCDzteda/Ld0gXoImHPvt0UcHPJDp1+ZhY0ja64QIuoWwDxsisp4X4nS8iFD
oak9BWRDhRwSzPnyCu8BI3Hc4ayxcBYR/vmBS12LTkIJVdR1CaqfeDYvZexyX04n
TYZnmPeZ+2R3wzA9aCLjUDEAYNi403sLqIvR99jlrOoiMo+5Bf05TvBG8lND2dQg
A13854vg9ssxfiuEmvxJTg8SLtcFiqEtrzZhqaLVfuf4cqaN3o0Ed+A0gjZadwwD
TPxiMh4YEZuw+qszZwyzi7uECKOvJIJeLCL8CApPvihCNFy2vKF5wj58ZizXXmoT
Orq6DJ1ITx4DTkYIgtiTC9HKDuvNDZ/ncdb5QoqlpoTe01r8GQOzffZ+tD3xRTNc
8+yAdK0/bTjLVNbFX/Q4dwagh0I8EQSrdw8CmmQs0lsCvfyYTfHz8uPeBQ7rrzeN
2r1H7a/OKpL/duTkEw==
=m6sX
-----END PGP SIGNATURE-----
Merge tag 'pstore-v5.0-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull pstore fixes from Kees Cook:
- Fix console ramoops to show the previous boot logs (Sai Prakash
Ranjan)
- Avoid allocation and leak of platform data
* tag 'pstore-v5.0-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
pstore/ram: Avoid allocation and leak of platform data
pstore/ram: Fix console ramoops to show the previous boot logs
Yue Hu noticed that when parsing device tree the allocated platform data
was never freed. Since it's not used beyond the function scope, this
switches to using a stack variable instead.
Reported-by: Yue Hu <huyue2@yulong.com>
Fixes: 35da60941e ("pstore/ram: add Device Tree bindings")
Cc: stable@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAlxElHsACgkQxWXV+ddt
WDsF8Q/6A+l/Ku8D+xSmw+JGXGNroX1V62sxVYWqIgrkYNg6iSMjicqF3aN1bCbR
LqvOQ5skerrKteNYIPbTTOD5Xp37ccinSeEWEF0ktFkeU1G6yJo7aRsnQlO1sduk
2PKUOA1/PdeTBiOj1bej/1ybhtIW+d0MaoPtnUCMC8DD/ihfmU332+KC8VmRUYGZ
4kT1DvKfBOSVz1UTl6OJdWo76crvjz0eGVnH1YG7DoFbpVzAbphHE7+aC/WkHQme
X5Ux2NtYVMe/0IGAC7kJnj4jCZ1weAdlmvzzagjzGtWdhWgTxntiJ/FVJs9nO8Dm
G/pVtD8RVjgMkPOWcfw5fdderrBJqjVGgl4VDrDLqjO9OTGNFJs+HgcPRi6Oli28
sA+HG+U34YzSfKY0L9eAmpkNxMjWywBuXTQIAlMhHNZCL0vZ2K5EYa3dTp9OswAW
IIcOh/LfZxiomvMvUqQWcRCy5y/b+cYjOjbHwkrw+ewd3IWXVLG8YLMyZI3vnHKu
/f1xn6KCap9a1cS4LwyK6gzstEugn0MYmnmD/Jx8I1BJFBt55Q31ES6tPHgTmh/d
QjveRjMkxNCql4h5Hq0+LiXoSoocBmsO0wrs2QrWSx4PBnJsvjySnWr/8GfAOj79
BhnuQFxbr/BkNyBzvKrjoI+zZnrVm0cBU59lP6PzN75+kQTaIfs=
=hGlM
-----END PGP SIGNATURE-----
Merge tag 'for-5.0-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"A handful of fixes (some of them in testing for a long time):
- fix some test failures regarding cleanup after transaction abort
- revert of a patch that could cause a deadlock
- delayed iput fixes, that can help in ENOSPC situation when there's
low space and a lot data to write"
* tag 'for-5.0-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: wakeup cleaner thread when adding delayed iput
btrfs: run delayed iputs before committing
btrfs: wait on ordered extents on abort cleanup
btrfs: handle delayed ref head accounting cleanup in abort
Revert "btrfs: balance dirty metadata pages in btrfs_finish_ordered_io"
Stable bugfixes:
- Fix TCP receive code on archs with flush_dcache_page()
Other bugfixes:
- Fix error code in rpcrdma_buffer_create()
- Fix a double free in rpcrdma_send_ctxs_create()
- Fix kernel BUG at kernel/cred.c:825
- Fix unnecessary retry in nfs42_proc_copy_file_range()
- Ensure rq_bytes_sent is reset before request transmission
- Ensure we respect the RPCSEC_GSS sequence number limit
- Address Kerberos performance/behavior regression
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAlxCH1kACgkQ18tUv7Cl
QOs4rBAAymqyhUzNgap1TX/KezFxqii7CVMVabrA5eGN+ZXbSVAZkwy7BMZWwVIp
tEvD7lxWtF11x7bQDw7Xz+ruBCjLdD0RQIFnlBpVKqsRy9oSRA4PsgSbuIFaw+gX
Bun4Z0xmOCPF7knRv6gQonArEZfHeokIIN8AtSBtWVByaOrnZwgDkNTIub8akpUl
FQlzgq7lTydVzNcju2ImBeubU7KgFEu0F2Zub5z/iR+F2Mx/bAju8Q4YeVlPyD8U
QJoIBlXAvgK8LK4bZCh40zPeEt0TMWXnW7o0JHgVQ0g6VbT+hp17I7fz91xEazye
qbjpIJIjv5daEv0REM8t5ZCZB3tEatVjb4EQWXp0gJYb0l5E3I/O+7MO44n4uMYx
s3UTxzM6NjwCtlgmn4tYUj+vEIExQHUUnwOl02e5iEa7bqNNY75ehAhj5Rh7iQBH
H4b+OVuqc608q87rNePdK1LRyh0/u1cDI1kDAQoIP2omlb5hJQGk0Nuz9G2BodIj
rP0x7nV+ykOXZtr6TR+RvaksL1W39PzVKYA0aL+e2gbcv4YO+Oq1phvNKwRWPM4a
g08r/kvifS5h6/Jq8Wmn83f1vAOX7Sf23RtEoj+t9hc4S4JbsV2iYK3PY3eWbSYE
Oz0Vt4gvBBJ+0rHJ10BsQ7686OQkyMKpIlvmx6O5mWVlthovbJM=
=6Nzz
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-5.0-2' of git://git.linux-nfs.org/projects/anna/linux-nfs
Pull NFS client fixes from Anna Schumaker:
"These are mostly fixes for SUNRPC bugs, with a single v4.2
copy_file_range() fix mixed in.
Stable bugfixes:
- Fix TCP receive code on archs with flush_dcache_page()
Other bugfixes:
- Fix error code in rpcrdma_buffer_create()
- Fix a double free in rpcrdma_send_ctxs_create()
- Fix kernel BUG at kernel/cred.c:825
- Fix unnecessary retry in nfs42_proc_copy_file_range()
- Ensure rq_bytes_sent is reset before request transmission
- Ensure we respect the RPCSEC_GSS sequence number limit
- Address Kerberos performance/behavior regression"
* tag 'nfs-for-5.0-2' of git://git.linux-nfs.org/projects/anna/linux-nfs:
SUNRPC: Address Kerberos performance/behavior regression
SUNRPC: Ensure we respect the RPCSEC_GSS sequence number limit
SUNRPC: Ensure rq_bytes_sent is reset before request transmission
NFSv4.2 fix unnecessary retry in nfs4_copy_file_range
sunrpc: kernel BUG at kernel/cred.c:825!
SUNRPC: Fix TCP receive code on archs with flush_dcache_page()
xprtrdma: Double free in rpcrdma_sendctxs_create()
xprtrdma: Fix error code in rpcrdma_buffer_create()
The cleaner thread usually takes care of delayed iputs, with the
exception of the btrfs_end_transaction_throttle path. Delaying iputs
means we are potentially delaying the eviction of an inode and it's
respective space. The cleaner thread only gets woken up every 30
seconds, or when we require space. If there are a lot of inodes that
need to be deleted we could induce a serious amount of latency while we
wait for these inodes to be evicted. So instead wakeup the cleaner if
it's not already awake to process any new delayed iputs we add to the
list. If we suddenly need space we will less likely be backed up
behind a bunch of inodes that are waiting to be deleted, and we could
possibly free space before we need to get into the flushing logic which
will save us some latency.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Delayed iputs means we can have final iputs of deleted inodes in the
queue, which could potentially generate a lot of pinned space that could
be free'd. So before we decide to commit the transaction for ENOPSC
reasons, run the delayed iputs so that any potential space is free'd up.
If there is and we freed enough we can then commit the transaction and
potentially be able to make our reservation.
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This reverts commit e73e81b6d0.
This patch causes a few problems:
- adds latency to btrfs_finish_ordered_io
- as btrfs_finish_ordered_io is used for free space cache, generating
more work from btrfs_btree_balance_dirty_nodelay could end up in the
same workque, effectively deadlocking
12260 kworker/u96:16+btrfs-freespace-write D
[<0>] balance_dirty_pages+0x6e6/0x7ad
[<0>] balance_dirty_pages_ratelimited+0x6bb/0xa90
[<0>] btrfs_finish_ordered_io+0x3da/0x770
[<0>] normal_work_helper+0x1c5/0x5a0
[<0>] process_one_work+0x1ee/0x5a0
[<0>] worker_thread+0x46/0x3d0
[<0>] kthread+0xf5/0x130
[<0>] ret_from_fork+0x24/0x30
[<0>] 0xffffffffffffffff
Transaction commit will wait on the freespace cache:
838 btrfs-transacti D
[<0>] btrfs_start_ordered_extent+0x154/0x1e0
[<0>] btrfs_wait_ordered_range+0xbd/0x110
[<0>] __btrfs_wait_cache_io+0x49/0x1a0
[<0>] btrfs_write_dirty_block_groups+0x10b/0x3b0
[<0>] commit_cowonly_roots+0x215/0x2b0
[<0>] btrfs_commit_transaction+0x37e/0x910
[<0>] transaction_kthread+0x14d/0x180
[<0>] kthread+0xf5/0x130
[<0>] ret_from_fork+0x24/0x30
[<0>] 0xffffffffffffffff
And then writepages ends up waiting on transaction commit:
9520 kworker/u96:13+flush-btrfs-1 D
[<0>] wait_current_trans+0xac/0xe0
[<0>] start_transaction+0x21b/0x4b0
[<0>] cow_file_range_inline+0x10b/0x6b0
[<0>] cow_file_range.isra.69+0x329/0x4a0
[<0>] run_delalloc_range+0x105/0x3c0
[<0>] writepage_delalloc+0x119/0x180
[<0>] __extent_writepage+0x10c/0x390
[<0>] extent_write_cache_pages+0x26f/0x3d0
[<0>] extent_writepages+0x4f/0x80
[<0>] do_writepages+0x17/0x60
[<0>] __writeback_single_inode+0x59/0x690
[<0>] writeback_sb_inodes+0x291/0x4e0
[<0>] __writeback_inodes_wb+0x87/0xb0
[<0>] wb_writeback+0x3bb/0x500
[<0>] wb_workfn+0x40d/0x610
[<0>] process_one_work+0x1ee/0x5a0
[<0>] worker_thread+0x1e0/0x3d0
[<0>] kthread+0xf5/0x130
[<0>] ret_from_fork+0x24/0x30
[<0>] 0xffffffffffffffff
Eventually, we have every process in the system waiting on
balance_dirty_pages(), and nobody is able to make progress on page
writeback.
The original patch tried to fix an OOM condition, that happened on 4.4 but no
success reproducing that on later kernels (4.19 and 4.20). This is more likely
a problem in OOM itself.
Link: https://lore.kernel.org/linux-btrfs/20180528054821.9092-1-ethanlien@synology.com/
Reported-by: Chris Mason <clm@fb.com>
CC: stable@vger.kernel.org # 4.18+
CC: ethanlien <ethanlien@synology.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIVAwUAXECcyPu3V2unywtrAQIpGw//ctcGHg2sfUEra17pvlKEbpZe9bMeJxp6
UY2YR5gPpiYMmvNe8hLl3I68b43h03jGOx+KmqowqX7anNq3o2nMy0pDbuGmKtuS
5NmIOECAml8k2uPSpacAF2s6TsxB2lTDYwZdyeuRmZ4scOTujNby33RlijGIxX4s
87WJFRuCacm9I1KkiKKn4PWoYGjDdsZ7rsDyEeBmQ/MiKOSLG4QP5XuNr4X9zFMX
r8uF3N8h/NzJWefEirc2DPFfiWLJqkyclq9tgsTB1Z3l+x5u/MHnIg3rZpZhH0uC
GhGWjlGYqTxOwzYCaOOsNIDRF4rAGPi3lzuJXjONnhvbOO7DCGJ+Mo2obxkAqLL6
PrtFuQvgXIOl/k8y6AdckuPPu/OMHT1hyY0PQXmTGhHAAfPP3RHoPl7owEjmAbRg
hvRkFDSIKZ4Kr1nKP4vwaJYEKtxUQrkOwZKmN6ve31ZeJsyrH12MsCgWvp7oQwRJ
fVbk8DWVRtYzy4RaO3Xr+0WfD+03dDi6KKCPPiC2gtNKOO+1Kco/EtIqPi6SXg0m
ee/mFOkRsmEh++iNxS58qLxH37On6GSYOElSIMN0NJDNA2TzLrvidXlMSOTD1hj4
n6gL38E3br/CIimKPYm87qi6yC59CAtrJCulYiPOoMc20eaEUP4DIXN8yjgjLNQp
Hj5M9GRTwXk=
=cUrU
-----END PGP SIGNATURE-----
Merge tag 'afs-fixes-20190117' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
Pull AFS fixes from David Howells:
"Here's a set of fixes for AFS:
- Use struct_size() for kzalloc() size calculation.
- When calling YFS.CreateFile rather than AFS.CreateFile, it is
possible to create a file with a file lock already held. The
default value indicating no lock required is actually -1, not 0.
- Fix an oops in inode/vnode validation if the target inode doesn't
have a server interest assigned (ie. a server that will notify us
of changes by third parties).
- Fix refcounting of keys in file locking.
- Fix a race in refcounting asynchronous operations in the event of
an error during request transmission. The provision of a dedicated
function to get an extra ref on a call is split into a separate
commit"
* tag 'afs-fixes-20190117' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
afs: Fix race in async call refcounting
afs: Provide a function to get a ref on a call
afs: Fix key refcounting in file locking code
afs: Don't set vnode->cb_s_break in afs_validate()
afs: Set correct lock type for the yfs CreateFile
afs: Use struct_size() in kzalloc()
commit b05c950698 ("pstore/ram: Simplify ramoops_get_next_prz()
arguments") changed update assignment in getting next persistent ram zone
by adding a check for record type. But the check always returns true since
the record type is assigned 0. And this breaks console ramoops by showing
current console log instead of previous log on warm reset and hard reset
(actually hard reset should not be showing any logs).
Fix this by having persistent ram zone type check instead of record type
check. Tested this on SDM845 MTP and dragonboard 410c.
Reproducing this issue is simple as below:
1. Trigger hard reset and mount pstore. Will see console-ramoops
record in the mounted location which is the current log.
2. Trigger warm reset and mount pstore. Will see the current
console-ramoops record instead of previous record.
Fixes: b05c950698 ("pstore/ram: Simplify ramoops_get_next_prz() arguments")
Signed-off-by: Sai Prakash Ranjan <saiprakash.ranjan@codeaurora.org>
Acked-by: Joel Fernandes (Google) <joel@joelfernandes.org>
[kees: dropped local variable usage]
Signed-off-by: Kees Cook <keescook@chromium.org>
same story as with last May fixes in sysfs (7b745a4e40
"unfuck sysfs_mount()"); new_sb is left uninitialized
in case of early errors in kernfs_mount_ns() and papering
over it by treating any error from kernfs_mount_ns() as
equivalent to !new_ns ends up conflating the cases when
objects had never been transferred to a superblock with
ones when that has happened and resulting new superblock
had been dropped. Easily fixed (same way as in sysfs
case). Additionally, there's a superblock leak on
kernfs_node_dentry() failure *and* a dentry leak inside
kernfs_node_dentry() itself - the latter on probably
impossible errors, but the former not impossible to trigger
(as the matter of fact, injecting allocation failures
at that point *does* trigger it).
Cc: stable@kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
There's a race between afs_make_call() and afs_wake_up_async_call() in the
case that an error is returned from rxrpc_kernel_send_data() after it has
queued the final packet.
afs_make_call() will try and clean up the mess, but the call state may have
been moved on thereby causing afs_process_async_call() to also try and to
delete the call.
Fix this by:
(1) Getting an extra ref for an asynchronous call for the call itself to
hold. This makes sure the call doesn't evaporate on us accidentally
and will allow the call to be retained by the caller in a future
patch. The ref is released on leaving afs_make_call() or
afs_wait_for_call_to_complete().
(2) In the event of an error from rxrpc_kernel_send_data():
(a) Don't set the call state to AFS_CALL_COMPLETE until *after* the
call has been aborted and ended. This prevents
afs_deliver_to_call() from doing anything with any notifications
it gets.
(b) Explicitly end the call immediately to prevent further callbacks.
(c) Cancel any queued async_work and wait for the work if it's
executing. This allows us to be sure the race won't recur when we
change the state. We put the work queue's ref on the call if we
managed to cancel it.
(d) Put the call's ref that we got in (1). This belongs to us as long
as the call is in state AFS_CALL_CL_REQUESTING.
Fixes: 341f741f04 ("afs: Refcount the afs_call struct")
Signed-off-by: David Howells <dhowells@redhat.com>
Fix the refcounting of the authentication keys in the file locking code.
The vnode->lock_key member points to a key on which it expects to be
holding a ref, but it isn't always given an extra ref, however.
Fixes: 0fafdc9f88 ("afs: Fix file locking")
Signed-off-by: David Howells <dhowells@redhat.com>
A cb_interest record is not necessarily attached to the vnode on entry to
afs_validate(), which can cause an oops when we try to bring the vnode's
cb_s_break up to date in the default case (ie. no current callback promise
and the vnode has not been deleted).
Fix this by simply removing the line, as vnode->cb_s_break will be set when
needed by afs_register_server_cb_interest() when we next get a callback
promise from RPC call.
The oops looks something like:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
...
RIP: 0010:afs_validate+0x66/0x250 [kafs]
...
Call Trace:
afs_d_revalidate+0x8d/0x340 [kafs]
? __d_lookup+0x61/0x150
lookup_dcache+0x44/0x70
? lookup_dcache+0x44/0x70
__lookup_hash+0x24/0xa0
do_unlinkat+0x11d/0x2c0
__x64_sys_unlink+0x23/0x30
do_syscall_64+0x4d/0xf0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Fixes: ae3b7361dc ("afs: Fix validation/callback interaction")
Signed-off-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
NR_WRITEBACK_TEMP is accounted on the temporary page in the request, not
the page cache page.
Fixes: 8b284dc472 ("fuse: writepages: handle same page rewrites")
Cc: <stable@vger.kernel.org> # v3.13
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Some of the pipe_buf_release() handlers seem to assume that the pipe is
locked - in particular, anon_pipe_buf_release() accesses pipe->tmp_page
without taking any extra locks. From a glance through the callers of
pipe_buf_release(), it looks like FUSE is the only one that calls
pipe_buf_release() without having the pipe locked.
This bug should only lead to a memory leak, nothing terrible.
Fixes: dd3bb14f44 ("fuse: support splice() writing to fuse device")
Cc: stable@vger.kernel.org
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Currently nfs42_proc_copy_file_range() can not return EAGAIN.
Fixes: e4648aa4f9 ("NFS recover from destination server reboot for copies")
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
bd_set_size() updates also block device's block size. This is somewhat
unexpected from its name and at this point, only blkdev_open() uses this
functionality. Furthermore, this can result in changing block size under
a filesystem mounted on a loop device which leads to livelocks inside
__getblk_gfp() like:
Sending NMI from CPU 0 to CPUs 1:
NMI backtrace for cpu 1
CPU: 1 PID: 10863 Comm: syz-executor0 Not tainted 4.18.0-rc5+ #151
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google
01/01/2011
RIP: 0010:__sanitizer_cov_trace_pc+0x3f/0x50 kernel/kcov.c:106
...
Call Trace:
init_page_buffers+0x3e2/0x530 fs/buffer.c:904
grow_dev_page fs/buffer.c:947 [inline]
grow_buffers fs/buffer.c:1009 [inline]
__getblk_slow fs/buffer.c:1036 [inline]
__getblk_gfp+0x906/0xb10 fs/buffer.c:1313
__bread_gfp+0x2d/0x310 fs/buffer.c:1347
sb_bread include/linux/buffer_head.h:307 [inline]
fat12_ent_bread+0x14e/0x3d0 fs/fat/fatent.c:75
fat_ent_read_block fs/fat/fatent.c:441 [inline]
fat_alloc_clusters+0x8ce/0x16e0 fs/fat/fatent.c:489
fat_add_cluster+0x7a/0x150 fs/fat/inode.c:101
__fat_get_block fs/fat/inode.c:148 [inline]
...
Trivial reproducer for the problem looks like:
truncate -s 1G /tmp/image
losetup /dev/loop0 /tmp/image
mkfs.ext4 -b 1024 /dev/loop0
mount -t ext4 /dev/loop0 /mnt
losetup -c /dev/loop0
l /mnt
Fix the problem by moving initialization of a block device block size
into a separate function and call it when needed.
Thanks to Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> for help with
debugging the problem.
Reported-by: syzbot+9933e4476f365f5d5a1b@syzkaller.appspotmail.com
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAlw7Y68ACgkQxWXV+ddt
WDs+Jw/9Hn71GLb1YNpNUlzJSQHUzbvFtXbtvEVyYeuuLkmjTG4FT2bkGqYhAeC9
2jvaqPaxI0VJ7vNulAaMVZayFwNEQrF9p+z+vzV/Ty2lu0Ep/7Uuwp9KL8X49req
5hEtb5p9Dbj9hXqa8XNOfF2GPDRTQ6D4eknYiNM7MY+aTkvMEtpuZg9kfwXrZ2au
JeFBzZo9SA5nxqrXZg1XlRYttDnOf44h6YJmEFOZOJuAouKcd5I7C8BshCRKDuWo
iJBjYTBTFqjgUgCrn10UM92T19hKufHiDE3WWQ/7zykqtThpKgBFR1opAUcNBJBf
HvOOsYnZcGbxIKOjFpucSpButTmjFcnI83f/7dmZYXUsyIzP/xH51uLiz/CLJqWj
JsRgtgtJ7l5s1M8GkFG6B9Tp89KxHnVKqNC5HyX+4AuFWiIJ+CWWSddGqNSfAIHe
o2ceQRMumvwbVKgfd0AfPcZ9v4sRM/DfwxWXgEQmCSNJikSuUjP9b3D51ttnXQ8c
q6lz7L6nRKK4mgBnfJCmpus3IArdvNwJ7CF8C1RejfRwL824RzH+yHjWdg36nVXB
oaBYKJf8GRRn+3In1W6npr65NxCQERIZV4M89EgST/RK7tW5u0Fotd1KJCmo+ayO
cSRbp16On9G6gBV+qrBs8X65KIWALZpdTycwyFplCU581DXdtnU=
=aCB2
-----END PGP SIGNATURE-----
Merge tag 'for-5.0-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- two regression fixes in clone/dedupe ioctls, the generic check
callback needs to lock extents properly and wait for io to avoid
problems with writeback and relocation
- fix deadlock when using free space tree due to block group creation
- a recently added check refuses a valid fileystem with seeding device,
make that work again with a quickfix, proper solution needs more
intrusive changes
* tag 'for-5.0-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: Use real device structure to verify dev extent
Btrfs: fix deadlock when using free space tree due to block group creation
Btrfs: fix race between reflink/dedupe and relocation
Btrfs: fix race between cloning range ending at eof and writeback
Here is one small sysfs change, and a documentation update for 5.0-rc2
The sysfs change moves from using BUG_ON to WARN_ON, as discussed in an
email thread on lkml while trying to track down another driver bug.
sysfs should not be crashing and preventing people from seeing where
they went wrong. Now it properly recovers and warns the developer.
The documentation update removes the use of BUS_ATTR() as the kernel is
moving away from this to use the specific BUS_ATTR_RW() and friends
instead. There are pending patches in all of the different subsystems
to remove the last users of this macro, but for now, don't advertise it
should be used anymore to keep new ones from being introduced.
Both have been in linux-next with no reported issues.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCXDsRFA8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+ynbgACfTY/rAZpWTgMdPDfoOmF+s/XHQXsAoJKZ+v+f
Tpkcw76Wo1ESpPLuT1u1
=W0D+
-----END PGP SIGNATURE-----
Merge tag 'driver-core-5.0-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core fixes from Greg KH:
"Here is one small sysfs change, and a documentation update for 5.0-rc2
The sysfs change moves from using BUG_ON to WARN_ON, as discussed in
an email thread on lkml while trying to track down another driver bug.
sysfs should not be crashing and preventing people from seeing where
they went wrong. Now it properly recovers and warns the developer.
The documentation update removes the use of BUS_ATTR() as the kernel
is moving away from this to use the specific BUS_ATTR_RW() and friends
instead. There are pending patches in all of the different subsystems
to remove the last users of this macro, but for now, don't advertise
it should be used anymore to keep new ones from being introduced.
Both have been in linux-next with no reported issues"
* tag 'driver-core-5.0-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
Documentation: driver core: remove use of BUS_ATTR
sysfs: convert BUG_ON to WARN_ON
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAlw6VMAACgkQiiy9cAdy
T1GA4AwAiLxAK9gYkv7oCi4xkr+GeDr9/J0R4JZOglXrJQfbEDAgejpTqIVOeiTW
n2fDEe4JCj3Dk9VnhGIAcV3Lyp5pcXuiqKwBJsxlJIIM81rDa98K8NgpJqjFDt1P
QEsGHijO7/5rlefnMPUy0YrFdoGr9ZjDEYDx2RdTkuRAX2rqLI02ytqyB6AHMjSM
zx5Ck1JdltrZa9vN1k2QoUW92FLrHWHDQM6ooRTiTv1n5v3CtVCFNFEUD9AWCdKy
APFXva3cWPAh6Dvsh6Qtryn2WdSfDdy69iiqfkBrDaaLsJC0oCogK9TsQOL66Szu
onpXQROqxvOA0UoyJJgRaDhRjGzRLglChpitS9ps4ZqziXRMpKBsn9SzaZiJ3Jna
/u5Pv5CFr4JkXRnvLAx2h6coZ+FyfVmSgxtoCU+tdn8G0t2AXl7swUw2e2UX7k4y
FR20/w0hA9pDVSlrY1Z+Yg7TCZLS5677+6xFo+b+rnF6VW8XZQc/2F9MYlxyGAj5
30R/Q0ck
=410H
-----END PGP SIGNATURE-----
Merge tag '5.0-rc1-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fixes from Steve French:
"A set of cifs/smb3 fixes, 4 for stable, most from Pavel. His patches
fix an important set of crediting (flow control) problems, and also
two problems in cifs_writepages, ddressing some large i/o and also
compounding issues"
* tag '5.0-rc1-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
cifs: update internal module version number
CIFS: Fix error paths in writeback code
CIFS: Move credit processing to mid callbacks for SMB3
CIFS: Fix credits calculation for cancelled requests
cifs: Fix potential OOB access of lock element array
cifs: Limit memory used by lock request calls to a page
cifs: move large array from stack to heap
CIFS: Do not hide EINTR after sending network packets
CIFS: Fix credit computation for compounded requests
CIFS: Do not set credits to 1 if the server didn't grant anything
CIFS: Fix adjustment of credits for MTU requests
cifs: Fix a tiny potential memory leak
cifs: Fix a debug message
edge case, marked for stable.
-----BEGIN PGP SIGNATURE-----
iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAlw4yfETHGlkcnlvbW92
QGdtYWlsLmNvbQAKCRBKf944AhHzi9loB/4iTBs/c5vI8oXSi+n4IAAG/LunP8Ff
HfxMr7Mpbpp0cLdzmJWO63Wejii95w7IkdepJTF5VlpH6dwJDUL6jSAyAXPO1a2G
phZHmnGYjFBKLGPftCZmGQCJeYeS9JJyRgTC4IZECVkQRiPRGXn8cAbEqMYOy6Am
mXmy0u4Me3RJTpcOCqaPPugontvtsZdfqvwVE2dGqXTj9Kh1hfum5ddrvf1CXugQ
MpXzW3mJCMMfcXlZOCVuQInKqPT4HRhTpzJxdECiJTxBXDQUd6e0ekOXmKW6jOYk
ZpNVnXHGee0uvzwQC0/VMWkRF6d+pOsYFQkDlX79h04rqnoBkUNc2Doe
=dJev
-----END PGP SIGNATURE-----
Merge tag 'ceph-for-5.0-rc2' of git://github.com/ceph/ceph-client
Pull ceph updates from Ilya Dryomov:
"A patch to allow setting abort_on_full and a fix for an old "rbd
unmap" edge case, marked for stable"
* tag 'ceph-for-5.0-rc2' of git://github.com/ceph/ceph-client:
rbd: don't return 0 on unmap if RBD_DEV_FLAG_REMOVING is set
ceph: use vmf_error() in ceph_filemap_fault()
libceph: allow setting abort_on_full for rbd
This patch aims to address writeback code problems related to error
paths. In particular it respects EINTR and related error codes and
stores and returns the first error occurred during writeback.
Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
Acked-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Currently we account for credits in the thread initiating a request
and waiting for a response. The demultiplex thread receives the response,
wakes up the thread and the latter collects credits from the response
buffer and add them to the server structure on the client. This approach
is not accurate, because it may race with reconnect events in the
demultiplex thread which resets the number of credits.
Fix this by moving credit processing to new mid callbacks that collect
credits granted by the server from the response in the demultiplex thread.
Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
If a request is cancelled, we can't assume that the server returns
1 credit back. Instead we need to wait for a response and process
the number of credits granted by the server.
Create a separate mid callback for cancelled request, parse the number
of credits in a response buffer and add them to the client's credits.
If the didn't get a response (no response buffer available) assume
0 credits granted. The latter most probably happens together with
session reconnect, so the client's credits are adjusted anyway.
Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
If maxBuf is small but non-zero, it could result in a zero sized lock
element array which we would then try and access OOB.
Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
CC: Stable <stable@vger.kernel.org>
The code tries to allocate a contiguous buffer with a size supplied by
the server (maxBuf). This could fail if memory is fragmented since it
results in high order allocations for commonly used server
implementations. It is also wasteful since there are probably
few locks in the usual case. Limit the buffer to be no larger than a
page to avoid memory allocation failures due to fragmentation.
Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
This addresses some compile warnings that you can
see depending on configuration settings.
Signed-off-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Currently we hide EINTR code returned from sock_sendmsg()
and return 0 instead. This makes a caller think that we
successfully completed the network operation which is not
true. Fix this by properly returning EINTR to callers.
Cc: <stable@vger.kernel.org>
Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
In SMB3 protocol every part of the compound chain consumes credits
individually, so we need to call wait_for_free_credits() for each
of the PDUs in the chain. If an operation is interrupted, we must
ensure we return all credits taken from the server structure back.
Without this patch server can sometimes disconnect the session
due to credit mismatches, especially when first operation(s)
are large writes.
Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
CC: Stable <stable@vger.kernel.org>
Currently we reset the number of total credits granted by the server
to 1 if the server didn't grant us anything int the response. This
violates the SMB3 protocol - we need to trust the server and use
the credit values from the response. Fix this by removing the
corresponding code.
Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
CC: Stable <stable@vger.kernel.org>
Currently for MTU requests we allocate maximum possible credits
in advance and then adjust them according to the request size.
While we were adjusting the number of credits belonging to the
server, we were skipping adjustment of credits belonging to the
request. This patch fixes it by setting request credits to
CreditCharge field value of SMB2 packet header.
Also ask 1 credit more for async read and write operations to
increase parallelism and match the behavior of other operations.
Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
CC: Stable <stable@vger.kernel.org>
The most recent "it" allocation is leaked on this error path. I
believe that small allocations always succeed in current kernels so
this doesn't really affect run time.
Fixes: 54be1f6c1c ("cifs: Add DFS cache routines")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
This debug message was never shown because it was checking for NULL
returns but extract_hostname() returns error pointers.
Fixes: 93d5cb517d ("cifs: Add support for failover in cifs_reconnect()")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Paulo Alcantara <palcantara@suse.de>
A lock type of 0 is "LockRead", which makes the fileserver record an
unintentional read lock on the new file. This will cause problems
later on if the file is the subject of locking operations.
The correct default value should be -1 ("LockNone").
Fix the operation marshalling code to set the value and provide an enum to
symbolise the values whilst we're at it.
Fixes: 30062bd13e ("afs: Implement YFS support in the fs client")
Signed-off-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
One of the more common cases of allocation size calculations is finding the
size of a structure that has a zero-sized array at the end, along with
memory for some number of elements for that array. For example:
struct foo {
int stuff;
void *entry[];
};
instance = kzalloc(sizeof(struct foo) + sizeof(void *) * count, GFP_KERNEL);
Instead of leaving these open-coded and prone to type mistakes, we can now
use the new struct_size() helper:
instance = kzalloc(struct_size(instance, entry, count), GFP_KERNEL);
This code was detected with the help of Coccinelle.
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
There are new types and helpers that are supposed to be used in new code.
As a preparation to get rid of legacy types and API functions do
the conversion here.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
[BUG]
Linux v5.0-rc1 will fail fstests/btrfs/163 with the following kernel
message:
BTRFS error (device dm-6): dev extent devid 1 physical offset 13631488 len 8388608 is beyond device boundary 0
BTRFS error (device dm-6): failed to verify dev extents against chunks: -117
BTRFS error (device dm-6): open_ctree failed
[CAUSE]
Commit cf90d884b3 ("btrfs: Introduce mount time chunk <-> dev extent
mapping check") introduced strict check on dev extents.
We use btrfs_find_device() with dev uuid and fs uuid set to NULL, and
only dependent on @devid to find the real device.
For seed devices, we call clone_fs_devices() in open_seed_devices() to
allow us search seed devices directly.
However clone_fs_devices() just populates devices with devid and dev
uuid, without populating other essential members, like disk_total_bytes.
This makes any device returned by btrfs_find_device(fs_info, devid,
NULL, NULL) is just a dummy, with 0 disk_total_bytes, and any dev
extents on the seed device will not pass the device boundary check.
[FIX]
This patch will try to verify the device returned by btrfs_find_device()
and if it's a dummy then re-search in seed devices.
Fixes: cf90d884b3 ("btrfs: Introduce mount time chunk <-> dev extent mapping check")
CC: stable@vger.kernel.org # 4.19+
Reported-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If new mode is the same as old mode we don't have to reset
inode mode in the rest of the code, so compare old and new
mode before setting update_mode flag.
Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
There is a comment in struct jfs_superblock that incorrectly labels
a 128-byte boundary. It has never been correct.
Shenghui Wang proposed moving it to the correct spot, before s_xlogpxd,
but at this point, I believe it is best just to remove it.
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Reported-by: Shenghui Wang <shhuiw@foxmail.com>
When modifying the free space tree we can end up COWing one of its extent
buffers which in turn might result in allocating a new chunk, which in
turn can result in flushing (finish creation) of pending block groups. If
that happens we can deadlock because creating a pending block group needs
to update the free space tree, and if any of the updates tries to modify
the same extent buffer that we are COWing, we end up in a deadlock since
we try to write lock twice the same extent buffer.
So fix this by skipping pending block group creation if we are COWing an
extent buffer from the free space tree. This is a case missed by commit
5ce555578e ("Btrfs: fix deadlock when writing out free space caches").
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=202173
Fixes: 5ce555578e ("Btrfs: fix deadlock when writing out free space caches")
CC: stable@vger.kernel.org # 4.18+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The recent rework that makes btrfs' remap_file_range operation use the
generic helper generic_remap_file_range_prep() introduced a race between
relocation and reflinking (for both cloning and deduplication) the file
extents between the source and destination inodes.
This happens because we no longer lock the source range anymore, and we do
not lock it anymore because we wait for direct IO writes and writeback to
complete early on the code path right after locking the inodes, which
guarantees no other file operations interfere with the reflinking. However
there is one exception which is relocation, since it replaces the byte
number of file extents items in the fs tree after locking the range the
file extent items represent. This is a problem because after finding each
file extent to clone in the fs tree, the reflink process copies the file
extent item into a local buffer, releases the search path, inserts new
file extent items in the destination range and then increments the
reference count for the extent mentioned in the file extent item that it
previously copied to the buffer. If right after copying the file extent
item into the buffer and releasing the path the relocation process
updates the file extent item to point to the new extent, the reflink
process ends up creating a delayed reference to increment the reference
count of the old extent, for which the relocation process already created
a delayed reference to drop it. This results in failure to run delayed
references because we will attempt to increment the count of a reference
that was already dropped. This is illustrated by the following diagram:
CPU 1 CPU 2
relocation is running
btrfs_clone_files()
btrfs_clone()
--> finds extent item
in source range
point to extent
at bytenr X
--> copies it into a
local buffer
--> releases path
replace_file_extents()
--> successfully locks the
range represented by
the file extent item
--> replaces disk_bytenr
field in the file
extent item with some
other value Y
--> creates delayed reference
to increment reference
count for extent at
bytenr Y
--> creates delayed reference
to drop the extent at
bytenr X
--> starts transaction
--> creates delayed
reference to
increment extent
at bytenr X
<delayed references are run, due to a transaction
commit for example, and the transaction is aborted
with -EIO because we attempt to increment reference
count for the extent at bytenr X after we freed it>
When this race is hit the running transaction ends up getting aborted with
an -EIO error and a trace like the following is produced:
[ 4382.553858] WARNING: CPU: 2 PID: 3648 at fs/btrfs/extent-tree.c:1552 lookup_inline_extent_backref+0x4f4/0x650 [btrfs]
(...)
[ 4382.556293] CPU: 2 PID: 3648 Comm: btrfs Tainted: G W 4.20.0-rc6-btrfs-next-41 #1
[ 4382.556294] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
[ 4382.556308] RIP: 0010:lookup_inline_extent_backref+0x4f4/0x650 [btrfs]
(...)
[ 4382.556310] RSP: 0018:ffffac784408f738 EFLAGS: 00010202
[ 4382.556311] RAX: 0000000000000001 RBX: ffff8980673c3a48 RCX: 0000000000000001
[ 4382.556312] RDX: 0000000000000008 RSI: 0000000000000000 RDI: 0000000000000000
[ 4382.556312] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000001
[ 4382.556313] R10: 0000000000000001 R11: ffff897f40000000 R12: 0000000000001000
[ 4382.556313] R13: 00000000c224f000 R14: ffff89805de9bd40 R15: ffff8980453f4548
[ 4382.556315] FS: 00007f5e759178c0(0000) GS:ffff89807b300000(0000) knlGS:0000000000000000
[ 4382.563130] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 4382.563562] CR2: 00007f2e9789fcbc CR3: 0000000120512001 CR4: 00000000003606e0
[ 4382.564005] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 4382.564451] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 4382.564887] Call Trace:
[ 4382.565343] insert_inline_extent_backref+0x55/0xe0 [btrfs]
[ 4382.565796] __btrfs_inc_extent_ref.isra.60+0x88/0x260 [btrfs]
[ 4382.566249] ? __btrfs_run_delayed_refs+0x93/0x1650 [btrfs]
[ 4382.566702] __btrfs_run_delayed_refs+0xa22/0x1650 [btrfs]
[ 4382.567162] btrfs_run_delayed_refs+0x7e/0x1d0 [btrfs]
[ 4382.567623] btrfs_commit_transaction+0x50/0x9c0 [btrfs]
[ 4382.568112] ? _raw_spin_unlock+0x24/0x30
[ 4382.568557] ? block_rsv_release_bytes+0x14e/0x410 [btrfs]
[ 4382.569006] create_subvol+0x3c8/0x830 [btrfs]
[ 4382.569461] ? btrfs_mksubvol+0x317/0x600 [btrfs]
[ 4382.569906] btrfs_mksubvol+0x317/0x600 [btrfs]
[ 4382.570383] ? rcu_sync_lockdep_assert+0xe/0x60
[ 4382.570822] ? __sb_start_write+0xd4/0x1c0
[ 4382.571262] ? mnt_want_write_file+0x24/0x50
[ 4382.571712] btrfs_ioctl_snap_create_transid+0x117/0x1a0 [btrfs]
[ 4382.572155] ? _copy_from_user+0x66/0x90
[ 4382.572602] btrfs_ioctl_snap_create+0x66/0x80 [btrfs]
[ 4382.573052] btrfs_ioctl+0x7c1/0x30e0 [btrfs]
[ 4382.573502] ? mem_cgroup_commit_charge+0x8b/0x570
[ 4382.573946] ? do_raw_spin_unlock+0x49/0xc0
[ 4382.574379] ? _raw_spin_unlock+0x24/0x30
[ 4382.574803] ? __handle_mm_fault+0xf29/0x12d0
[ 4382.575215] ? do_vfs_ioctl+0xa2/0x6f0
[ 4382.575622] ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
[ 4382.576020] do_vfs_ioctl+0xa2/0x6f0
[ 4382.576405] ksys_ioctl+0x70/0x80
[ 4382.576776] __x64_sys_ioctl+0x16/0x20
[ 4382.577137] do_syscall_64+0x60/0x1b0
[ 4382.577488] entry_SYSCALL_64_after_hwframe+0x49/0xbe
(...)
[ 4382.578837] RSP: 002b:00007ffe04bf64c8 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
[ 4382.579174] RAX: ffffffffffffffda RBX: 00005564136f3050 RCX: 00007f5e74724dd7
[ 4382.579505] RDX: 00007ffe04bf64d0 RSI: 000000005000940e RDI: 0000000000000003
[ 4382.579848] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000044
[ 4382.580164] R10: 0000000000000541 R11: 0000000000000202 R12: 00005564136f3010
[ 4382.580477] R13: 0000000000000003 R14: 00005564136f3035 R15: 00005564136f3050
[ 4382.580792] irq event stamp: 0
[ 4382.581106] hardirqs last enabled at (0): [<0000000000000000>] (null)
[ 4382.581441] hardirqs last disabled at (0): [<ffffffff8d085842>] copy_process.part.32+0x6e2/0x2320
[ 4382.581772] softirqs last enabled at (0): [<ffffffff8d085842>] copy_process.part.32+0x6e2/0x2320
[ 4382.582095] softirqs last disabled at (0): [<0000000000000000>] (null)
[ 4382.582413] ---[ end trace d3c188e3e9367382 ]---
[ 4382.623855] BTRFS: error (device sdc) in btrfs_run_delayed_refs:2981: errno=-5 IO failure
[ 4382.624295] BTRFS info (device sdc): forced readonly
Fix this by locking the source range before searching for the file extent
items in the fs tree, since the relocation process will try to lock the
range a file extent item represents before updating it with the new extent
location.
Fixes: 34a28e3d77 ("Btrfs: use generic_remap_file_range_prep() for cloning and deduplication")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The recent rework that makes btrfs' remap_file_range operation use the
generic helper generic_remap_file_range_prep() introduced a race between
writeback and cloning a range that covers the eof extent of the source
file into a destination offset that is greater then the same file's size.
This happens because we now wait for writeback to complete before doing
the truncation of the eof block, while previously we did the truncation
and then waited for writeback to complete. This leads to a race between
writeback of the truncated block and cloning the file extents in the
source range, because we copy each file extent item we find in the fs
root into a buffer, then release the path and then increment the reference
count for the extent referred in that file extent item we copied, which
can no longer exist if writeback of the truncated eof block completes
after we copied the file extent item into the buffer and before we
incremented the reference count. This is illustrated by the following
diagram:
CPU 1 CPU 2
btrfs_clone_files()
btrfs_cont_expand()
btrfs_truncate_block()
--> zeroes part of the
page containg eof,
marking it for
delalloc
btrfs_clone()
--> finds extent item
covering eof,
points to extent
at bytenr X
--> copies it into a
local buffer
--> releases path
writeback starts
btrfs_finish_ordered_io()
insert_reserved_file_extent()
__btrfs_drop_extents()
--> creates delayed
reference to drop
the extent at
bytenr X
--> starts transaction
--> creates delayed
reference to
increment extent
at bytenr X
<delayed references are run, due to a transaction
commit for example, and the transaction is aborted
with -EIO because we attempt to increment reference
count for the extent at bytenr X after we freed it>
When this race is hit the running transaction ends up getting aborted with
an -EIO error and a trace like the following is produced:
[ 4382.553858] WARNING: CPU: 2 PID: 3648 at fs/btrfs/extent-tree.c:1552 lookup_inline_extent_backref+0x4f4/0x650 [btrfs]
(...)
[ 4382.556293] CPU: 2 PID: 3648 Comm: btrfs Tainted: G W 4.20.0-rc6-btrfs-next-41 #1
[ 4382.556294] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
[ 4382.556308] RIP: 0010:lookup_inline_extent_backref+0x4f4/0x650 [btrfs]
(...)
[ 4382.556310] RSP: 0018:ffffac784408f738 EFLAGS: 00010202
[ 4382.556311] RAX: 0000000000000001 RBX: ffff8980673c3a48 RCX: 0000000000000001
[ 4382.556312] RDX: 0000000000000008 RSI: 0000000000000000 RDI: 0000000000000000
[ 4382.556312] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000001
[ 4382.556313] R10: 0000000000000001 R11: ffff897f40000000 R12: 0000000000001000
[ 4382.556313] R13: 00000000c224f000 R14: ffff89805de9bd40 R15: ffff8980453f4548
[ 4382.556315] FS: 00007f5e759178c0(0000) GS:ffff89807b300000(0000) knlGS:0000000000000000
[ 4382.563130] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 4382.563562] CR2: 00007f2e9789fcbc CR3: 0000000120512001 CR4: 00000000003606e0
[ 4382.564005] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 4382.564451] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 4382.564887] Call Trace:
[ 4382.565343] insert_inline_extent_backref+0x55/0xe0 [btrfs]
[ 4382.565796] __btrfs_inc_extent_ref.isra.60+0x88/0x260 [btrfs]
[ 4382.566249] ? __btrfs_run_delayed_refs+0x93/0x1650 [btrfs]
[ 4382.566702] __btrfs_run_delayed_refs+0xa22/0x1650 [btrfs]
[ 4382.567162] btrfs_run_delayed_refs+0x7e/0x1d0 [btrfs]
[ 4382.567623] btrfs_commit_transaction+0x50/0x9c0 [btrfs]
[ 4382.568112] ? _raw_spin_unlock+0x24/0x30
[ 4382.568557] ? block_rsv_release_bytes+0x14e/0x410 [btrfs]
[ 4382.569006] create_subvol+0x3c8/0x830 [btrfs]
[ 4382.569461] ? btrfs_mksubvol+0x317/0x600 [btrfs]
[ 4382.569906] btrfs_mksubvol+0x317/0x600 [btrfs]
[ 4382.570383] ? rcu_sync_lockdep_assert+0xe/0x60
[ 4382.570822] ? __sb_start_write+0xd4/0x1c0
[ 4382.571262] ? mnt_want_write_file+0x24/0x50
[ 4382.571712] btrfs_ioctl_snap_create_transid+0x117/0x1a0 [btrfs]
[ 4382.572155] ? _copy_from_user+0x66/0x90
[ 4382.572602] btrfs_ioctl_snap_create+0x66/0x80 [btrfs]
[ 4382.573052] btrfs_ioctl+0x7c1/0x30e0 [btrfs]
[ 4382.573502] ? mem_cgroup_commit_charge+0x8b/0x570
[ 4382.573946] ? do_raw_spin_unlock+0x49/0xc0
[ 4382.574379] ? _raw_spin_unlock+0x24/0x30
[ 4382.574803] ? __handle_mm_fault+0xf29/0x12d0
[ 4382.575215] ? do_vfs_ioctl+0xa2/0x6f0
[ 4382.575622] ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
[ 4382.576020] do_vfs_ioctl+0xa2/0x6f0
[ 4382.576405] ksys_ioctl+0x70/0x80
[ 4382.576776] __x64_sys_ioctl+0x16/0x20
[ 4382.577137] do_syscall_64+0x60/0x1b0
[ 4382.577488] entry_SYSCALL_64_after_hwframe+0x49/0xbe
(...)
[ 4382.578837] RSP: 002b:00007ffe04bf64c8 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
[ 4382.579174] RAX: ffffffffffffffda RBX: 00005564136f3050 RCX: 00007f5e74724dd7
[ 4382.579505] RDX: 00007ffe04bf64d0 RSI: 000000005000940e RDI: 0000000000000003
[ 4382.579848] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000044
[ 4382.580164] R10: 0000000000000541 R11: 0000000000000202 R12: 00005564136f3010
[ 4382.580477] R13: 0000000000000003 R14: 00005564136f3035 R15: 00005564136f3050
[ 4382.580792] irq event stamp: 0
[ 4382.581106] hardirqs last enabled at (0): [<0000000000000000>] (null)
[ 4382.581441] hardirqs last disabled at (0): [<ffffffff8d085842>] copy_process.part.32+0x6e2/0x2320
[ 4382.581772] softirqs last enabled at (0): [<ffffffff8d085842>] copy_process.part.32+0x6e2/0x2320
[ 4382.582095] softirqs last disabled at (0): [<0000000000000000>] (null)
[ 4382.582413] ---[ end trace d3c188e3e9367382 ]---
[ 4382.623855] BTRFS: error (device sdc) in btrfs_run_delayed_refs:2981: errno=-5 IO failure
[ 4382.624295] BTRFS info (device sdc): forced readonly
Fix this by waiting for writeback to complete after truncating the eof
block.
Fixes: 34a28e3d77 ("Btrfs: use generic_remap_file_range_prep() for cloning and deduplication")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When calling debugfs functions, there is no need to ever check the
return value. The function can work or not, but the code logic should
never do something different based on this.
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Chao Yu <yuchao0@huawei.com>
Cc: linux-f2fs-devel@lists.sourceforge.net
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Type of inject_rate is unsigned int, let's check new value's
validity during configuring.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This reverts c86aa7bbfd
The reverted commit caused ABBA deadlocks when file migration raced with
file eviction for specific hugetlbfs files. This was discovered with a
modified version of the LTP move_pages12 test.
The purpose of the reverted patch was to close a long existing race
between hugetlbfs file truncation and page faults. After more analysis
of the patch and impacted code, it was determined that i_mmap_rwsem can
not be used for all required synchronization. Therefore, revert this
patch while working an another approach to the underlying issue.
Link: http://lkml.kernel.org/r/20190103235452.29335-1-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reported-by: Jan Stancek <jstancek@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Back in 2007 I made what turned out to be a rather serious
mistake in the implementation of the Smack security module.
The SELinux module used an interface in /proc to manipulate
the security context on processes. Rather than use a similar
interface, I used the same interface. The AppArmor team did
likewise. Now /proc/.../attr/current will tell you the
security "context" of the process, but it will be different
depending on the security module you're using.
This patch provides a subdirectory in /proc/.../attr for
Smack. Smack user space can use the "current" file in
this subdirectory and never have to worry about getting
SELinux attributes by mistake. Programs that use the
old interface will continue to work (or fail, as the case
may be) as before.
The proposed S.A.R.A security module is dependent on
the mechanism to create its own attr subdirectory.
The original implementation is by Kees Cook.
Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Fixes gcc '-Wunused-but-set-variable' warning:
fs/f2fs/data.c: In function 'f2fs_dio_submit_bio':
fs/f2fs/data.c:2585:6: warning:
variable 'err' set but not used [-Wunused-but-set-variable]
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The error case of failing allocating memory should
return -ENOMEM.
Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This fixes wrong access of address spaces of node and meta inodes after iput.
Fixes: 60aa4d5536 ("f2fs: fix use-after-free issue when accessing sbi->stat_info")
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Otherwise, we can get wrong counts incurring checkpoint hang.
IO_W (CP: -24, Data: 24, Flush: ( 0 0 1), Discard: ( 0 0))
Thread A Thread B
- f2fs_write_data_pages
- __write_data_page
- f2fs_submit_page_write
- inc_page_count(F2FS_WB_DATA)
type is F2FS_WB_DATA due to file is non-atomic one
- f2fs_ioc_start_atomic_write
- set_inode_flag(FI_ATOMIC_FILE)
- f2fs_write_end_io
- dec_page_count(F2FS_WB_CP_DATA)
type is F2FS_WB_DATA due to file becomes
atomic one
Cc: <stable@vger.kernel.org>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This code is converted to use vmf_error().
Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Introduce a new option abort_on_full, default to false. Then
we can get -ENOSPC when the pool is full, or reaches quota.
[ Don't show abort_on_full in /proc/mounts. ]
Signed-off-by: Dongsheng Yang <dongsheng.yang@easystack.cn>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
It's rude to crash the system just because the developer did something
wrong, as it prevents them from usually even seeing what went wrong.
So convert the few BUG_ON() calls that have snuck into the sysfs code
over the years to WARN_ON() to make it more "friendly". All of these
are able to be recovered from, so it makes no sense to crash.
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAlwyBbEACgkQ8vlZVpUN
gaNrawgAhYWrPwsEFM17dziRWRm8Ub9QgQUK6JRt+vE5KCRRVdXgJSLVH4esW9rJ
X+QQ0diT8ZMKjdbsyz0cVmwP7nqQ5EKzjxts6J8vtbWDB6+nvaDLNdicJgUOprcT
jIi8/45XKmyGUVO9Au6Wdda/zZi4dQBkXd+zUFGWYQRYL0LgmboWHKlaWueu7Qha
xVtavYPSKUSMH8+r1F+HU6P41+1IBiuK4tCwfKfAqJ367Ushzk9xVKHNGrGDAQNi
BTbn4NOOFaYvmVudJbQjD3tHtuQu2JsxlclB5KAtLBm1r3+vb3fMGsNyPBUmNp6Y
YE/xKhACP4kYlk9xCG7vWcWGyTu90g==
=HR7f
-----END PGP SIGNATURE-----
Merge tag 'fscrypt_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/fscrypt
Pull fscrypt updates from Ted Ts'o:
"Add Adiantum support for fscrypt"
* tag 'fscrypt_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/fscrypt:
fscrypt: add Adiantum support
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAlwyAi0ACgkQ8vlZVpUN
gaMtEgf9GyIJ5UnbR3J5+tpOAjx2GFJmTgpinWcfqVBrWwQDTigxiLm5sRIz5ToY
hoWvCIxWm9LgrdS0unYOUNzRyzSZisNAtceowbPErlV5NPS9zcftVt4pRYZ6hIZK
L3wKQ/PdVxIaekP9SXvFx5tfnHSB6CTGOJu1YMlF8ERm4tXUXHIwHzwUwrYqPYN0
5i0uWxbx7qMlEzTf/9sEMYrmdHjsrPlXe0kIP0sMyd7hJl28l0QTNQ2s126fRNLK
wkVMduacGuFGLwqbh7O1QrayWtcni7PKgTW9MfTsjLbg/EWx77auZBTSLfEO+ZKq
2gxxCbM0sID5sgVaw6ku8QJkfiU2fw==
=aQSK
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 bug fixes from Ted Ts'o:
"Fix a number of ext4 bugs"
* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: fix special inode number checks in __ext4_iget()
ext4: track writeback errors using the generic tracking infrastructure
ext4: use ext4_write_inode() when fsyncing w/o a journal
ext4: avoid kernel warning when writing the superblock to a dead device
ext4: fix a potential fiemap/page fault deadlock w/ inline_data
ext4: make sure enough credits are reserved for dioread_nolock writes
Add support for the Adiantum encryption mode to fscrypt. Adiantum is a
tweakable, length-preserving encryption mode with security provably
reducible to that of XChaCha12 and AES-256, subject to a security bound.
It's also a true wide-block mode, unlike XTS. See the paper
"Adiantum: length-preserving encryption for entry-level processors"
(https://eprint.iacr.org/2018/720.pdf) for more details. Also see
commit 059c2a4d8e ("crypto: adiantum - add Adiantum support").
On sufficiently long messages, Adiantum's bottlenecks are XChaCha12 and
the NH hash function. These algorithms are fast even on processors
without dedicated crypto instructions. Adiantum makes it feasible to
enable storage encryption on low-end mobile devices that lack AES
instructions; currently such devices are unencrypted. On ARM Cortex-A7,
on 4096-byte messages Adiantum encryption is about 4 times faster than
AES-256-XTS encryption; decryption is about 5 times faster.
In fscrypt, Adiantum is suitable for encrypting both file contents and
names. With filenames, it fixes a known weakness: when two filenames in
a directory share a common prefix of >= 16 bytes, with CTS-CBC their
encrypted filenames share a common prefix too, leaking information.
Adiantum does not have this problem.
Since Adiantum also accepts long tweaks (IVs), it's also safe to use the
master key directly for Adiantum encryption rather than deriving
per-file keys, provided that the per-file nonce is included in the IVs
and the master key isn't used for any other encryption mode. This
configuration saves memory and improves performance. A new fscrypt
policy flag is added to allow users to opt-in to this configuration.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAlwwD7EACgkQiiy9cAdy
T1HVLwwAkjsxPoGSZD1S+corFexh50UZvEQXbEeuIa70h7nVFxGuONOB3TiOebWm
xz//s6twlduOv+l82riV41W4iEcLT36If7lEl7JX1wVycY6MIm0KwzUqJKnDw60p
zxNEkk6sQ+1wZ8fj6jonDsdtctGoGhNoWkPybyaYkersQFRBFSkPKnpvWF9bXRz0
uqioONLpE2rajUJHv5gLE3dswTBvE2SGvCQfy0ANp3ZBVY5JGjyWlUQqC0riiemY
P541VbJlXmcLeU5COg7Wz45BRlHHYxJVAKc74owS823kMUfHW2Zyae4gWmOhKT0S
86CEYoZR+7nyegggNJr/sIC2ISftz1cr0q9VWeqXJSesNjGJFAbbqdh7qBf1sUMp
SseUPwlhhe6LnfYt5scwj49V/BZQqPcxThBNNXfFQayCWSNUuIPSBQ+euGeGaVJf
j1hhUq28DhbuGcz1464JnI06MCEpSyMAsKHNToK+EZLBH6lu6UcgcA2NHyF8mp/L
IPyJhjTe
=SZtN
-----END PGP SIGNATURE-----
Merge tag '4.21-smb3-small-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull smb3 fixes from Steve French:
"Three fixes, one for stable, one adds the (most secure) SMB3.1.1
dialect to default list requested"
* tag '4.21-smb3-small-fixes' of git://git.samba.org/sfrench/cifs-2.6:
smb3: add smb3.1.1 to default dialect list
cifs: fix confusing warning message on reconnect
smb3: fix large reads on encrypted connections
- Remove a couple of unnecessary local variables.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAlwnwYcACgkQ+H93GTRK
tOu3NQ/9Gc4+xy2zZMAv1O6E3O1NctQgi8LxvzXcpgAcjopYoFXdCfBDrA2hBFG7
bNjzcfC4iw66tzq7Fv7jC00st3R9QOqLrjdpoeSjf07vC4kqG8pOyHTnaQCHhP6j
gqDthDZQb5WhsAT0ytzWxzR0orA2dhCn4UDL8LpYMfVljkexqHUUXHqMCNMr6bxK
q6ICsBiqbiR/xD/GHgzGlsAm8jC9oajm9dSuRP7e0pPGtRz8SrykotI0SdG8w/+d
ujT9Q0PDGGXTSN58o+RDH16XTRVTTQCY8JY0X/sF4n6nH0wjUwqja50NipOzjGXD
nA+e0GEPeFYFo2mrmkQDqvPg2AkCHoEwtw0Pt+Kkqiu0/a3etHpCYEDsJfvW6CAT
Wrz1dAb4mCIAUmK0VkL0aIBiS6IDAwuWQ/CMIJA2AkVs0qZAzMTURLKEa36+qAEr
gjERnoQst1s/mrrTCfdC+1k+yBI/RrpEhoyV1I9S48SkiXHYlCcub/oLAyc7Pb5L
aeLrsRz/n6kGZJBM3tvW5kJglDHBcRjqrV+JinVt0zOOQHGZem65qPTxfdmwUFXo
46ZsEoJxw7eq4V/hXKAzmnmlRFlTbwcoouhimZUqjjiIQRmZ7GDQdXDYwLoT5pTv
wHZKmynAuVIBzkLcoyKdGPZ4znRftPxbI+NI9CqCzwuS+VHfBIc=
=gXVU
-----END PGP SIGNATURE-----
Merge tag 'xfs-4.21-merge-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs fixlets from Darrick Wong:
"Remove a couple of unnecessary local variables"
* tag 'xfs-4.21-merge-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: xfs_fsops: drop useless LIST_HEAD
xfs: xfs_buf: drop useless LIST_HEAD
from myself and a few cap handling fixes from Zheng.
-----BEGIN PGP SIGNATURE-----
iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAlwuI7ATHGlkcnlvbW92
QGdtYWlsLmNvbQAKCRBKf944AhHzizcvB/9GqpAzR+Yy1iIQGNeijPSeuXsrlcQF
WErfaG8tUwZY3vqv3+OSZBwuMgq6wAyCo3wJmh0GCZoy02WLJbPB/G8AiHtoZUAh
wAWfL8feZkzx3L7JV0OrPG0GGYkhKu5PebM4rq3cXvlL0OiTKPs8bmbTvh0mSv3z
gH1odW0j2mAb1/3tqm9M5+7XhrGSnmSfA028NeKx6I4nE0ONd9BEcHZDoRBBQeNf
tgyxH4IJuuQ+x4/FKIn6+hBbMYiVrTBlz4wQHrJvvzDUeCkWu+E8JZ4utxxNdfmS
uGsPDRqi4LSMwt1q0HLHhkCP0lg5yf9NByGoy+VH5/gS8ma6be9+IbfX
=puaN
-----END PGP SIGNATURE-----
Merge tag 'ceph-for-4.21-rc1' of git://github.com/ceph/ceph-client
Pull ceph updates from Ilya Dryomov:
"A fairly quiet round: a couple of messenger performance improvements
from myself and a few cap handling fixes from Zheng"
* tag 'ceph-for-4.21-rc1' of git://github.com/ceph/ceph-client:
ceph: don't encode inode pathes into reconnect message
ceph: update wanted caps after resuming stale session
ceph: skip updating 'wanted' caps if caps are already issued
ceph: don't request excl caps when mount is readonly
ceph: don't update importing cap's mseq when handing cap export
libceph: switch more to bool in ceph_tcp_sendmsg()
libceph: use MSG_SENDPAGE_NOTLAST with ceph_tcp_sendpage()
libceph: use sock_no_sendpage() as a fallback in ceph_tcp_sendpage()
libceph: drop last_piece logic from write_partial_message_data()
ceph: remove redundant assignment
ceph: cleanup splice_dentry()
Pull vfs mount API prep from Al Viro:
"Mount API prereqs.
Mostly that's LSM mount options cleanups. There are several minor
fixes in there, but nothing earth-shattering (leaks on failure exits,
mostly)"
* 'mount.part1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (27 commits)
mount_fs: suppress MAC on MS_SUBMOUNT as well as MS_KERNMOUNT
smack: rewrite smack_sb_eat_lsm_opts()
smack: get rid of match_token()
smack: take the guts of smack_parse_opts_str() into a new helper
LSM: new method: ->sb_add_mnt_opt()
selinux: rewrite selinux_sb_eat_lsm_opts()
selinux: regularize Opt_... names a bit
selinux: switch away from match_token()
selinux: new helper - selinux_add_opt()
LSM: bury struct security_mnt_opts
smack: switch to private smack_mnt_opts
selinux: switch to private struct selinux_mnt_opts
LSM: hide struct security_mnt_opts from any generic code
selinux: kill selinux_sb_get_mnt_opts()
LSM: turn sb_eat_lsm_opts() into a method
nfs_remount(): don't leak, don't ignore LSM options quietly
btrfs: sanitize security_mnt_opts use
selinux; don't open-code a loop in sb_finish_set_opts()
LSM: split ->sb_set_mnt_opts() out of ->sb_kern_mount()
new helper: security_sb_eat_lsm_opts()
...
Pull trivial vfs updates from Al Viro:
"A few cleanups + Neil's namespace_unlock() optimization"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
exec: make prepare_bprm_creds static
genheaders: %-<width>s had been there since v6; %-*s - since v7
VFS: use synchronize_rcu_expedited() in namespace_unlock()
iov_iter: reduce code duplication
Merge more updates from Andrew Morton:
- procfs updates
- various misc bits
- lib/ updates
- epoll updates
- autofs
- fatfs
- a few more MM bits
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (58 commits)
mm/page_io.c: fix polled swap page in
checkpatch: add Co-developed-by to signature tags
docs: fix Co-Developed-by docs
drivers/base/platform.c: kmemleak ignore a known leak
fs: don't open code lru_to_page()
fs/: remove caller signal_pending branch predictions
mm/: remove caller signal_pending branch predictions
arch/arc/mm/fault.c: remove caller signal_pending_branch predictions
kernel/sched/: remove caller signal_pending branch predictions
kernel/locking/mutex.c: remove caller signal_pending branch predictions
mm: select HAVE_MOVE_PMD on x86 for faster mremap
mm: speed up mremap by 20x on large regions
mm: treewide: remove unused address argument from pte_alloc functions
initramfs: cleanup incomplete rootfs
scripts/gdb: fix lx-version string output
kernel/kcov.c: mark write_comp_data() as notrace
kernel/sysctl: add panic_print into sysctl
panic: add options to print system info when panic happens
bfs: extra sanity checking and static inode bitmap
exec: separate MM_ANONPAGES and RLIMIT_STACK accounting
...
Multiple filesystems open code lru_to_page(). Rectify this by moving
the macro from mm_inline (which is specific to lru stuff) to the more
generic mm.h header and start using the macro where appropriate.
No functional changes.
Link: http://lkml.kernel.org/r/20181129104810.23361-1-nborisov@suse.com
Link: https://lkml.kernel.org/r/20181129075301.29087-1-nborisov@suse.com
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Pankaj gupta <pagupta@redhat.com>
Acked-by: "Yan, Zheng" <zyan@redhat.com> [ceph]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is already done for us internally by the signal machinery.
[akpm@linux-foundation.org: fix fs/buffer.c]
Link: http://lkml.kernel.org/r/20181116002713.8474-7-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dave@stgolabs.net>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Strengthen validation of BFS superblock against corruption. Make
in-core inode bitmap static part of superblock info structure. Print a
warning when mounting a BFS filesystem created with "-N 512" option as
only 510 files can be created in the root directory. Make the kernel
messages more uniform. Update the 'prefix' passed to bfs_dump_imap() to
match the current naming of operations. White space and comments
cleanup.
Link: http://lkml.kernel.org/r/CAK+_RLkFZMduoQF36wZFd3zLi-6ZutWKsydjeHFNdtRvZZEb4w@mail.gmail.com
Signed-off-by: Tigran Aivazian <aivazian.tigran@gmail.com>
Reported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
get_arg_page() checks bprm->rlim_stack.rlim_cur and re-calculates the
"extra" size for argv/envp pointers every time, this is a bit ugly and
even not strictly correct: acct_arg_size() must not account this size.
Remove all the rlimit code in get_arg_page(). Instead, add bprm->argmin
calculated once at the start of __do_execve_file() and change
copy_strings to check bprm->p >= bprm->argmin.
The patch adds the new helper, prepare_arg_pages() which initializes
bprm->argc/envc and bprm->argmin.
[oleg@redhat.com: fix !CONFIG_MMU version of get_arg_page()]
Link: http://lkml.kernel.org/r/20181126122307.GA1660@redhat.com
[akpm@linux-foundation.org: use max_t]
Link: http://lkml.kernel.org/r/20181112160910.GA28440@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Kees Cook <keescook@chromium.org>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
load_script() simply truncates bprm->buf and this is very wrong if the
length of shebang string exceeds BINPRM_BUF_SIZE-2. This can silently
truncate i_arg or (worse) we can execute the wrong binary if buf[2:126]
happens to be the valid executable path.
Change load_script() to return ENOEXEC if it can't find '\n' or zero in
bprm->buf. Note that '\0' can come from either
prepare_binprm()->memset() or from kernel_read(), we do not care.
Link: http://lkml.kernel.org/r/20181112160931.GA28463@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Ben Woodard <woodard@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch introduces 3 new inline functions - is_fat12, is_fat16 and
is_fat32, and replaces every occurrence in the code in which the FS
variant (whether this is FAT12, FAT16 or FAT32) was previously checked
using msdos_sb_info->fat_bits.
Link: http://lkml.kernel.org/r/1544990640-11604-4-git-send-email-carmeli.tamir@gmail.com
Signed-off-by: Carmeli Tamir <carmeli.tamir@gmail.com>
Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
MAX_FAT is useless in msdos_fs.h, since it uses the MSDOS_SB function
that is defined in fat.h. So really, this macro can be only called from
code that already includes fat.h.
Hence, this patch moves it to fat.h, right after MSDOS_SB is defined. I
also changed it to an inline function in order to save the double call
to MSDOS_SB. This was suggested by joe@perches.com in the previous
version.
This patch is required for the next in the series, in which the variant
(whether this is FAT12, FAT16 or FAT32) checks are replaced with new
macros.
Link: http://lkml.kernel.org/r/1544990640-11604-3-git-send-email-carmeli.tamir@gmail.com
Signed-off-by: Carmeli Tamir <carmeli.tamir@gmail.com>
Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The comment edited in this patch was the only reference to the
FAT_FIRST_ENT macro, which is not used anymore. Moreover, the commented
line of code does not compile with the current code.
Since the FAT_FIRST_ENT macro checks the FAT variant in a way that the
patch series changes, I removed it, and instead wrote a clear
explanation of what was checked.
I verified that the changed comment is correct according to Microsoft
FAT spec, search for "BPB_Media" in the following references:
1. Microsoft FAT specification 2005
(http://read.pudn.com/downloads77/ebook/294884/FAT32%20Spec%20%28SDA%20Contribution%29.pdf).
Search for 'volume label'.
2. Microsoft Extensible Firmware Initiative, FAT32 File System Specification
(https://staff.washington.edu/dittrich/misc/fatgen103.pdf).
Search for 'volume label'.
Link: http://lkml.kernel.org/r/1544990640-11604-2-git-send-email-carmeli.tamir@gmail.com
Signed-off-by: Carmeli Tamir <carmeli.tamir@gmail.com>
Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The immutable, append-only and no-dump attributes can only be retrieved
with an ioctl; implement the ->getattr() method to return them on statx.
Do not return the inode birthtime yet, because the issue of how best to
handle the post-2038 timestamps is still under discussion.
This patch is needed to pass xfstests generic/424.
Link: http://lkml.kernel.org/r/20181014163558.sxorxlzjqccq2lpw@eaf
Signed-off-by: Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.com>
Cc: Viacheslav Dubeyko <slava@dubeyko.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 092a53452b ("autofs: take more care to not update last_used on
path walk") helped to (partially) resolve a problem where automounts
were not expiring due to aggressive accesses from user space.
This patch was later reverted because, for very large environments, it
meant more mount requests from clients and when there are a lot of
clients this caused a fairly significant increase in server load.
But there is a need for both types of expire check, depending on use
case, so add a mount option to allow for strict update of last use of
autofs dentrys (which just means not updating the last use on path walk
access).
Link: http://lkml.kernel.org/r/154296973880.9889.14085372741514507967.stgit@pluto-themaw-net
Signed-off-by: Ian Kent <raven@themaw.net>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Change the superblock info. catatonic setting to be part of a flags bit
field.
Link: http://lkml.kernel.org/r/154296973142.9889.17275721668508589639.stgit@pluto-themaw-net
Signed-off-by: Ian Kent <raven@themaw.net>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The parse_options() function uses a long list of parameters, most of
which are present in the super block info structure already.
The mount parameters set in parse_options() options don't require
cleanup so using the super block info struct directly is simpler.
Link: http://lkml.kernel.org/r/154296972423.9889.9368859245676473329.stgit@pluto-themaw-net
Signed-off-by: Ian Kent <raven@themaw.net>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Al Viro made some suggestions to improve the implementation of commit
0633da48f0 ("fix autofs_sbi() does not check super block type").
The check is unnecessary in all cases except for ioctl usage so placing
the check in the super block accessor function adds a small overhead to
the common case where it isn't needed.
So it's sufficient to do this in the ioctl code only.
Also the check in the ioctl code is needlessly complex.
[akpm@linux-foundation.org: declare autofs_fs_type in .h, not .c]
Link: http://lkml.kernel.org/r/154296970987.9889.1597442413573683096.stgit@pluto-themaw-net
Signed-off-by: Ian Kent <raven@themaw.net>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There is no reason why we rearm the waitiqueue upon every fetch_events
retry (for when events are found yet send_events() fails). If nothing
else, this saves four lock operations per retry, and furthermore reduces
the scope of the lock even further.
[akpm@linux-foundation.org: restore code to original position, fix and reflow comment]
Link: http://lkml.kernel.org/r/20181114182532.27981-2-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It is currently called check_events because it, well, did exactly that.
However, since the lockless ep_events_available() call, the label no
longer checks, but just sends the events. Rename as such.
Link: http://lkml.kernel.org/r/20181114182532.27981-1-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Jason Baron <jbaron@akamai.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Upon timeout, we can just exit out of the loop, without the cost of the
changing the task's state with an smp_store_mb call. Just exit out of
the loop and be done - setting the task state afterwards will be, of
course, redundant.
[dave@stgolabs.net: forgotten fixlets]
Link: http://lkml.kernel.org/r/20181109155258.jxcr4t2pnz6zqct3@linux-r8p5
Link: http://lkml.kernel.org/r/20181108051006.18751-7-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dave@stgolabs.net>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: Jason Baron <jbaron@akamai.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch aims at reducing ep wq.lock hold times in epoll_wait(2). For
the blocking case, there is no need to constantly take and drop the
spinlock, which is only needed to manipulate the waitqueue.
The call to ep_events_available() is now lockless, and only exposed to
benign races. Here, if false positive (returns available events and
does not see another thread deleting an epi from the list) we call into
send_events and then the list's state is correctly seen. Otoh, if a
false negative and we don't see a list_add_tail(), for example, from irq
callback, then it is rechecked again before blocking, which will see the
correct state.
In order for more accuracy to see concurrent list_del_init(), use the
list_empty_careful() variant -- of course, this won't be safe against
insertions from wakeup.
For the overflow list we obviously need to prevent load/store tearing as
we don't want to see partial values while the ready list is disabled.
[dave@stgolabs.net: forgotten fixlets]
Link: http://lkml.kernel.org/r/20181109155258.jxcr4t2pnz6zqct3@linux-r8p5
Link: http://lkml.kernel.org/r/20181108051006.18751-6-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Suggested-by: Jason Baron <jbaron@akamai.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Insted of just commenting how important it is, lets make it more robust
and add a lockdep_assert_held() call.
Link: http://lkml.kernel.org/r/20181108051006.18751-5-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Jason Baron <jbaron@akamai.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The ep->ovflist is a secondary ready-list to temporarily store events
that might occur when doing sproc without holding the ep->wq.lock. This
accounts for every time we check for ready events and also send events
back to userspace; both callbacks, particularly the latter because of
copy_to_user, can account for a non-trivial time.
As such, the unlikely() check to see if the pointer is being used, seems
both misleading and sub-optimal. In fact, we go to an awful lot of
trouble to sync both lists, and populating the ovflist is far from an
uncommon scenario.
For example, profiling a concurrent epoll_wait(2) benchmark, with
CONFIG_PROFILE_ANNOTATED_BRANCHES shows that for a two threads a 33%
incorrect rate was seen; and when incrementally increasing the number of
epoll instances (which is used, for example for multiple queuing load
balancing models), up to a 90% incorrect rate was seen.
Similarly, by deleting the prediction, 3% throughput boost was seen
across incremental threads.
Link: http://lkml.kernel.org/r/20181108051006.18751-4-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Jason Baron <jbaron@akamai.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The current logic is a bit convoluted. Lets simplify this with a
standard list_for_each_entry_safe() loop instead and just break out
after maxevents is reached.
While at it, remove an unnecessary indentation level in the loop when
there are in fact ready events.
Link: http://lkml.kernel.org/r/20181108051006.18751-3-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Jason Baron <jbaron@akamai.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "epoll: some miscellaneous optimizations".
The following are some incremental optimizations on some of the epoll
core. Each patch has the details, but together, the series is seen to
shave off measurable cycles on a number of systems and workloads.
For example, on a 40-core IB, a pipetest as well as parallel
epoll_wait() benchmark show around a 20-30% increase in raw operations
per second when the box is fully occupied (incremental thread counts),
and up to 15% performance improvement with lower counts.
Passes ltp epoll related testcases.
This patch(of 6):
All callers pass the EP_MAX_NESTS constant already, so lets simplify
this a tad and get rid of the redundant parameter for nested eventpolls.
Link: http://lkml.kernel.org/r/20181108051006.18751-2-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Jason Baron <jbaron@akamai.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Header of /proc/*/limits is a fixed string, so print it directly without
formatting specifiers.
Link: http://lkml.kernel.org/r/20181203164242.GB6904@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
name_to_int() is defined in fs/proc/util.c and declared in
fs/proc/internal.h, but the declaration isn't included at the point of
the definition. Include the header to enforce that the definition
matches the declaration.
This addresses a gcc warning when -Wmissing-prototypes is enabled.
Link: http://lkml.kernel.org/r/20181115001833.49371-1-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Access to timerslack_ns is controlled by a process having CAP_SYS_NICE
in its effective capability set, but the current check looks in the root
namespace instead of the process' user namespace. Since a process is
allowed to do other activities controlled by CAP_SYS_NICE inside a
namespace, it should also be able to adjust timerslack_ns.
Link: http://lkml.kernel.org/r/20181030180012.232896-1-bmgordon@google.com
Signed-off-by: Benjamin Gordon <bmgordon@google.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: "Serge E. Hallyn" <serge@hallyn.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Oren Laadan <orenl@cellrox.com>
Cc: Ruchi Kandoi <kandoiruchi@google.com>
Cc: Rom Lemarchand <romlem@android.com>
Cc: Todd Kjos <tkjos@google.com>
Cc: Colin Cross <ccross@android.com>
Cc: Nick Kralevich <nnk@google.com>
Cc: Dmitry Shmidt <dimitrysh@google.com>
Cc: Elliott Hughes <enh@google.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Nobody has actually used the type (VERIFY_READ vs VERIFY_WRITE) argument
of the user address range verification function since we got rid of the
old racy i386-only code to walk page tables by hand.
It existed because the original 80386 would not honor the write protect
bit when in kernel mode, so you had to do COW by hand before doing any
user access. But we haven't supported that in a long time, and these
days the 'type' argument is a purely historical artifact.
A discussion about extending 'user_access_begin()' to do the range
checking resulted this patch, because there is no way we're going to
move the old VERIFY_xyz interface to that model. And it's best done at
the end of the merge window when I've done most of my merges, so let's
just get this done once and for all.
This patch was mostly done with a sed-script, with manual fix-ups for
the cases that weren't of the trivial 'access_ok(VERIFY_xyz' form.
There were a couple of notable cases:
- csky still had the old "verify_area()" name as an alias.
- the iter_iov code had magical hardcoded knowledge of the actual
values of VERIFY_{READ,WRITE} (not that they mattered, since nothing
really used it)
- microblaze used the type argument for a debug printout
but other than those oddities this should be a total no-op patch.
I tried to fix up all architectures, did fairly extensive grepping for
access_ok() uses, and the changes are trivial, but I may have missed
something. Any missed conversion should be trivially fixable, though.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJcLfFMAAoJEAAOaEEZVoIVjFAP/33l7xDEbrYRMUuzjcGQEHW0
aOblkkt+Mo8yuSEgsy0yOcfi0NXuPMdVSEp5nqgwI9YsSULhpDEai/ogP86kb2f7
xdo9H43cPMZS+pDx/avRyICjBYGZE4qwpSEs0d2zSHNT0dCX5Asuughb0tRtfOwo
TIUw+8xz3lrUf0+4XAgFHOZ1/1vvC/hlXmHL61e9ZBijafb7t0qbIbIkbmnQp4vN
0njRpcRck1i1o6FhsVcC8mm+gL/Fu4uXbK3CtkInzMom6M3ecm2FU/Es3yL5qzdV
UMBwU+G+RLAZnkZ3dGUv89gwQIDaZk43g/lQJLTGv46eY8ghx4kBeT78Iaooav/o
Ml0ormpc71/iMHRQ8iMXMkhSdx8uKhp/TL+M8nx1UM3E/VloQj8SNMJi2jhnQA5h
mkNlZGSIERgr8XZF5uRHQa/XFQb5FUAMlfa4wE4sTC4urkAOwlZDQ+qENFEuikkP
EjNquVjCtLp8BW58WNXyaIv1gc/6A0ZIo2v/TXxcXM6hiH+U4GW72FeDXSDkPmqN
1rGgMI6blKpE4JmHSJjaHJP2Lo8TsyX6k64z3cV0fPigmAAAQQC+Mhtl8GkL670k
L4sRxXh0HRR2H7HLrW/nbnbL4A3F8U7a1QmpXu4Be/FetUSbRTx3FOiiS6eyIXWu
dPoocH9cfRcGXQ+gYg8Z
=MagU
-----END PGP SIGNATURE-----
Merge tag 'locks-v4.21-2' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux
Pull file locking bugfix from Jeff Layton:
"This is a one-line fix for a bug that syzbot turned up in the new
patches to mitigate the thundering herd when a lock is released"
* tag 'locks-v4.21-2' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
locks: fix error in locks_move_blocks()
SMB3.1.1 dialect has additional security (among other) features
and should be requested when mounting to modern servers so it
can be used if the server supports it.
Add SMB3.1.1 to the default list of dialects requested.
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
When DFS is not used on the mount we should not be mentioning
DFS in the warning message on reconnect (it could be confusing).
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
When passing a large read to receive_encrypted_read(), ensure that the
demultiplex_thread knows that a MID was processed. Without this, those
operations never complete.
This is a similar issue/fix to lease break handling:
commit 7af929d6d0
("smb3: fix lease break problem introduced by compounding")
CC: Stable <stable@vger.kernel.org> # 4.19+
Fixes: b24df3e30c ("cifs: update receive_encrypted_standard to handle compounded responses")
Signed-off-by: Paul Aurich <paul@darkrain42.org>
Tested-by: Yves-Alexis Perez <corsac@corsac.net>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
After moving all requests from
fl->fl_blocked_requests
to
new->fl_blocked_requests
it is nonsensical to do anything to all the remaining elements, there
aren't any. This should do something to all the requests that have been
moved. For simplicity, it does it to all requests in the target list.
Setting "f->fl_blocker = new" to all members of new->fl_blocked_requests
is "obviously correct" as it preserves the invariant of the linkage
among requests.
Reported-by: syzbot+239d99847eb49ecb3899@syzkaller.appspotmail.com
Fixes: 5946c4319e ("fs/locks: allow a lock request to block other requests.")
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Note that there is a conflict with the rdma tree in this pull request, since
we delete a file that has been changed in the rdma tree. Hopefully that's
easy enough to resolve!
We also were unable to track down a maintainer for Neil Brown's changes to
the generic cred code that are prerequisites to his RPC cred cleanup patches.
We've been asking around for several months without any response, so
hopefully it's okay to include those patches in this pull request.
Stable bugfixes:
- xprtrdma: Yet another double DMA-unmap # v4.20
Features:
- Allow some /proc/sys/sunrpc entries without CONFIG_SUNRPC_DEBUG
- Per-xprt rdma receive workqueues
- Drop support for FMR memory registration
- Make port= mount option optional for RDMA mounts
Other bugfixes and cleanups:
- Remove unused nfs4_xdev_fs_type declaration
- Fix comments for behavior that has changed
- Remove generic RPC credentials by switching to 'struct cred'
- Fix crossing mountpoints with different auth flavors
- Various xprtrdma fixes from testing and auditing the close code
- Fixes for disconnect issues when using xprtrdma with krb5
- Clean up and improve xprtrdma trace points
- Fix NFS v4.2 async copy reboot recovery
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAlwtO50ACgkQ18tUv7Cl
QOtZWQ//e5Hhp2TnQZ6U+99YKedjwBHP6psH3GKSEdeHSNdlSpZ5ckgHxvMb9TBa
6t4ecgv5P/uYLIePQ0u2ubUFc9+TlyGi7Iacx13/YhK7kihGHDPnZhfl0QbYixV7
rwa9bFcKmOrXs8ld+Hw3P2UL22G1gMf/LHDhPNshbW7LFZmcshKz+mKTk70kwkq9
v7tFC59p6GwV8Sr2YI2NXn2fOWsUS00sQfgj2jceJYJ8PsNa+wHYF4wPj2IY5NsE
D5Oq2kLPbytBhCllOHgopNZaf4qb5BfqhVETyc1O+kDF3BZKUhQ1PoDi2FPinaHM
5/d8hS+5fr3eMBsQrPWQLXYjWQFUXnkQQJvU3Bo52AIgomsk/8uBq3FvH7XmFcBd
C8sgnuUAkAS8feMes8GCS50BTxclnGuYGdyFJyCRXoG9Kn9rMrw9EKitky6EVq0v
NmXhW79jK84a3yDXVlAIpZ8Y9BU/HQ3GviGX8lQEdZU9YiYRzDIHvpMFwzMgqaBi
XvLbr8PlLOm8GZokThS8QYT/G2Wu6IwfUq/AufVjVD4+HiL3duKKfWSGAvcm6aAa
GoRF6UG+OmjWlzKojtRc1dI+sy22Fzh+DW+Mx6tuf/b/66wkmYnW7eKcV4rt6Tm5
/JEhvTMo9q7elL/4FgCoMCcdoc5eXqQyXRXrQiOU7YHLzn2aWU0=
=DvVW
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-4.21-1' of git://git.linux-nfs.org/projects/anna/linux-nfs
Pull NFS client updates from Anna Schumaker:
"Stable bugfixes:
- xprtrdma: Yet another double DMA-unmap # v4.20
Features:
- Allow some /proc/sys/sunrpc entries without CONFIG_SUNRPC_DEBUG
- Per-xprt rdma receive workqueues
- Drop support for FMR memory registration
- Make port= mount option optional for RDMA mounts
Other bugfixes and cleanups:
- Remove unused nfs4_xdev_fs_type declaration
- Fix comments for behavior that has changed
- Remove generic RPC credentials by switching to 'struct cred'
- Fix crossing mountpoints with different auth flavors
- Various xprtrdma fixes from testing and auditing the close code
- Fixes for disconnect issues when using xprtrdma with krb5
- Clean up and improve xprtrdma trace points
- Fix NFS v4.2 async copy reboot recovery"
* tag 'nfs-for-4.21-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (63 commits)
sunrpc: convert to DEFINE_SHOW_ATTRIBUTE
sunrpc: Add xprt after nfs4_test_session_trunk()
sunrpc: convert unnecessary GFP_ATOMIC to GFP_NOFS
sunrpc: handle ENOMEM in rpcb_getport_async
NFS: remove unnecessary test for IS_ERR(cred)
xprtrdma: Prevent leak of rpcrdma_rep objects
NFSv4.2 fix async copy reboot recovery
xprtrdma: Don't leak freed MRs
xprtrdma: Add documenting comment for rpcrdma_buffer_destroy
xprtrdma: Replace outdated comment for rpcrdma_ep_post
xprtrdma: Update comments in frwr_op_send
SUNRPC: Fix some kernel doc complaints
SUNRPC: Simplify defining common RPC trace events
NFS: Fix NFSv4 symbolic trace point output
xprtrdma: Trace mapping, alloc, and dereg failures
xprtrdma: Add trace points for calls to transport switch methods
xprtrdma: Relocate the xprtrdma_mr_map trace points
xprtrdma: Clean up of xprtrdma chunk trace points
xprtrdma: Remove unused fields from rpcrdma_ia
xprtrdma: Cull dprintk() call sites
...
NFSv4.2 client, and cleaning up some convoluted backchannel server code
in the process. Otherwise, miscellaneous smaller bugfixes and cleanup.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJcLR3nAAoJECebzXlCjuG+oyAQALrPSTH9Qg2AwP2eGm+AevUj
u/VFmimImIO9dYuT02t4w42w4qMIQ0/Y7R0UjT3DxG5Oixy/zA+ZaNXCCEKwSMIX
abGF4YalUISbDc6n0Z8J14/T33wDGslhy3IQ9Jz5aBCDCocbWlzXvFlmrowbb3ak
vtB0Fc3Xo6Z/Pu2GzNzlqR+f69IAmwQGJrRrAEp3JUWSIBKiSWBXTujDuVBqJNYj
ySLzbzyAc7qJfI76K635XziULR2ueM3y5JbPX7kTZ0l3OJ6Yc0PtOj16sIv5o0XK
DBYPrtvw3ZbxQE/bXqtJV9Zn6MG5ODGxKszG1zT1J3dzotc9l/LgmcAY8xVSaO+H
QNMdU9QuwmyUG20A9rMoo/XfUb5KZBHzH7HIYOmkfBidcaygwIInIKoIDtzimm4X
OlYq3TL/3QDY6rgTCZv6n2KEnwiIDpc5+TvFhXRWclMOJMcMSHJfKFvqERSv9V3o
90qrCebPA0K8Dnc0HMxcBXZ+0TqZ2QeXp/wfIjibCXqMwlg+BZhmbeA0ngZ7x7qf
2F33E9bfVJjL+VI5FcVYQf43bOTWZgD6ZKGk4T7keYl0CPH+9P70bfhl4KKy9dqc
GwYooy/y5FPb2CvJn/EETeILRJ9OyIHUrw7HBkpz9N8n9z+V6Qbp9yW7LKgaMphW
1T+GpHZhQjwuBPuJhDK0
=dRLp
-----END PGP SIGNATURE-----
Merge tag 'nfsd-4.21' of git://linux-nfs.org/~bfields/linux
Pull nfsd updates from Bruce Fields:
"Thanks to Vasily Averin for fixing a use-after-free in the
containerized NFSv4.2 client, and cleaning up some convoluted
backchannel server code in the process.
Otherwise, miscellaneous smaller bugfixes and cleanup"
* tag 'nfsd-4.21' of git://linux-nfs.org/~bfields/linux: (25 commits)
nfs: fixed broken compilation in nfs_callback_up_net()
nfs: minor typo in nfs4_callback_up_net()
sunrpc: fix debug message in svc_create_xprt()
sunrpc: make visible processing error in bc_svc_process()
sunrpc: remove unused xpo_prep_reply_hdr callback
sunrpc: remove svc_rdma_bc_class
sunrpc: remove svc_tcp_bc_class
sunrpc: remove unused bc_up operation from rpc_xprt_ops
sunrpc: replace svc_serv->sv_bc_xprt by boolean flag
sunrpc: use-after-free in svc_process_common()
sunrpc: use SVC_NET() in svcauth_gss_* functions
nfsd: drop useless LIST_HEAD
lockd: Show pid of lockd for remote locks
NFSD remove OP_CACHEME from 4.2 op_flags
nfsd: Return EPERM, not EACCES, in some SETATTR cases
sunrpc: fix cache_head leak due to queued request
nfsd: clean up indentation, increase indentation in switch statement
svcrdma: Optimize the logic that selects the R_key to invalidate
nfsd: fix a warning in __cld_pipe_upcall()
nfsd4: fix crash on writing v4_end_grace before nfsd startup
...
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAlwqtf4ACgkQiiy9cAdy
T1GLwAv+I4MaCe5oq/IHDZnr09Mb/sIRLqLXnMWJciRHedHFIa/x2egb+584M+bf
Lrb3UjDyS4aXV8cjrm4XO8zzzvQkTRLtaJrlxo/b1oDZJ8JkH2M6EeNr5gAB6qso
dbmUX59YMX8KSpmQMhigcv+ilOQdokDWVdxqZ2ezbEMeVMotkQOnhrcSiJPx05QS
CRktWjSn7JKD87cj8i0dTX+txBPX9iIpYQJGWdbJa2n6V8mQkx9JPgyQCC/FwKF2
TzCXl7wfn1gTnFSxCa/sq7lnYAr6xCngbFi+pgVU+O/Aw0dyW3AoKfF7hBOo+gAH
ZJALnvhb8pJmKolXFt7OKQKuOoJSq8MInsjKSKgSe0Xt1yHEtm7IJPy6Kbj3zKVy
TuDq1KXstB5m3uwO3QBmzGxZ7rCB4B1w1cGjn8MFcpK4+tOxtmSvIeYuzEj9Vxet
5JFZzMICFyzedyuBaRxyEX8SKH7CxOXCiDajxLsp7GI8KN1i0skzjgpbmZ/tdRbB
kHaPnRdU
=rYS/
-----END PGP SIGNATURE-----
Merge tag '4.21-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs updates from Steve French:
- four fixes for stable
- improvements to DFS including allowing failover to alternate targets
- some small performance improvements
* tag '4.21-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6: (39 commits)
cifs: update internal module version number
cifs: we can not use small padding iovs together with encryption
cifs: Minor Kconfig clarification
cifs: Always resolve hostname before reconnecting
cifs: Add support for failover in cifs_reconnect_tcon()
cifs: Add support for failover in smb2_reconnect()
cifs: Only free DFS target list if we actually got one
cifs: start DFS cache refresher in cifs_mount()
cifs: Use GFP_ATOMIC when a lock is held in cifs_mount()
cifs: Add support for failover in cifs_reconnect()
cifs: Add support for failover in cifs_mount()
cifs: remove set but not used variable 'sep'
cifs: Make use of DFS cache to get new DFS referrals
cifs: minor updates to documentation
cifs: check kzalloc return
cifs: remove set but not used variable 'server'
cifs: Use kzfree() to free password
cifs: Fix to use kmem_cache_free() instead of kfree()
cifs: update for current_kernel_time64() removal
cifs: Add DFS cache routines
...
This mostly reverts commit 849a370016 ("block: avoid ordered task
state change for polled IO"). It was wrongly claiming that the ordering
wasn't necessary. The memory barrier _is_ necessary.
If something is truly polling and not going to sleep, it's the whole
state setting that is unnecessary, not the memory barrier. Whenever you
set your state to a sleeping state, you absolutely need the memory
barrier.
Note that sometimes the memory barrier can be elsewhere. For example,
the ordering might be provided by an external lock, or by setting the
process state to sleeping before adding yourself to the wait queue list
that is used for waking up (where the wait queue lock itself will
guarantee that any wakeup will correctly see the sleeping state).
But none of those cases were true here.
NOTE! Some of the polling paths may indeed be able to drop the state
setting entirely, at which point the memory barrier also goes away.
(Also note that this doesn't revert the TASK_RUNNING cases: there is no
race between a wakeup and setting the process state to TASK_RUNNING,
since the end result doesn't depend on ordering).
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 4d97f7d53d ("inotify: Add flag IN_MASK_CREATE for
inotify_add_watch()") forgot to call fdput() before bailing out.
Fixes: 4d97f7d53d ("inotify: Add flag IN_MASK_CREATE for inotify_add_watch()")
CC: stable@vger.kernel.org
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Multipathing: In case of NFSv3, rpc_clnt_test_and_add_xprt() adds
the xprt to xprt switch (i.e. xps) if rpc_call_null_helper() returns
success. But in case of NFSv4.1, it needs to do EXCHANGEID to verify
the path along with check for session trunking.
Add the xprt in nfs4_test_session_trunk() only when
nfs4_detect_session_trunking() returns success. Also release refcount
hold by rpc_clnt_setup_test_and_add_xprt().
Signed-off-by: Santosh kumar pradhan <santoshkumar.pradhan@wdc.com>
Tested-by: Suresh Jayaraman <suresh.jayaraman@wdc.com>
Reported-by: Aditya Agnihotri <aditya.agnihotri@wdc.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
As gte_current_cred() cannot return an error,
this test is not necessary.
It hasn't been necessary for years, but it wasn't so obvious
before.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Original commit (e4648aa4f9 "NFS recover from destination server
reboot for copies") used memcmp() and then it was changed to use
nfs4_stateid_match_other() but that function returns opposite of
memcmp. As the result, recovery can't find the copy leading
to copy hanging.
Fixes: 80f4236886 ("NFSv4: Split out NFS v4.2 copy completion functions")
Fixes: cb7a8384dc ("NFS: Split out the body of nfs4_reclaim_open_state")
Signed-of-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
These symbolic values were not being displayed in string form.
TRACE_DEFINE_ENUM was missing in many cases. It also turns out that
__print_symbolic wants an unsigned long in the first field...
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Having to specify "proto=rdma,port=20049" is cumbersome.
RFC 8267 Section 6.3 requires NFSv4 clients to use "the alternative
well-known port number", which is 20049. Make the use of the well-
known port number automatic, just as it is for NFS/TCP and port
2049.
For NFSv2/3, Section 4.2 allows clients to simply choose 20049 as
the default or use rpcbind. I don't know of an NFS/RDMA server
implementation that registers it's NFS/RDMA service with rpcbind,
so automatically choosing 20049 seems like the better choice. The
other widely-deployed NFS/RDMA client, Solaris, also uses 20049
as the default port.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
The check for special (reserved) inode number checks in __ext4_iget()
was broken by commit 8a363970d1: ("ext4: avoid declaring fs
inconsistent due to invalid file handles"). This was caused by a
botched reversal of the sense of the flag now known as
EXT4_IGET_SPECIAL (when it was previously named EXT4_IGET_NORMAL).
Fix the logic appropriately.
Fixes: 8a363970d1 ("ext4: avoid declaring fs inconsistent...")
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Cc: stable@kernel.org
* Clean up unnecessary usage of prepare_to_wait_exclusive()
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJcJ8sYAAoJEB7SkWpmfYgCdJ8QAJ6uPXu26SOfwew3MITNfNGO
8VXnjDlOq8mi9u51xPeurcT+4h3g1AiEMGl6ugI8Jxx704a4/P80fnAjVTmYvugy
7Ub29tTpqFPaermaT2N/K4zyqZxo/ozo5k1q3EqvNYc2IIBDlHKwKcirQpTqzIJ/
hv8sgLLf/f9J6CtBNSXeGfsV6DKp8bmqXvSGzSsphhbkcW/i1UMCey5rXN5iIT4/
gSdSeLxP6asjzeGm1/sC1G6g3Pi6USVmWe6Cs7dMbPSgkmzpGirkobmx+e34npBQ
gmabFMxaClPCar2vAGorhPbtXu5uZrHCURirVpMvmIj9MJlK/8uX4kbgn6r6N5nS
hZRZlnIvvjfucb66xCyFE/1I2xL7iIdOlcLSyG4f6bGAZTmupFGGOsoyf+BQSeT0
08n4rvmBWQ/thUXAzkR4yUu77zRmQkmwbTjnOXUv4GNocvMoUcLwazh1QeY8W2rW
RnUkk8B3iEgjfpKrjok/6MWd8qokwUVozOKUSVvKc8MEMraPVNzQMDKIl0hWUuE5
kjF+YXv+qozYvLR7IToqx+2TZp6VcZUujV5qof05nPQGHIztkwHIKZg7EimZe8qa
hKZA2X+1XOv2EGYLq5XexxR8rehiqgH7HlaMwuQBYqEnmkTx4tVWHwax2+vzVnVh
UcpyHRN2RFwPWIIUTaeN
=2anQ
-----END PGP SIGNATURE-----
Merge tag 'dax-fix-4.21' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull dax fix from Dan Williams:
"Clean up unnecessary usage of prepare_to_wait_exclusive().
While I feel a bit silly sending a single-commit pull-request there is
nothing else queued up for dax this cycle. This change has shipped in
-next for multiple releases"
* tag 'dax-fix-4.21' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
dax: Use non-exclusive wait in wait_entry_unlocked()
In this round, we've focused on bug fixes since Pixel devices have been
shipping with f2fs. Some of them were related to hardware encryption support
which are actually not an issue in mainline, but would be better to merge
them in order to avoid potential bugs.
Enhancement:
- do GC sub-sections when the section is large
- add a flag in ioctl(SHUTDOWN) to trigger fsck for QA
- use kvmalloc() in order to give another chance to avoid ENOMEM
Bug fix:
- fix accessing memory boundaries in a malformed iamge
- GC gives stale unencrypted block
- GC counts in large sections
- detect idle time more precisely
- block allocation of DIO writes
- race conditions between write_begin and write_checkpoint
- allow GCs for node segments via ioctl()
There are various clean-ups and minor bug fixes as well.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE00UqedjCtOrGVvQiQBSofoJIUNIFAlwmkzcACgkQQBSofoJI
UNLIlw//RUd8rH8VMW/BcXQ3fwB8qFCk296cIVlWQu12BV7vmMKY886Sc+C9uAuR
9Lze4D6Iyg1k2KLkaWKhnE2E4ZWQ2O5zVf11r4amm1bN9glPJAKfLWbenJhMfeGa
IdWo1ksntxgjfqLim2QlGSv5cBUjEo8zf32D446BbQURQMD/mm/lndywItS5vTx1
2eGKJ3pSXX6geFzyf1pD/wmC/7j6roKaTKAIMBagIuINNukeHctT/gRfklmYmOFC
lRbXlsYhjIfGln6gktY6r4qyZbouOMad2fDl+dkp+UewQM4/2xLUaMqiN2eqd7Hr
cag9LGe80MGgWwesqS6VOJHQviOlNd78Ie//iO1+ZsaIxKrvgzocRSDdxZswv+Vq
reZNsEKH0s4/0ik9hiprW1NdQLvtg7e/UWPyrRvVRlDVitW/BXOWRBpGVgkdg01K
OmAxNrVcr2+Gxp8jZL7M+rQdpApL/ms31x9HE/8u5ABBZfmCy3zhpICTZ6Hdngp1
3iUi2PvKUg+Y7KgII1nK2HLDpmZZ5UkLm8NhxLI0ntBWExjQjOBasOxPJf4en/UR
TAtK3ozNKim3fQ01NCBy3B8WMKRtovxLK6PgqNPROJparlKJ3FEHnPIrYKglDUPg
sxWCm1gFSraRGP86TaeGeD+v+PP6erI9CTaf6F/U3pWFgIpHoPg=
=NEns
-----END PGP SIGNATURE-----
Merge tag 'f2fs-for-4.21' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
Pull f2fs updates from Jaegeuk Kim:
"In this round, we've focused on bug fixes since Pixel devices have
been shipping with f2fs. Some of them were related to hardware
encryption support which are actually not an issue in mainline, but
would be better to merge them in order to avoid potential bugs.
Enhancements:
- do GC sub-sections when the section is large
- add a flag in ioctl(SHUTDOWN) to trigger fsck for QA
- use kvmalloc() in order to give another chance to avoid ENOMEM
Bug fixes:
- fix accessing memory boundaries in a malformed iamge
- GC gives stale unencrypted block
- GC counts in large sections
- detect idle time more precisely
- block allocation of DIO writes
- race conditions between write_begin and write_checkpoint
- allow GCs for node segments via ioctl()
There are various clean-ups and minor bug fixes as well"
* tag 'f2fs-for-4.21' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (43 commits)
f2fs: sanity check of xattr entry size
f2fs: fix use-after-free issue when accessing sbi->stat_info
f2fs: check PageWriteback flag for ordered case
f2fs: fix validation of the block count in sanity_check_raw_super
f2fs: fix missing unlock(sbi->gc_mutex)
f2fs: fix to dirty inode synchronously
f2fs: clean up structure extent_node
f2fs: fix block address for __check_sit_bitmap
f2fs: fix sbi->extent_list corruption issue
f2fs: clean up checkpoint flow
f2fs: flush stale issued discard candidates
f2fs: correct wrong spelling, issing_*
f2fs: use kvmalloc, if kmalloc is failed
f2fs: remove redundant comment of unused wio_mutex
f2fs: fix to reorder set_page_dirty and wait_on_page_writeback
f2fs: clear PG_writeback if IPU failed
f2fs: add an ioctl() to explicitly trigger fsck later
f2fs: avoid frequent costly fsck triggers
f2fs: fix m_may_create to make OPU DIO write correctly
f2fs: fix to update new block address correctly for OPU
...
Patch fixes compilation error in nfs_callback_up_net()
serv->sv_bc_enabled is defined under enabled CONFIG_SUNRPC_BACKCHANNEL,
however nfs_callback_up_net() can access it even if this config option
was not set.
Fixes: a289ce5311 (sunrpc: replace svc_serv->sv_bc_xprt by boolean flag)
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
We can not append small padding buffers as separate iovs when encryption is
used. For this case we must flatten the request into a single buffer
containing both the data from all the iovs as well as the padding bytes.
This is at least needed for 4.20 as well due to compounding changes.
CC: Stable <stable@vger.kernel.org>
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
We already using mapping_set_error() in fs/ext4/page_io.c, so all we
need to do is to use file_check_and_advance_wb_err() when handling
fsync() requests in ext4_sync_file().
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
In no-journal mode, we previously used __generic_file_fsync() in
no-journal mode. This triggers a lockdep warning, and in addition,
it's not safe to depend on the inode writeback mechanism in the case
ext4. We can solve both problems by calling ext4_write_inode()
directly.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
The xfstests generic/475 test switches the underlying device with
dm-error while running a stress test. This results in a large number
of file system errors, and since we can't lock the buffer head when
marking the superblock dirty in the ext4_grp_locked_error() case, it's
possible the superblock to be !buffer_uptodate() without
buffer_write_io_error() being true.
We need to set buffer_uptodate() before we call mark_buffer_dirty() or
this will trigger a WARN_ON. It's safe to do this since the
superblock must have been properly read into memory or the mount would
have been successful. So if buffer_uptodate() is not set, we can
safely assume that this happened due to a failed attempt to write the
superblock.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
Drop LIST_HEAD where the variable it declares is never used.
Commit 0410c3bb2b ("xfs: factor ag btree root block
initialisation") stopped using buffer_list and started using a
buffer list in an aghdr_init_data structure, but the declaration
of buffer_list was not removed.
The semantic patch that fixes this problem is as follows:
(http://coccinelle.lip6.fr/)
// <smpl>
@@
identifier x;
@@
- LIST_HEAD(x);
... when != x
// </smpl>
Fixes: 0410c3bb2b ("xfs: factor ag btree root block initialisation")
Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Drop LIST_HEAD where the variable it declares has never
been used.
The semantic patch that fixes this problem is as follows:
(http://coccinelle.lip6.fr/)
// <smpl>
@@
identifier x;
@@
- LIST_HEAD(x);
... when != x
// </smpl>
Fixes: 26f1fe858f ("xfs: reduce lock hold times in buffer writeback")
Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Here is the big set of char and misc driver patches for 4.21-rc1.
Lots of different types of driver things in here, as this tree seems to
be the "collection of various driver subsystems not big enough to have
their own git tree" lately.
Anyway, some highlights of the changes in here:
- binderfs: is it a rule that all driver subsystems will eventually
grow to have their own filesystem? Binder now has one to handle the
use of it in containerized systems. This was discussed at the
Plumbers conference a few months ago and knocked into mergable shape
very fast by Christian Brauner. Who also has signed up to be
another binder maintainer, showing a distinct lack of good judgement :)
- binder updates and fixes
- mei driver updates
- fpga driver updates and additions
- thunderbolt driver updates
- soundwire driver updates
- extcon driver updates
- nvmem driver updates
- hyper-v driver updates
- coresight driver updates
- pvpanic driver additions and reworking for more device support
- lp driver updates. Yes really, it's _finally_ moved to the proper
parallal port driver model, something I never thought I would see
happen. Good stuff.
- other tiny driver updates and fixes.
All of these have been in linux-next for a while with no reported
issues.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCXCZCUA8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+ymF9QCgx/Z8Fj1qzGVGrIE4flXOi7pxOrgAoMqJEWtU
ywwL8M9suKDz7cZT9fWQ
=xxr6
-----END PGP SIGNATURE-----
Merge tag 'char-misc-4.21-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc
Pull char/misc driver updates from Greg KH:
"Here is the big set of char and misc driver patches for 4.21-rc1.
Lots of different types of driver things in here, as this tree seems
to be the "collection of various driver subsystems not big enough to
have their own git tree" lately.
Anyway, some highlights of the changes in here:
- binderfs: is it a rule that all driver subsystems will eventually
grow to have their own filesystem? Binder now has one to handle the
use of it in containerized systems.
This was discussed at the Plumbers conference a few months ago and
knocked into mergable shape very fast by Christian Brauner. Who
also has signed up to be another binder maintainer, showing a
distinct lack of good judgement :)
- binder updates and fixes
- mei driver updates
- fpga driver updates and additions
- thunderbolt driver updates
- soundwire driver updates
- extcon driver updates
- nvmem driver updates
- hyper-v driver updates
- coresight driver updates
- pvpanic driver additions and reworking for more device support
- lp driver updates. Yes really, it's _finally_ moved to the proper
parallal port driver model, something I never thought I would see
happen. Good stuff.
- other tiny driver updates and fixes.
All of these have been in linux-next for a while with no reported
issues"
* tag 'char-misc-4.21-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (116 commits)
MAINTAINERS: add another Android binder maintainer
intel_th: msu: Fix an off-by-one in attribute store
stm class: Add a reference to the SyS-T document
stm class: Fix a module refcount leak in policy creation error path
char: lp: use new parport device model
char: lp: properly count the lp devices
char: lp: use first unused lp number while registering
char: lp: detach the device when parallel port is removed
char: lp: introduce list to save port number
bus: qcom: remove duplicated include from qcom-ebi2.c
VMCI: Use memdup_user() rather than duplicating its implementation
char/rtc: Use of_node_name_eq for node name comparisons
misc: mic: fix a DMA pool free failure
ptp: fix an IS_ERR() vs NULL check
genwqe: Fix size check
binder: implement binderfs
binder: fix use-after-free due to ksys_close() during fdget()
bus: fsl-mc: remove duplicated include files
bus: fsl-mc: explicitly define the fsl_mc_command endianness
misc: ti-st: make array read_ver_cmd static, shrinks object size
...
Here is the "big" set of driver core patches for 4.21-rc1.
It's not really big, just a number of small changes for some reported
issues, some documentation updates to hopefully make it harder for
people to abuse the driver model, and some other minor cleanups.
All of these have been in linux-next for a while with no reported
issues.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCXCY/dA8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+ylZrgCeIi+rWj0mqlyKZk0A+gurH2BPmfwAniGfiHJp
w60Fr5/EbCqUr1d1wQIO
=4N7R
-----END PGP SIGNATURE-----
Merge tag 'driver-core-4.21-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core updates from Greg KH:
"Here is the "big" set of driver core patches for 4.21-rc1.
It's not really big, just a number of small changes for some reported
issues, some documentation updates to hopefully make it harder for
people to abuse the driver model, and some other minor cleanups.
All of these have been in linux-next for a while with no reported
issues"
* tag 'driver-core-4.21-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
mm, memory_hotplug: update a comment in unregister_memory()
component: convert to DEFINE_SHOW_ATTRIBUTE
sysfs: Disable lockdep for driver bind/unbind files
driver core: Add missing dev->bus->need_parent_lock checks
kobject: return error code if writing /sys/.../uevent fails
driver core: Move async_synchronize_full call
driver core: platform: Respect return code of platform_device_register_full()
kref/kobject: Improve documentation
drivers/base/memory.c: Use DEVICE_ATTR_RO and friends
driver core: Replace simple_strto{l,ul} by kstrtou{l,ul}
kernfs: Improve kernfs_notify() poll notification latency
kobject: Fix warnings in lib/kobject_uevent.c
kobject: drop unnecessary cast "%llu" for u64
driver core: fix comments for device_block_probing()
driver core: Replace simple_strtol by kstrtoint
Merge misc updates from Andrew Morton:
- large KASAN update to use arm's "software tag-based mode"
- a few misc things
- sh updates
- ocfs2 updates
- just about all of MM
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (167 commits)
kernel/fork.c: mark 'stack_vm_area' with __maybe_unused
memcg, oom: notify on oom killer invocation from the charge path
mm, swap: fix swapoff with KSM pages
include/linux/gfp.h: fix typo
mm/hmm: fix memremap.h, move dev_page_fault_t callback to hmm
hugetlbfs: Use i_mmap_rwsem to fix page fault/truncate race
hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization
memory_hotplug: add missing newlines to debugging output
mm: remove __hugepage_set_anon_rmap()
include/linux/vmstat.h: remove unused page state adjustment macro
mm/page_alloc.c: allow error injection
mm: migrate: drop unused argument of migrate_page_move_mapping()
blkdev: avoid migration stalls for blkdev pages
mm: migrate: provide buffer_migrate_page_norefs()
mm: migrate: move migrate_page_lock_buffers()
mm: migrate: lock buffers before migrate_page_move_mapping()
mm: migration: factor out code to compute expected number of page references
mm, page_alloc: enable pcpu_drain with zone capability
kmemleak: add config to select auto scan
mm/page_alloc.c: don't call kasan_free_pages() at deferred mem init
...
This has been a fairly typical cycle, with the usual sorts of driver
updates. Several series continue to come through which improve and
modernize various parts of the core code, and we finally are starting to
get the uAPI command interface cleaned up.
- Various driver fixes for bnxt_re, cxgb3/4, hfi1, hns, i40iw, mlx4, mlx5,
qib, rxe, usnic
- Rework the entire syscall flow for uverbs to be able to run over
ioctl(). Finally getting past the historic bad choice to use write()
for command execution
- More functional coverage with the mlx5 'devx' user API
- Start of the HFI1 series for 'TID RDMA'
- SRQ support in the hns driver
- Support for new IBTA defined 2x lane widths
- A big series to consolidate all the driver function pointers into
a big struct and have drivers provide a 'static const' version of the
struct instead of open coding initialization
- New 'advise_mr' uAPI to control device caching/loading of page tables
- Support for inline data in SRPT
- Modernize how umad uses the driver core and creates cdev's and sysfs
files
- First steps toward removing 'uobject' from the view of the drivers
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEfB7FMLh+8QxL+6i3OG33FX4gmxoFAlwhV2oACgkQOG33FX4g
mxpF8A/9EkRCg6wCDC59maA53b5PjuNmD//9hXbycQPQSlxntI2PyYtxrzBqc0+2
yIaFFMehL41XNN6y1zfkl7ndl62McCH2TpiidU8RyTxVw/e3KsDD5sU6++atfHRo
M82RNfedDtxPG8TcCPKVLof6JHADApGSR1r4dCYfAnu7KFMyvlLmeYyx4r/2E6yC
iQPmtKVOdbGkuWGeX+brGEA0vg7FUOAvaysnxddjyh9hyem4h0SUR3Af/Ik0N5ME
PYzC+hMKbkPVBLoCWyg7QwUaqK37uWwguMQLtI2byF7FgbiK/lBQt6TsidR4Fw3p
EalL7uqxgCTtLYh918vxLFjdYt6laka9j7xKCX8M8d06sy/Lo8iV4hWjiTESfMFG
usqs7D6p09gA/y1KISji81j6BI7C92CPVK2drKIEnfyLgY5dBNFcv9m2H12lUCH2
NGbfCNVaTQVX6bFWPpy2Bt2y/Litsfxw5RviehD7jlG0lQjsXGDkZzsDxrMSSlNU
S79iiTJyK4kUZkXzrSSlN58pLBlbupJwm5MDjKmM+irsrsCHjGIULvc902qtnC3/
8ImiTtW6XvqLbgWXyy2Th8/ZgRY234p1ybhog+DFaGKUch0XqB7VXTV2OZm0GjcN
Fp4PUeBt+/gBgYqjpuffqQc1rI4uwXYSoz7wq9RBiOpw5zBFT1E=
=T0p1
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull rdma updates from Jason Gunthorpe:
"This has been a fairly typical cycle, with the usual sorts of driver
updates. Several series continue to come through which improve and
modernize various parts of the core code, and we finally are starting
to get the uAPI command interface cleaned up.
- Various driver fixes for bnxt_re, cxgb3/4, hfi1, hns, i40iw, mlx4,
mlx5, qib, rxe, usnic
- Rework the entire syscall flow for uverbs to be able to run over
ioctl(). Finally getting past the historic bad choice to use
write() for command execution
- More functional coverage with the mlx5 'devx' user API
- Start of the HFI1 series for 'TID RDMA'
- SRQ support in the hns driver
- Support for new IBTA defined 2x lane widths
- A big series to consolidate all the driver function pointers into a
big struct and have drivers provide a 'static const' version of the
struct instead of open coding initialization
- New 'advise_mr' uAPI to control device caching/loading of page
tables
- Support for inline data in SRPT
- Modernize how umad uses the driver core and creates cdev's and
sysfs files
- First steps toward removing 'uobject' from the view of the drivers"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (193 commits)
RDMA/srpt: Use kmem_cache_free() instead of kfree()
RDMA/mlx5: Signedness bug in UVERBS_HANDLER()
IB/uverbs: Signedness bug in UVERBS_HANDLER()
IB/mlx5: Allocate the per-port Q counter shared when DEVX is supported
IB/umad: Start using dev_groups of class
IB/umad: Use class_groups and let core create class file
IB/umad: Refactor code to use cdev_device_add()
IB/umad: Avoid destroying device while it is accessed
IB/umad: Simplify and avoid dynamic allocation of class
IB/mlx5: Fix wrong error unwind
IB/mlx4: Remove set but not used variable 'pd'
RDMA/iwcm: Don't copy past the end of dev_name() string
IB/mlx5: Fix long EEH recover time with NVMe offloads
IB/mlx5: Simplify netdev unbinding
IB/core: Move query port to ioctl
RDMA/nldev: Expose port_cap_flags2
IB/core: uverbs copy to struct or zero helper
IB/rxe: Reuse code which sets port state
IB/rxe: Make counters thread safe
IB/mlx5: Use the correct commands for UMEM and UCTX allocation
...
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAlwb7aEQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgptKREACow57782zzoaqert39bczau9nDfY83vd0r
u5ToCJstrN7esO0idJx5XLJHJjvbKlXa1I06shG3beX/kSnI+RYtMnKgBh1tzWmj
ywxJLBB3CkCP/0Kt8oAzgjiiGUXeZYWmEkytk8JLXyQQxIBHrxBkNu6+HwBmxkOp
kV06GGVR9l1cMUl6RF9pLdkRwMyZ7PPuNvPGjhbVDvHCIbRNeruxFU/a3TCqGRn/
oFRCBCEnaUZHIH0M9XeQFiCOXo4A1wE+CKM7ymMpAfLF0DGUo+EsAKFC6380eXLv
Haiv0rmzGAJJE3BLvkVOLz5UlatPzIBvhp8l//Jxv9wi6x9JDbv31K+FaE8DiTuj
dubhnhFdEo9HfcRPxBEysuMCho+56ZF2mw/kb0V0aRR9m3tRstrcyYXuVBM/PMpI
HQrklAS25J70WtdlnSGayvasNC2H/HURFkGG9+QooW7r4trs56fjeef3O/wZVDIh
oFnJVCOkcSo8O0bbh24LOeE6o1uGqlfUGCPM/Fv1qPJVwfGS41CfgilljsN0skCo
Q3Cv8DPDeGZf8i3Bi9nvddWP9x7p9qaRg8sSKxlOtCjMdr4OIwe7vINQpUDyygQF
ARqcRkfvHEdeN0KO38+6x6Q5YOchFMBwU311SkhVO4EQzto8WjDTJr2CR0Z7mKyC
mDLWR6vM/g==
=d3W9
-----END PGP SIGNATURE-----
Merge tag 'for-4.21/aio-20181221' of git://git.kernel.dk/linux-block
Pull aio updates from Jens Axboe:
"Flushing out pre-patches for the buffered/polled aio series. Some
fixes in here, but also optimizations"
* tag 'for-4.21/aio-20181221' of git://git.kernel.dk/linux-block:
aio: abstract out io_event filler helper
aio: split out iocb copy from io_submit_one()
aio: use iocb_put() instead of open coding it
aio: only use blk plugs for > 2 depth submissions
aio: don't zero entire aio_kiocb aio_get_req()
aio: separate out ring reservation from req allocation
aio: use assigned completion handler
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAlwb7R8QHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpjiID/97oDjMhNT7rwpuMbHw855h62j1hEN/m+N3
FI0uxivYoYZLD+eJRnMcBwHlKjrCX8iJQAcv9ffI3ThtFW7dnZT3atUacaZVR/Dt
IrxdymdBP3qsmuaId5NYBug7rJ+AiqFJKjEvCcSPu5X397J4I3SEbzhfvYLJ/aZX
16o0HJlVVIrcbmq1IP4HwiIIOaKXvPaw04L4z4fpeynRSWG7EAi8NLSnhlR4Rxbb
BTiMkCTsjRCFdyO6da4fvNQKWmPGPa3bJkYy3qR99cvJCeIbQjRyCloQlWNJRRgi
3eJpCHVxqFmN0/+DNTJVQEEr4H8o0AVucrLVct1Jc4pessenkpoUniP8vELqwlng
Z2VHLkhTfCEmvFlk82grrYdNvGATRsrbswt/PlP4T7rBfr1IpDk8kXDWF59EL2dy
ly35Sk3wJGHBl8qa+vEPXOAnaWdqJXuVGpwB4ifOIatOls8mOxwfZjiRc7x05/fC
1O4rR2IfLwRqwoYHs0AJ+h6ohOSn1mkGezl2Tch1VSFcJUOHmuYvraTaUi6hblpA
SslaAoEhO39hRBL0HsvsMeqVWM9uzqvFkLDCfNPdiA81H1258CIbo4vF8z6czCIS
eeXnTJxVhPVbZgb3a1a93SPwM6KIDZFoIijyd+NqjpU94thlnhYD0QEcKJIKH7os
2p4aHs6ktw==
=TRdW
-----END PGP SIGNATURE-----
Merge tag 'for-4.21/block-20181221' of git://git.kernel.dk/linux-block
Pull block updates from Jens Axboe:
"This is the main pull request for block/storage for 4.21.
Larger than usual, it was a busy round with lots of goodies queued up.
Most notable is the removal of the old IO stack, which has been a long
time coming. No new features for a while, everything coming in this
week has all been fixes for things that were previously merged.
This contains:
- Use atomic counters instead of semaphores for mtip32xx (Arnd)
- Cleanup of the mtip32xx request setup (Christoph)
- Fix for circular locking dependency in loop (Jan, Tetsuo)
- bcache (Coly, Guoju, Shenghui)
* Optimizations for writeback caching
* Various fixes and improvements
- nvme (Chaitanya, Christoph, Sagi, Jay, me, Keith)
* host and target support for NVMe over TCP
* Error log page support
* Support for separate read/write/poll queues
* Much improved polling
* discard OOM fallback
* Tracepoint improvements
- lightnvm (Hans, Hua, Igor, Matias, Javier)
* Igor added packed metadata to pblk. Now drives without metadata
per LBA can be used as well.
* Fix from Geert on uninitialized value on chunk metadata reads.
* Fixes from Hans and Javier to pblk recovery and write path.
* Fix from Hua Su to fix a race condition in the pblk recovery
code.
* Scan optimization added to pblk recovery from Zhoujie.
* Small geometry cleanup from me.
- Conversion of the last few drivers that used the legacy path to
blk-mq (me)
- Removal of legacy IO path in SCSI (me, Christoph)
- Removal of legacy IO stack and schedulers (me)
- Support for much better polling, now without interrupts at all.
blk-mq adds support for multiple queue maps, which enables us to
have a map per type. This in turn enables nvme to have separate
completion queues for polling, which can then be interrupt-less.
Also means we're ready for async polled IO, which is hopefully
coming in the next release.
- Killing of (now) unused block exports (Christoph)
- Unification of the blk-rq-qos and blk-wbt wait handling (Josef)
- Support for zoned testing with null_blk (Masato)
- sx8 conversion to per-host tag sets (Christoph)
- IO priority improvements (Damien)
- mq-deadline zoned fix (Damien)
- Ref count blkcg series (Dennis)
- Lots of blk-mq improvements and speedups (me)
- sbitmap scalability improvements (me)
- Make core inflight IO accounting per-cpu (Mikulas)
- Export timeout setting in sysfs (Weiping)
- Cleanup the direct issue path (Jianchao)
- Export blk-wbt internals in block debugfs for easier debugging
(Ming)
- Lots of other fixes and improvements"
* tag 'for-4.21/block-20181221' of git://git.kernel.dk/linux-block: (364 commits)
kyber: use sbitmap add_wait_queue/list_del wait helpers
sbitmap: add helpers for add/del wait queue handling
block: save irq state in blkg_lookup_create()
dm: don't reuse bio for flushes
nvme-pci: trace SQ status on completions
nvme-rdma: implement polling queue map
nvme-fabrics: allow user to pass in nr_poll_queues
nvme-fabrics: allow nvmf_connect_io_queue to poll
nvme-core: optionally poll sync commands
block: make request_to_qc_t public
nvme-tcp: fix spelling mistake "attepmpt" -> "attempt"
nvme-tcp: fix endianess annotations
nvmet-tcp: fix endianess annotations
nvme-pci: refactor nvme_poll_irqdisable to make sparse happy
nvme-pci: only set nr_maps to 2 if poll queues are supported
nvmet: use a macro for default error location
nvmet: fix comparison of a u16 with -1
blk-mq: enable IO poll if .nr_queues of type poll > 0
blk-mq: change blk_mq_queue_busy() to blk_mq_queue_inflight()
blk-mq: skip zero-queue maps in blk_mq_map_swqueue
...
This concludes the main part of the system call rework for 64-bit time_t,
which has spread over most of year 2018, the last six system calls being
- ppoll
- pselect6
- io_pgetevents
- recvmmsg
- futex
- rt_sigtimedwait
As before, nothing changes for 64-bit architectures, while 32-bit
architectures gain another entry point that differs only in the layout
of the timespec structure. Hopefully in the next release we can wire up
all 22 of those system calls on all 32-bit architectures, which gives
us a baseline version for glibc to start using them.
This does not include the clock_adjtime, getrusage/waitid, and
getitimer/setitimer system calls. I still plan to have new versions
of those as well, but they are not required for correct operation of
the C library since they can be emulated using the old 32-bit time_t
based system calls.
Aside from the system calls, there are also a few cleanups here,
removing old kernel internal interfaces that have become unused after
all references got removed. The arch/sh cleanups are part of this,
there were posted several times over the past year without a reaction
from the maintainers, while the corresponding changes made it into all
other architectures.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJcHCCRAAoJEGCrR//JCVInkqsP/3TuLgSyQwolFRXcoBOjR1Ar
JoX33GuDlAxHSqPadButVfflmRIWvL3aNMFFwcQM4uYgQ593FoHbmnusCdFgHcQ7
Q13pGo7szbfEFxydhnDMVust/hxd5C9Y5zNSJ+eMLGLLJXosEyjd9YjRoHDROWal
oDLqpPCArlLN1B1XFhjH8J847+JgS+hUrAfk3AOU0B2TuuFkBnRImlCGCR5JcgPh
XIpHRBOgEMP4kZ3LjztPfS3v/XJeGrguRcbD3FsPKdPeYO9QRUiw0vahEQRr7qXL
9hOgDq1YHPUQeUFhy3hJPCZdsDFzWoIE7ziNkZCZvGBw+qSw9i8KChGUt6PcSNlJ
nqKJY5Wneb4svu+kOdK7d8ONbTdlVYvWf5bj/sKoNUA4BVeIjNcDXplvr3cXiDzI
e40CcSQ3oLEvrIxMcoyNPPG63b+FYG8nMaCOx4dB4pZN7sSvZUO9a1DbDBtzxMON
xy5Kfk1n5gIHcfBJAya5CnMQ1Jm4FCCu/LHVanYvb/nXA/2jEegSm24Md17icE/Q
VA5jJqIdICExor4VHMsG0lLQxBJsv/QqYfT2OCO6Oykh28mjFqf+X+9Ctz1w6KVG
VUkY1u97x8jB0M4qolGO7ZGn6P1h0TpNVFD1zDNcDt2xI63cmuhgKWiV2pv5b7No
ty6insmmbJWt3tOOPyfb
=yIAT
-----END PGP SIGNATURE-----
Merge tag 'y2038-for-4.21' of ssh://gitolite.kernel.org:/pub/scm/linux/kernel/git/arnd/playground
Pull y2038 updates from Arnd Bergmann:
"More syscalls and cleanups
This concludes the main part of the system call rework for 64-bit
time_t, which has spread over most of year 2018, the last six system
calls being
- ppoll
- pselect6
- io_pgetevents
- recvmmsg
- futex
- rt_sigtimedwait
As before, nothing changes for 64-bit architectures, while 32-bit
architectures gain another entry point that differs only in the layout
of the timespec structure. Hopefully in the next release we can wire
up all 22 of those system calls on all 32-bit architectures, which
gives us a baseline version for glibc to start using them.
This does not include the clock_adjtime, getrusage/waitid, and
getitimer/setitimer system calls. I still plan to have new versions of
those as well, but they are not required for correct operation of the
C library since they can be emulated using the old 32-bit time_t based
system calls.
Aside from the system calls, there are also a few cleanups here,
removing old kernel internal interfaces that have become unused after
all references got removed. The arch/sh cleanups are part of this,
there were posted several times over the past year without a reaction
from the maintainers, while the corresponding changes made it into all
other architectures"
* tag 'y2038-for-4.21' of ssh://gitolite.kernel.org:/pub/scm/linux/kernel/git/arnd/playground:
timekeeping: remove obsolete time accessors
vfs: replace current_kernel_time64 with ktime equivalent
timekeeping: remove timespec_add/timespec_del
timekeeping: remove unused {read,update}_persistent_clock
sh: remove board_time_init() callback
sh: remove unused rtc_sh_get/set_time infrastructure
sh: sh03: rtc: push down rtc class ops into driver
sh: dreamcast: rtc: push down rtc class ops into driver
y2038: signal: Add compat_sys_rt_sigtimedwait_time64
y2038: signal: Add sys_rt_sigtimedwait_time32
y2038: socket: Add compat_sys_recvmmsg_time64
y2038: futex: Add support for __kernel_timespec
y2038: futex: Move compat implementation into futex.c
io_pgetevents: use __kernel_timespec
pselect6: use __kernel_timespec
ppoll: use __kernel_timespec
signal: Add restore_user_sigmask()
signal: Add set_user_sigmask()
hugetlbfs page faults can race with truncate and hole punch operations.
Current code in the page fault path attempts to handle this by 'backing
out' operations if we encounter the race. One obvious omission in the
current code is removing a page newly added to the page cache. This is
pretty straight forward to address, but there is a more subtle and
difficult issue of backing out hugetlb reservations. To handle this
correctly, the 'reservation state' before page allocation needs to be
noted so that it can be properly backed out. There are four distinct
possibilities for reservation state: shared/reserved, shared/no-resv,
private/reserved and private/no-resv. Backing out a reservation may
require memory allocation which could fail so that needs to be taken into
account as well.
Instead of writing the required complicated code for this rare occurrence,
just eliminate the race. i_mmap_rwsem is now held in read mode for the
duration of page fault processing. Hold i_mmap_rwsem longer in truncation
and hold punch code to cover the call to remove_inode_hugepages.
With this modification, code in remove_inode_hugepages checking for races
becomes 'dead' as it can not longer happen. Remove the dead code and
expand comments to explain reasoning. Similarly, checks for races with
truncation in the page fault path can be simplified and removed.
[mike.kravetz@oracle.com: incorporat suggestions from Kirill]
Link: http://lkml.kernel.org/r/20181222223013.22193-3-mike.kravetz@oracle.com
Link: http://lkml.kernel.org/r/20181218223557.5202-3-mike.kravetz@oracle.com
Fixes: ebed4bfc8d ("hugetlb: fix absurd HugePages_Rsvd")
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
All callers of migrate_page_move_mapping() now pass NULL for 'head'
argument. Drop it.
Link: http://lkml.kernel.org/r/20181211172143.7358-7-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently, block device pages don't provide a ->migratepage callback and
thus fallback_migrate_page() is used for them. This handler cannot deal
with dirty pages in async mode and also with the case a buffer head is in
the LRU buffer head cache (as it has elevated b_count). Thus such page
can block memory offlining.
Fix the problem by using buffer_migrate_page_norefs() for migrating block
device pages. That function takes care of dropping bh LRU in case
migration would fail due to elevated buffer refcount to avoid stalls and
can also migrate dirty pages without writing them.
Link: http://lkml.kernel.org/r/20181211172143.7358-6-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When the process being tracked does mremap() without
UFFD_FEATURE_EVENT_REMAP on the corresponding tracking uffd file handle,
we should not generate the remap event, and at the same time we should
clear all the uffd flags on the new VMA. Without this patch, we can still
have the VM_UFFD_MISSING|VM_UFFD_WP flags on the new VMA even the fault
handling process does not even know the existance of the VMA.
Link: http://lkml.kernel.org/r/20181211053409.20317-1-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Hugh Dickins <hughd@google.com>
Cc: Pavel Emelyanov <xemul@virtuozzo.com>
Cc: Pravin Shedge <pravin.shedge4linux@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Rientjes has reported that commit 1860033237 ("mm: make
PR_SET_THP_DISABLE immediately active") has changed the way how we
report THPable VMAs to the userspace. Their monitoring tool is
triggering false alarms on PR_SET_THP_DISABLE tasks because it considers
an insufficient THP usage as a memory fragmentation resp. memory
pressure issue.
Before the said commit each newly created VMA inherited VM_NOHUGEPAGE
flag and that got exposed to the userspace via /proc/<pid>/smaps file.
This implementation had its downsides as explained in the commit message
but it is true that the userspace doesn't have any means to query for
the process wide THP enabled/disabled status.
PR_SET_THP_DISABLE is a process wide flag so it makes a lot of sense to
export in the process wide context rather than per-vma. Introduce a new
field to /proc/<pid>/status which export this status. If
PR_SET_THP_DISABLE is used then it reports false same as when the THP is
not compiled in. It doesn't consider the global THP status because we
already export that information via sysfs
Link: http://lkml.kernel.org/r/20181211143641.3503-4-mhocko@kernel.org
Fixes: 1860033237 ("mm: make PR_SET_THP_DISABLE immediately active")
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reported-by: David Rientjes <rientjes@google.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Paul Oppenheimer <bepvte@gmail.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Userspace falls short when trying to find out whether a specific memory
range is eligible for THP. There are usecases that would like to know
that
http://lkml.kernel.org/r/alpine.DEB.2.21.1809251248450.50347@chino.kir.corp.google.com
: This is used to identify heap mappings that should be able to fault thp
: but do not, and they normally point to a low-on-memory or fragmentation
: issue.
The only way to deduce this now is to query for hg resp. nh flags and
confronting the state with the global setting. Except that there is also
PR_SET_THP_DISABLE that might change the picture. So the final logic is
not trivial. Moreover the eligibility of the vma depends on the type of
VMA as well. In the past we have supported only anononymous memory VMAs
but things have changed and shmem based vmas are supported as well these
days and the query logic gets even more complicated because the
eligibility depends on the mount option and another global configuration
knob.
Simplify the current state and report the THP eligibility in
/proc/<pid>/smaps for each existing vma. Reuse
transparent_hugepage_enabled for this purpose. The original
implementation of this function assumes that the caller knows that the vma
itself is supported for THP so make the core checks into
__transparent_hugepage_enabled and use it for existing callers.
__show_smap just use the new transparent_hugepage_enabled which also
checks the vma support status (please note that this one has to be out of
line due to include dependency issues).
[mhocko@kernel.org: fix oops with NULL ->f_mapping]
Link: http://lkml.kernel.org/r/20181224185106.GC16738@dhcp22.suse.cz
Link: http://lkml.kernel.org/r/20181211143641.3503-3-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Paul Oppenheimer <bepvte@gmail.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
To avoid having to change many call sites everytime we want to add a
parameter use a structure to group all parameters for the mmu_notifier
invalidate_range_start/end cakks. No functional changes with this patch.
[akpm@linux-foundation.org: coding style fixes]
Link: http://lkml.kernel.org/r/20181205053628.3210-3-jglisse@redhat.com
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Acked-by: Christian König <christian.koenig@amd.com>
Acked-by: Jan Kara <jack@suse.cz>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Ross Zwisler <zwisler@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krcmar <rkrcmar@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Felix Kuehling <felix.kuehling@amd.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
From: Jérôme Glisse <jglisse@redhat.com>
Subject: mm/mmu_notifier: use structure for invalidate_range_start/end calls v3
fix build warning in migrate.c when CONFIG_MMU_NOTIFIER=n
Link: http://lkml.kernel.org/r/20181213171330.8489-3-jglisse@redhat.com
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Certain pages that are never mapped to userspace have a type indicated in
the page_type field of their struct pages (e.g. PG_buddy). page_type
overlaps with _mapcount so set the count to 0 and avoid calling
page_mapcount() for these pages.
[anthony.yznaga@oracle.com: incorporate feedback from Matthew Wilcox]
Link: http://lkml.kernel.org/r/1544481313-27318-1-git-send-email-anthony.yznaga@oracle.com
Link: http://lkml.kernel.org/r/1543963526-27917-1-git-send-email-anthony.yznaga@oracle.com
Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Miles Chen <miles.chen@mediatek.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Reference counters should use refcount_t rather than atomic_t, since the
refcount_t implementation can prevent overflows, reducing the
exploitability of reference leak bugs. userfaultfd_ctx::refcount is a
reference counter with the usual semantics, so convert it to refcount_t.
Note: I replaced the BUG() on incrementing a 0 refcount with just
refcount_inc(), since part of the semantics of refcount_t is that that
incrementing a 0 refcount is not allowed; with CONFIG_REFCOUNT_FULL,
refcount_inc() already checks for it and warns.
Link: http://lkml.kernel.org/r/20181115003916.63381-1-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
totalram_pages and totalhigh_pages are made static inline function.
Main motivation was that managed_page_count_lock handling was complicating
things. It was discussed in length here,
https://lore.kernel.org/patchwork/patch/995739/#1181785 So it seemes
better to remove the lock and convert variables to atomic, with preventing
poteintial store-to-read tearing as a bonus.
[akpm@linux-foundation.org: coding style fixes]
Link: http://lkml.kernel.org/r/1542090790-21750-4-git-send-email-arunks@codeaurora.org
Signed-off-by: Arun KS <arunks@codeaurora.org>
Suggested-by: Michal Hocko <mhocko@suse.com>
Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "mm: convert totalram_pages, totalhigh_pages and managed
pages to atomic", v5.
This series converts totalram_pages, totalhigh_pages and
zone->managed_pages to atomic variables.
totalram_pages, zone->managed_pages and totalhigh_pages updates are
protected by managed_page_count_lock, but readers never care about it.
Convert these variables to atomic to avoid readers potentially seeing a
store tear.
Main motivation was that managed_page_count_lock handling was complicating
things. It was discussed in length here,
https://lore.kernel.org/patchwork/patch/995739/#1181785 It seemes better
to remove the lock and convert variables to atomic. With the change,
preventing poteintial store-to-read tearing comes as a bonus.
This patch (of 4):
This is in preparation to a later patch which converts totalram_pages and
zone->managed_pages to atomic variables. Please note that re-reading the
value might lead to a different value and as such it could lead to
unexpected behavior. There are no known bugs as a result of the current
code but it is better to prevent from them in principle.
Link: http://lkml.kernel.org/r/1542090790-21750-2-git-send-email-arunks@codeaurora.org
Signed-off-by: Arun KS <arunks@codeaurora.org>
Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
For sync io read in ocfs2_read_blocks_sync(), first clear bh uptodate flag
and submit the io, second wait io done, last check whether bh uptodate, if
not return io error.
If two sync io for the same bh were issued, it could be the first io done
and set uptodate flag, but just before check that flag, the second io came
in and cleared uptodate, then ocfs2_read_blocks_sync() for the first io
will return IO error.
Indeed it's not necessary to clear uptodate flag, as the io end handler
end_buffer_read_sync() will set or clear it based on io succeed or failed.
The following message was found from a nfs server but the underlying
storage returned no error.
[4106438.567376] (nfsd,7146,3):ocfs2_get_suballoc_slot_bit:2780 ERROR: read block 1238823695 failed -5
[4106438.567569] (nfsd,7146,3):ocfs2_get_suballoc_slot_bit:2812 ERROR: status = -5
[4106438.567611] (nfsd,7146,3):ocfs2_test_inode_bit:2894 ERROR: get alloc slot and bit failed -5
[4106438.567643] (nfsd,7146,3):ocfs2_test_inode_bit:2932 ERROR: status = -5
[4106438.567675] (nfsd,7146,3):ocfs2_get_dentry:94 ERROR: test inode bit failed -5
Same issue in non sync read ocfs2_read_blocks(), fixed it as well.
Link: http://lkml.kernel.org/r/20181121020023.3034-4-junxiao.bi@oracle.com
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Changwei Ge <ge.changwei@h3c.com>
Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Jun Piao <piaojun@huawei.com>
Cc: Mark Fasheh <mfasheh@versity.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Dirty flag of the journal should be cleared at the last stage of umount,
if do it before jbd2_journal_destroy(), then some metadata in uncommitted
transaction could be lost due to io error, but as dirty flag of journal
was already cleared, we can't find that until run a full fsck. This may
cause system panic or other corruption.
Link: http://lkml.kernel.org/r/20181121020023.3034-3-junxiao.bi@oracle.com
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com>
Reviewed-by: Joseph Qi <jiangqi903@gmail.com>
Cc: Jun Piao <piaojun@huawei.com>
Cc: Changwei Ge <ge.changwei@h3c.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@versity.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Included file path was hard-wired in the ocfs2 makefile, which might
causes some confusion when compiling ocfs2 as an external module.
Say if we compile ocfs2 module as following.
cp -r /kernel/tree/fs/ocfs2 /other/dir/ocfs2
cd /other/dir/ocfs2
make -C /path/to/kernel_source M=`pwd` modules
Acutally, the compiler wil try to find included file in
/kernel/tree/fs/ocfs2, rather than the directory /other/dir/ocfs2.
To fix this little bug, we introduce the var $(src) provided by kbuild.
$(src) means the absolute path of the running kbuild file.
Link: http://lkml.kernel.org/r/20181108085546.15149-1-lchen@suse.com
Signed-off-by: Larry Chen <lchen@suse.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Changwei Ge <ge.changwei@h3c.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
lastzero is not used after setting its value. It is safe to remove the
unused variable.
Link: http://lkml.kernel.org/r/1540296942-24533-1-git-send-email-zhongjiang@huawei.com
Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Changwei Ge <ge.changwei@h3c.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
status is not used after setting its value. It is safe to remove the
unused variable.
Link: http://lkml.kernel.org/r/1540300179-26697-1-git-send-email-zhongjiang@huawei.com
Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Reviewed-by: Jun Piao <piaojun@huawei.com>
Acked-by: Joseph Qi <jiangqi903@gmail.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <ge.changwei@h3c.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Reading heartbeat data from lowest node rather than from zero, in cases
where the node is not defined from zero, can reduce the number of sectors
read.
Here is a simple test data obtained with 'iostat -dmx dm-5 2', with
two nodes in the cluster, node number 10, 20, respectively.
Before optimization:
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
dm-5 0.00 0.00 0.50 0.50 0.01 0.00 11.00 0.00 1.00 1.00 1.00 1.50 0.15
After the optimization:
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
dm-5 0.00 0.00 0.50 0.50 0.00 0.00 6.00 0.00 0.50 1.00 0.00 0.50 0.05
Link: http://lkml.kernel.org/r/99fe4988-69ac-3615-a218-3042fe6fbe72@huawei.com
Signed-off-by: Jia Guo <guojia12@huawei.com>
Reviewed-by: Jun Piao <piaojun@huawei.com>
Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com>
Acked-by: Joseph Qi <jiangqi903@gmail.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <ge.changwei@h3c.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Clarify the use of the CONFIG_DFS_UPCALL for DNS name resolution
when server ip addresses change (e.g. on long running mounts)
Signed-off-by: Steve French <stfrench@microsoft.com>
In case a hostname resolves to a different IP address (e.g. long
running mounts), make sure to resolve it every time prior to calling
generic_ip_connect() in reconnect.
Suggested-by: Steve French <stfrench@microsoft.com>
Signed-off-by: Paulo Alcantara <palcantara@suse.de>
Signed-off-by: Steve French <stfrench@microsoft.com>
After a successful failover, the cifs_reconnect_tcon() function will
make sure to reconnect every tcon to new target server.
Same as previous commit but for SMB1 codepath.
Signed-off-by: Paulo Alcantara <palcantara@suse.de>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
After a successful failover in cifs_reconnect(), the smb2_reconnect()
function will make sure to reconnect every tcon to new target server.
For SMB2+.
Signed-off-by: Paulo Alcantara <palcantara@suse.de>
Signed-off-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Fix potential NULL ptr deref when DFS target list is empty.
Signed-off-by: Paulo Alcantara <palcantara@suse.de>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Start the DFS cache refresh worker per volume during cifs mount.
Signed-off-by: Paulo Alcantara <palcantara@suse.de>
Reviewed-by: Aurelien Aptel <aaptel@suse.de>
Signed-off-by: Steve French <stfrench@microsoft.com>
A spin lock is held before kstrndup, it may sleep with holding
the spinlock, so we should use GFP_ATOMIC instead.
Fixes: e58c31d5e387 ("cifs: Add support for failover in cifs_reconnect()")
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Paulo Alcantara <palcantara@suse.de>
After failing to reconnect to original target, it will retry any
target available from DFS cache.
Signed-off-by: Paulo Alcantara <palcantara@suse.de>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
This patch adds support for failover when failing to connect in
cifs_mount().
Signed-off-by: Paulo Alcantara <palcantara@suse.de>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Fixes gcc '-Wunused-but-set-variable' warning:
fs/cifs/cifs_dfs_ref.c: In function 'cifs_dfs_do_automount':
fs/cifs/cifs_dfs_ref.c:309:7: warning:
variable 'sep' set but not used [-Wunused-but-set-variable]
It never used since introdution in commit 0f56b277073c ("cifs: Make use
of DFS cache to get new DFS referrals")
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Reviewed-by: Paulo Alcantara <palcantara@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
This patch will make use of DFS cache routines where appropriate and
do not always request a new referral from server.
Signed-off-by: Paulo Alcantara <palcantara@suse.de>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
kzalloc can return NULL so an additional check is needed. While there
is a check for ret_buf there is no check for the allocation of
ret_buf->crfid.fid - this check is thus added. Both call-sites
of tconInfoAlloc() check for NULL return of tconInfoAlloc()
so returning NULL on failure of kzalloc() here seems appropriate.
As the kzalloc() is the only thing here that can fail it is
moved to the beginning so as not to initialize other resources
on failure of kzalloc.
Fixes: 3d4ef9a153 ("smb3: fix redundant opens on root")
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Fixes gcc '-Wunused-but-set-variable' warning:
fs/cifs/smb2pdu.c: In function 'smb311_posix_mkdir':
fs/cifs/smb2pdu.c:2040:26: warning:
variable 'server' set but not used [-Wunused-but-set-variable]
fs/cifs/smb2pdu.c: In function 'build_qfs_info_req':
fs/cifs/smb2pdu.c:4067:26: warning:
variable 'server' set but not used [-Wunused-but-set-variable]
The first 'server' never used since commit bea851b8ba ("smb3: Fix mode on
mkdir on smb311 mounts")
And the second not used since commit 1fc6ad2f10 ("cifs: remove
header_preamble_size where it is always 0")
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
We should zero out the password before we free it.
Fixes: 3d6cacbb5310 ("cifs: Add DFS cache routines")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Paulo Alcantara <palcantara@suse.de>
memory allocated by kmem_cache_alloc() in alloc_cache_entry()
should be freed using kmem_cache_free(), not kfree().
Fixes: 34a44fb160f9 ("cifs: Add DFS cache routines")
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Fixes cifs build failure after merge of the y2038 tree
After merging the y2038 tree, today's linux-next build (x86_64
allmodconfig) failed like this:
fs/cifs/dfs_cache.c: In function 'cache_entry_expired':
fs/cifs/dfs_cache.c:106:7: error: implicit declaration of function 'current_kernel_time64'; did you mean 'core_kernel_text'? [-Werror=implicit-function-declaration]
ts = current_kernel_time64();
^~~~~~~~~~~~~~~~~~~~~
core_kernel_text
fs/cifs/dfs_cache.c:106:5: error: incompatible types when assigning to type 'struct timespec64' from type 'int'
ts = current_kernel_time64();
^
fs/cifs/dfs_cache.c: In function 'get_expire_time':
fs/cifs/dfs_cache.c:342:24: error: incompatible type for argument 1 of 'timespec64_add'
return timespec64_add(current_kernel_time64(), ts);
^~~~~~~~~~~~~~~~~~~~~~~
In file included from include/linux/restart_block.h:10,
from include/linux/thread_info.h:13,
from arch/x86/include/asm/preempt.h:7,
from include/linux/preempt.h:78,
from include/linux/rcupdate.h:40,
from fs/cifs/dfs_cache.c:8:
include/linux/time64.h:66:66: note: expected 'struct timespec64' but argument is of type 'int'
static inline struct timespec64 timespec64_add(struct timespec64 lhs,
~~~~~~~~~~~~~~~~~~^~~
fs/cifs/dfs_cache.c:343:1: warning: control reaches end of non-void function [-Wreturn-type]
}
^
Caused by:
commit ccea641b6742 ("timekeeping: remove obsolete time accessors")
interacting with:
commit 34a44fb160f9 ("cifs: Add DFS cache routines")
from the cifs tree.
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Reviewed-by: Paulo Alcantara <palcantara@suse.de>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Steve French <stfrench@microsoft.com>
* Add new dfs_cache.[ch] files
* Add new /proc/fs/cifs/dfscache file
- dump current cache when read
- clear current cache when writing "0" to it
* Add delayed_work to periodically refresh cache entries
The new interface will be used for caching DFS referrals, as well as
supporting client target failover.
The DFS cache is a hashtable that maps UNC paths to cache entries.
A cache entry contains:
- the UNC path it is mapped on
- how much the the UNC path the entry consumes
- flags
- a Time-To-Live after which the entry expires
- a list of possible targets (linked lists of UNC paths)
- a "hint target" pointing the last known working target or the first
target if none were tried. This hint lets cifs.ko remember and try
working targets first.
* Looking for an entry in the cache is done with dfs_cache_find()
- if no valid entries are found, a DFS query is made, stored in the
cache and returned
- the full target list can be copied and returned to avoid race
conditions and looped on with the help with the
dfs_cache_tgt_iterator
* Updating the target hint to the next target is done with
dfs_cache_update_tgthint()
These functions have a dfs_cache_noreq_XXX() version that doesn't
fetches referrals if no entries are found. These versions don't
require the tcp/ses/tcon/cifs_sb parameters as a result.
Expired entries cannot be used and since they have a pretty short TTL
[1] in order for them to be useful for failover the DFS cache adds a
delayed work called periodically to keep them fresh.
Since we might not have available connections to issue the referral
request when refreshing we need to store volume_info structs with
credentials and other needed info to be able to connect to the right
server.
1: Windows defaults: 5mn for domain-based referrals, 30mn for regular
links
Signed-off-by: Paulo Alcantara <palcantara@suse.de>
Signed-off-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
svc_serv-> sv_bc_xprt is netns-unsafe and cannot be used as pointer.
To prevent its misuse in future it is replaced by new boolean flag.
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Drop LIST_HEAD where the variable it declares is never used.
This was introduced in c5c707f96f ("nfsd: implement pNFS
layout recalls"), but was not used even in that commit.
The semantic patch that fixes this problem is as follows:
(http://coccinelle.lip6.fr/)
// <smpl>
@@
identifier x;
@@
- LIST_HEAD(x);
... when != x
// </smpl>
Fixes: c5c707f96f ("nfsd: implement pNFS layout recalls")
Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJcILryAAoJEAAOaEEZVoIVByMP/RYty0dsV9ALt0PKqxKyDVT7
9KOGA73JIahpeO2dnlIiZAXdeC3Y350UH2axrX6Xy/X6ktTCca3X/xqZMn/wK5kt
zeMeqcjsyTz4wJoOCWUm7nsteMALfusDAIMd8axEnBFgWk9nsQwgh+jS1gYVj2D1
GUSdCceKcQ0ZOSsZDzDFgd/R34dNMobvYd1aOE2bgHL19BAXj1aKIv81yAjjY41A
D+VVvEyzIHQUtxmWk1X3X8kYfhHMu4X5AQhhqPxw8jw8CC0w0lfQTOZe/ApeMfxZ
BFhusMdwf8QhkGWMOfhOTldTm3GobBmdsNX5HmukvcAEjDVPiOLCiKXaLWEBJ68r
HbmB3YyxzT2re0PfSa72WIu6W8aKZHUny+BLgTHiuW3KIV1khJK4AflWKMLfe8yi
0xUWm0SeiwMCdBLtkD8IykC19LBCAKM15JBxpUHadBvBsO9G+54DfzRImjnxSjAH
tX0RnFWmA7hXt02fZjcywT+/+n84kt0lbjdJAKXK6tl0DbGxlEj8YoJVBNG/p94q
ARHgBKMFTrsDPJiMY5WSobHapjmnSNDBk6uSf6TPe2E5wmnhLcwfTqwCdnpocW6c
A7CX+q/Jfyk1oNEGajXXyIsjaDc2Ma0Fs/mfb3o5Lhkjq68e+LT0+eIjn4veuY5H
KoZ/KVBGs4Uxy81qMPkl
=sUSp
-----END PGP SIGNATURE-----
Merge tag 'locks-v4.21-1' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux
Pull file locking updates from Jeff Layton:
"The main change in this set is Neil Brown's work to reduce the
thundering herd problem when a heavily-contended file lock is
released.
Previously we'd always wake up all waiters when this occurred. With
this set, we'll now we only wake up waiters that were blocked on the
range being released"
* tag 'locks-v4.21-1' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
locks: Use inode_is_open_for_write
fs/locks: remove unnecessary white space.
fs/locks: merge posix_unblock_lock() and locks_delete_block()
fs/locks: create a tree of dependent requests.
fs/locks: change all *_conflict() functions to return bool.
fs/locks: always delete_block after waiting.
fs/locks: allow a lock request to block other requests.
fs/locks: use properly initialized file_lock when unlocking.
ocfs2: properly initial file_lock used for unlock.
gfs2: properly initial file_lock used for unlock.
NFS: use locks_copy_lock() to copy locks.
fs/locks: split out __locks_wake_up_blocks().
fs/locks: rename some lists and pointers.
in ext4's NFS support, and fix an ioctl (EXT4_IOC_GROUP_ADD) used by
old versions of e2fsprogs which we accidentally broke a while back.
Also fixed some error paths in ext4's quota and inline data support.
Finally, improve tail latency in jbd2's commit code.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAlwgYYEACgkQ8vlZVpUN
gaPh1Af9GCGcgbwmmE+PSwqYXXdDm27hG3Xv4glLcAnmmuT32lMjzjifWPhT8sFs
+l1nh2rd0/u14PocjAL8neuuc9G9J+xNS33jlNQsqMsfFQD4cZMk7T1j68JIAEd3
bt/VNIUxvPshYwgvEJlXAeZvXx8kPMKyR44/FyzHdU9oDSWBYE3A9+rjRGUFxXDR
LuqwLhvERv6Vykfrzhluj8IOZM6V221alRDuWjx1sQF+/E6zAqyjR3YoYXk04Ajg
vnAfEXToeBwLVeTUQgmT9hPrinh7/00wCKekNuzzhg7oKDp7FgD1BMlxBr9eOW+5
pQwM9T+AVbs9EfpYasC6ElEMbLfOPw==
=lCnm
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 updates from Ted Ts'o:
"All cleanups and bug fixes; most notably, fix some problems discovered
in ext4's NFS support, and fix an ioctl (EXT4_IOC_GROUP_ADD) used by
old versions of e2fsprogs which we accidentally broke a while back.
Also fixed some error paths in ext4's quota and inline data support.
Finally, improve tail latency in jbd2's commit code"
* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: check for shutdown and r/o file system in ext4_write_inode()
ext4: force inode writes when nfsd calls commit_metadata()
ext4: avoid declaring fs inconsistent due to invalid file handles
ext4: include terminating u32 in size of xattr entries when expanding inodes
ext4: compare old and new mode before setting update_mode flag
ext4: fix EXT4_IOC_GROUP_ADD ioctl
ext4: hard fail dax mount on unsupported devices
jbd2: update locking documentation for transaction_t
ext4: remove redundant condition check
jbd2: clean up indentation issue, replace spaces with tab
ext4: clean up indentation issues, remove extraneous tabs
ext4: missing unlock/put_page() in ext4_try_to_write_inline_data()
ext4: fix possible use after free in ext4_quota_enable
jbd2: avoid long hold times of j_state_lock while committing a transaction
ext4: add ext4_sb_bread() to disambiguate ENOMEM cases
- Fix CoW remapping of extremely fragmented file areas
- Fix a zero-length symlink verifier error
- Constify some of the rmap owner structures for per-AG metadata
- Precalculate inode geometry for later use
- Fix scrub counting problems
- Don't crash when rtsummary inode is null
- Fix x32 ioctl operation
- Fix enum->string mappings for ftrace output
- Cache realtime summary information in memory
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAlwedm4ACgkQ+H93GTRK
tOumSBAAjP9nFAWmI9n/4GEyNI7p6Iz26q42vePiX7FspJY2iWG38gf5PyRarL7X
eWzmUhUJVQ3koA4+h+ImGPWwjZAyDUug7r437sPPPkcMvG9WKrj8A2M+kQve39NB
kCaXsRJFEvOdFXTMwTzCMrZrHa1b/slqj65eoUyXI7dzT0vwuU8Nj8CX6//zOHD2
4i7yiyGGnZvQ+Ql8ad+ck7LjooeBUzl3XBeOHlFZAO+gDCu6o1TUqO+x8HZPy58N
sfBLY/9vqlzEN8EpWjiTdI+GNH0zhO0tJQI5YJlGqa6SpIC78fWElfFF0A9CUXy4
meKc9VmJ5oYi0Rg62S3pKTRZWtkr9FddBjS/C0thvXIpw6CSmHz/ED41P/wrtnI+
vbknTXeiLeCuTTLtnYWJNIW1kvWjS2vXlQQT5jt/rp49ZqzfetM9nuopES0CcnCm
a8K++V2LZuYEPzM2fKQgAdd5z02BYeNu8uaEBEHcAWDUnVo/bKInp87dEiQZ5GXx
PBfIfOuqVooCppp9w1T7nzshI0R3k+LQj2R4S0d039hfEw8eQmPu39rfvugy6YwI
qCZo8uErzWmqF7T4EXMtGu1FS8xIwyLPyiqcsCd8g+zep95iVySSrhha+3DgR/mG
85s4VOE5aWijBrRZixIIqF1bIc+LPE/miBEXIyCsBPRD52+YQcQ=
=P0gt
-----END PGP SIGNATURE-----
Merge tag 'xfs-4.21-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull XFS updates from Darrick Wong:
- Fix CoW remapping of extremely fragmented file areas
- Fix a zero-length symlink verifier error
- Constify some of the rmap owner structures for per-AG metadata
- Precalculate inode geometry for later use
- Fix scrub counting problems
- Don't crash when rtsummary inode is null
- Fix x32 ioctl operation
- Fix enum->string mappings for ftrace output
- Cache realtime summary information in memory
* tag 'xfs-4.21-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (24 commits)
xfs: reallocate realtime summary cache on growfs
xfs: stringify scrub types in ftrace output
xfs: stringify btree cursor types in ftrace output
xfs: move XFS_INODE_FORMAT_STR mappings to libxfs
xfs: move XFS_AG_BTREE_CMP_FORMAT_STR mappings to libxfs
xfs: fix symbolic enum printing in ftrace output
xfs: fix function pointer type in ftrace format
xfs: Fix x32 ioctls when cmd numbers differ from ia32.
xfs: Fix bulkstat compat ioctls on x32 userspace.
xfs: Align compat attrlist_by_handle with native implementation.
xfs: require both realtime inodes to mount
xfs: cache minimum realtime summary level
xfs: count inode blocks correctly in inobt scrub
xfs: precalculate cluster alignment in inodes and blocks
xfs: precalculate inodes and blocks per inode cluster
xfs: add a block to inode count converter
xfs: remove xfs_rmap_ag_owner and friends
xfs: const-ify xfs_owner_info arguments
xfs: streamline defer op type handling
xfs: idiotproof defer op type configuration
...
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAlwbniUACgkQnJ2qBz9k
QNlACwgA0d/OoJesmZPihnRmFWTScvtXlxsf9NJOmoANaxRA1uToEQsrFmFeUXf+
vrdVwCfVLUACF9mgp1KB/m7HNUHlFps+DBnLZ2XbgIqKKHLANllDujA6v36ZEbAJ
h71KggrYHw8GUjEvOURp5DBOBzdqzp5dS17NNaJquxeXmATx8rGaxszo3/J9zNXv
8GZsVBuD9aZg+1XpusnztzR1MbRc0OtQ0GEryTO/t3LGBN5og8YLxZ6rlZUvO2Qu
x5OmNKU/VGKRPo3boU2dFozeHFPyQTcZdBa/66JI5wzt/qyTinyvyLbhu0fcA98/
U7plgtz8F7wYZXy321ywkvWNY2aALw==
=rJYy
-----END PGP SIGNATURE-----
Merge tag 'fs_for_4.21-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull ext2, udf, and quota update from Jan Kara:
"Some ext2 cleanups, a fix for UDF crash on corrupted media, and one
quota locking fix"
* tag 'fs_for_4.21-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
quota: Lock s_umount in exclusive mode for Q_XQUOTA{ON,OFF} quotactls.
udf: Fix BUG on corrupted inode
ext2: change reusable parameter to true when calling mb_cache_entry_create()
ext2: remove redundant condition check
ext2: avoid unnecessary operation in ext2_error()
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAlwbm5YACgkQnJ2qBz9k
QNnF5wf+L9lWe4LzGf3sCc45JFtcsvLtcYd0XSn93ATRPSIxNYCCDEdvJ9UGfGhG
+2y4Z42KBWkqg2Dx8W2yiQjP1RxJTbWIAEE7uKKyWIojNMbtpjSjRw/n7VmCG7m/
ZHhO3XITHYRCVruIKVfwC0a9zPNWD6RD4+IhHRrZomC1ZrfTzs0OBc6PADXLchdR
0QRkpWlomD3CASGVioYbOFqX3pcDJWUdoFvOyMu2sELkJzAJjC7spB+MfoFeRJsA
QEUoRAdWCSFtxcuM6bkNMc1zXig3niLBka/ozXvwA0192rbOcyGvua3PwqYTdqnu
Bea36CD2G+ta0cBv1M5s7e87surttw==
=wdFw
-----END PGP SIGNATURE-----
Merge tag 'fsnotify_for_v4.21-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull fsnotify updates from Jan Kara:
"Support for new FAN_OPEN_EXEC event and couple of cleanups around
fsnotify"
* tag 'fsnotify_for_v4.21-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
fanotify: Use inode_is_open_for_write
fanotify: Make sure to check event_len when copying
fsnotify/fdinfo: include fdinfo.h for inotify_show_fdinfo()
fanotify: introduce new event mask FAN_OPEN_EXEC_PERM
fsnotify: refactor fsnotify_parent()/fsnotify() paired calls when event is on path
fanotify: introduce new event mask FAN_OPEN_EXEC
fanotify: return only user requested event types in event mask
This set is entirely trivial fixes, mainly around correct cleanup
on error paths and improved error checks. One patch adds scheduling
in a potentially long recovery loop.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJcGnvKAAoJEDgbc8f8gGmq4hcP/0vhF9y+SLXDkvm989n9QBMe
elJ5KhxeHQgYDScv2Nbth9heTGMw2OFzDAUSMRXD45eYy8EjM12Gl8Dz4y8pQBne
5ENeibzPYmD2Smvtyq5T/45HpErlM5NmcefICumaC9WNRQBei/6dD3UEteRnAZbC
qyNPr9/bqFSkMCzJzKRsjfduk5GTYB5FQzIOeHvwn3dShPtTETn+zF5MH/mpswBP
LEtimOZdXVUZzdmbAuOMOfAMLoRdpzLFNxC10g//JyZAbG3dQZdpEdn98iu0oR3c
Y7JUs6JLtkB/FP46v0Ay7xJ7amx8dNnKbEj7UzqveSgo3PxPbIZYa3VUdXfdP/9T
J/+k88behzQ5NG+pETKA51+DaVJ50nAVeQqX001dgUi095esjqmyQDPbDwohsRvf
z+gb0vqwtsbS58/1tRcfRdrzYl/FSHuEcif0rwOwTnqAjOjz+V89wNZo2HbLbOeK
8yyeeGd+jMxom9xwHV79/IPo3U2P5iU/40AfJXW2MMnudbq+kfE7GdNXu2MWPQ1P
yhmwvELoNG8aD93nX14HMhHFUl+Lv9p5VW/RhySFevt/9TKRiwlG1tw7e/XhNJjs
P8K8ry394FlPOPYeAsiabOI9dUi3Divm7/7ChUFSqQUc7wAHjpLbUlFczYCnVz3W
rb/GfmgBFcJQtzURXDnx
=p4Yr
-----END PGP SIGNATURE-----
Merge tag 'dlm-4.21' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm
Pull dlm updates from David Teigland:
"This set is entirely trivial fixes, mainly around correct cleanup on
error paths and improved error checks. One patch adds scheduling in a
potentially long recovery loop"
* tag 'dlm-4.21' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm:
dlm: fix invalid cluster name warning
dlm: NULL check before some freeing functions is not needed
dlm: NULL check before kmem_cache_destroy is not needed
dlm: fix missing idr_destroy for recover_idr
dlm: memory leaks on error path in dlm_user_request()
dlm: lost put_lkb on error path in receive_convert() and receive_unlock()
dlm: possible memory leak on error path in create_lkb()
dlm: fixed memory leaks after failed ls_remove_names allocation
dlm: fix possible call to kfree() for non-initialized pointer
dlm: Don't swamp the CPU with callbacks queued during recovery
dlm: don't leak kernel pointer to userspace
dlm: don't allow zero length names
dlm: fix invalid free
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAlwaO6gACgkQxWXV+ddt
WDuiuRAAktVw4AEogVKW8J9Fx7UCW8hJ0C2NFWSlcvaNfKi7ccMBsINqQIH6bvmv
lO2KUuaKRK+FID3R3Er9Z4GZmvg7nXAEp/+PQgYMYYOFOmsdr/YCZCjBn9ACDZ/K
XJD/ESpEq9FORMj1yHre3zKgCPwkOK66Z8XFRMPaC5ANiLNohuE25FkwTTva0Bse
TyOc8aSKmfXBqoNvlKS0qc52FUoS6aebrYfTA7l9FfmUJ0szZeHSCtRyY4EzlQu7
VLojNq7GkSXpGGQHTTSp73Bj1E2ZM6fwTgpC1NHAFnTCTFSr/GswdfMdmxMkNuqm
zsrioqIzh6kX8RXVvB81Vb1aCRXELhcpBDWK/5QbEPyQvfKxQJLoHBt3QqewwtnT
lClhNi4tx3ItfWddz24kq/E8aaIOQtECq6OQXpyRPc47TmKwY8m8V41LP/ECNBQ2
InIuAsk0abkQvemc98SNxDNV0hiMJHJnR93wY8MEs6JTY3PuVE6ELrMTB7GRsl6M
zHHFD1kSBVNMxWr+pwnDe5NxQ3qP4SB1RZvTlN7FdzLAhWIFEN7foHoXI9h2G0kx
+H/cVr2OiXdxUd19IuAF9kxVkfrJ7A3JEopVdaI57swyu/IK2xGsPd5nDkf4kOXU
OpcnnbDxg6v0NL/Vr/LxWSa6qgVqM2D4JyxjMTMPQOhK7Y6GP4M=
=rO1q
-----END PGP SIGNATURE-----
Merge tag 'for-4.21-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"New features:
- swapfile support - after a long time it's here, with some
limitations where COW design does not work well with the swap
implementation (nodatacow file, no compression, cannot be
snapshotted, not possible on multiple devices, ...), as this is the
most restricted but working setup, we'll try to improve that in the
future
- metadata uuid - an optional incompat feature to assign a new
filesystem UUID without overwriting all metadata blocks, stored
only in superblock
- more balance messages are printed to system log, initial is in the
format of the command line that would be used to start it
Fixes:
- tag pages of a snapshot to better separate pages that are involved
in the snapshot (and need to get synced) from newly dirtied pages
that could slow down or even livelock the snapshot operation
- improved check of filesystem id associated with a device during
scan to detect duplicate devices that could be mixed up during
mount
- fix device replace state transitions, eg. when it ends up
interrupted and reboot tries to restart balance too, or when
start/cancel ioctls race
- fix a crash due to a race when quotas are enabled during snapshot
creation
- GFP_NOFS/memalloc_nofs_* fixes due to GFP_KERNEL allocations in
transaction context
- fix fsync of files with multiple hard links in new directories
- fix race of send with transaction commits that create snapshots
Core changes:
- cleanups:
* further removals of now-dead fsync code
* core function for finding free extent has been split and
provides a base for further cleanups to make the logic more
understandable
* removed lot of indirect callbacks for data and metadata inodes
* simplified refcounting and locking for cloned extent buffers
* removed redundant function arguments
* defines converted to enums where appropriate
- separate reserve for delayed refs from global reserve, update logic
to do less trickery and ad-hoc heuristics, move out some related
expensive operations from transaction commit or file truncate
- dev-replace switched from custom locking scheme to semaphore
- remove first phase of balance that tried to make some space for the
relocation by calling shrink and grow, this did not work as
expected and only introduced more error states due to potential
resize failures, slightly improves the runtime as the chunks on all
devices are not needlessly enumerated
- clone and deduplication now use generic helper that adds a few more
checks that were missing from the original btrfs implementation of
the ioctls"
* tag 'for-4.21-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (125 commits)
btrfs: Fix typos in comments and strings
btrfs: improve error handling of btrfs_add_link
Btrfs: use generic_remap_file_range_prep() for cloning and deduplication
btrfs: Refactor main loop in extent_readpages
btrfs: Remove 1st shrink/grow phase from balance
Btrfs: send, fix race with transaction commits that create snapshots
Btrfs: use nofs context when initializing security xattrs to avoid deadlock
btrfs: run delayed items before dropping the snapshot
btrfs: catch cow on deleting snapshots
btrfs: extent-tree: cleanup one-shot usage of @blocksize in do_walk_down
Btrfs: scrub, move setup of nofs contexts higher in the stack
btrfs: scrub: move scrub_setup_ctx allocation out of device_list_mutex
btrfs: scrub: pass fs_info to scrub_setup_ctx
btrfs: fix truncate throttling
btrfs: don't run delayed refs in the end transaction logic
btrfs: rework btrfs_check_space_for_delayed_refs
btrfs: add new flushing states for the delayed refs rsv
btrfs: update may_commit_transaction to use the delayed refs rsv
btrfs: introduce delayed_refs_rsv
btrfs: only track ref_heads in delayed_ref_updates
...
- Enhancements and performance improvements to journal replay (Abhi Das)
- Cleanup of gfs2_is_ordered and gfs2_is_writeback (Andreas Gruenbacher)
- Fix a potential double-free in inode creation (Andreas Gruenbacher)
- Fix the bitmap search loop that was searching too far (Andreas Gruenbacher)
- Various cleanups (Andreas Gruenbacher, Bob Peterson)
- Implement Steve Whitehouse's patch to dump nrpages for inodes (Bob Peterson)
- Fix a withdraw bug where stuffed journaled data files didn't allocate
enough journal space to be grown (Bob Peterson)
-----BEGIN PGP SIGNATURE-----
iQEcBAABAgAGBQJcGlTwAAoJENeLYdPf93o7oGIIAIC5OfmIU2rz789L1s50m6Xg
fAJ4X+m3d1jbHZDAwd5BXYcCaIMQJpaL89JWZSyuN5vG4hmNq/Q1V0Ho4nU5onkY
sIQ5PlOf0I2uEbe2k9lAyWr2ro+S7QSfW+ogVIE7didOxYYRY825lxgMpJEChW+r
n0a7SqzPBHJglzNUVASrebD7VA7afzJ/Lb4YcXc7p6ewb7++/Kc3W0sewwy8kKER
29W21bxAhzm4vLMyU7h0g51AuLxBljijSZ7J1IHf0UPwQ/KC4QFuWVyufuH9ePuY
5yJxa25HTeZboIeeeuF01UDv6V9TOu83ztrGlI0uoR9rWPxPEA3Y1Spy6wQ/2w8=
=k9oZ
-----END PGP SIGNATURE-----
Merge tag 'gfs2-4.21.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2
Pull gfs2 updates from Bob Peterson:
- Enhancements and performance improvements to journal replay (Abhi
Das)
- Cleanup of gfs2_is_ordered and gfs2_is_writeback (Andreas
Gruenbacher)
- Fix a potential double-free in inode creation (Andreas Gruenbacher)
- Fix the bitmap search loop that was searching too far (Andreas
Gruenbacher)
- Various cleanups (Andreas Gruenbacher, Bob Peterson)
- Implement Steve Whitehouse's patch to dump nrpages for inodes (Bob
Peterson)
- Fix a withdraw bug where stuffed journaled data files didn't allocate
enough journal space to be grown (Bob Peterson)
* tag 'gfs2-4.21.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
gfs2: take jdata unstuff into account in do_grow
gfs2: Dump nrpages for inodes and their glocks
gfs2: Fix loop in gfs2_rbm_find
gfs2: Get rid of potential double-freeing in gfs2_create_inode
gfs2: Remove vestigial bd_ops
gfs2: read journal in large chunks to locate the head
gfs2: add a helper function to get_log_header that can be used elsewhere
gfs2: changes to gfs2_log_XXX_bio
gfs2: add more timing info to journal recovery process
gfs2: Fix the gfs2_invalidatepage description
gfs2: Clean up gfs2_is_{ordered,writeback}
Pull crypto updates from Herbert Xu:
"API:
- Add 1472-byte test to tcrypt for IPsec
- Reintroduced crypto stats interface with numerous changes
- Support incremental algorithm dumps
Algorithms:
- Add xchacha12/20
- Add nhpoly1305
- Add adiantum
- Add streebog hash
- Mark cts(cbc(aes)) as FIPS allowed
Drivers:
- Improve performance of arm64/chacha20
- Improve performance of x86/chacha20
- Add NEON-accelerated nhpoly1305
- Add SSE2 accelerated nhpoly1305
- Add AVX2 accelerated nhpoly1305
- Add support for 192/256-bit keys in gcmaes AVX
- Add SG support in gcmaes AVX
- ESN for inline IPsec tx in chcr
- Add support for CryptoCell 703 in ccree
- Add support for CryptoCell 713 in ccree
- Add SM4 support in ccree
- Add SM3 support in ccree
- Add support for chacha20 in caam/qi2
- Add support for chacha20 + poly1305 in caam/jr
- Add support for chacha20 + poly1305 in caam/qi2
- Add AEAD cipher support in cavium/nitrox"
* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (130 commits)
crypto: skcipher - remove remnants of internal IV generators
crypto: cavium/nitrox - Fix build with !CONFIG_DEBUG_FS
crypto: salsa20-generic - don't unnecessarily use atomic walk
crypto: skcipher - add might_sleep() to skcipher_walk_virt()
crypto: x86/chacha - avoid sleeping under kernel_fpu_begin()
crypto: cavium/nitrox - Added AEAD cipher support
crypto: mxc-scc - fix build warnings on ARM64
crypto: api - document missing stats member
crypto: user - remove unused dump functions
crypto: chelsio - Fix wrong error counter increments
crypto: chelsio - Reset counters on cxgb4 Detach
crypto: chelsio - Handle PCI shutdown event
crypto: chelsio - cleanup:send addr as value in function argument
crypto: chelsio - Use same value for both channel in single WR
crypto: chelsio - Swap location of AAD and IV sent in WR
crypto: chelsio - remove set but not used variable 'kctx_len'
crypto: ux500 - Use proper enum in hash_set_dma_transfer
crypto: ux500 - Use proper enum in cryp_set_dma_transfer
crypto: aesni - Add scatter/gather avx stubs, and use them in C
crypto: aesni - Introduce partial block macro
..
There is a security report where f2fs_getxattr() has a hole to expose wrong
memory region when the image is malformed like this.
f2fs_getxattr: entry->e_name_len: 4, size: 12288, buffer_size: 16384, len: 4
Cc: <stable@vger.kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
iput() on sbi->node_inode can update sbi->stat_info
in the below context, if the f2fs_write_checkpoint()
has failed with error.
f2fs_balance_fs_bg+0x1ac/0x1ec
f2fs_write_node_pages+0x4c/0x260
do_writepages+0x80/0xbc
__writeback_single_inode+0xdc/0x4ac
writeback_single_inode+0x9c/0x144
write_inode_now+0xc4/0xec
iput+0x194/0x22c
f2fs_put_super+0x11c/0x1e8
generic_shutdown_super+0x70/0xf4
kill_block_super+0x2c/0x5c
kill_f2fs_super+0x44/0x50
deactivate_locked_super+0x60/0x8c
deactivate_super+0x68/0x74
cleanup_mnt+0x40/0x78
Fix this by moving f2fs_destroy_stats() further below iput() in
both f2fs_put_super() and f2fs_fill_super() paths.
Signed-off-by: Sahitya Tummala <stummala@codeaurora.org>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
For all ordered cases in f2fs_wait_on_page_writeback(), we need to
check PageWriteback status, so let's clean up to relocate the check
into f2fs_wait_on_page_writeback().
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Treat "block_count" from struct f2fs_super_block as 64-bit little endian
value in sanity_check_raw_super() because struct f2fs_super_block
declares "block_count" as "__le64".
This fixes a bug where the superblock validation fails on big endian
devices with the following error:
F2FS-fs (sda1): Wrong segment_count / block_count (61439 > 0)
F2FS-fs (sda1): Can't find valid F2FS filesystem in 1th superblock
F2FS-fs (sda1): Wrong segment_count / block_count (61439 > 0)
F2FS-fs (sda1): Can't find valid F2FS filesystem in 2th superblock
As result of this the partition cannot be mounted.
With this patch applied the superblock validation works fine and the
partition can be mounted again:
F2FS-fs (sda1): Mounted with checkpoint version = 7c84
My little endian x86-64 hardware was able to mount the partition without
this fix.
To confirm that mounting f2fs filesystems works on big endian machines
again I tested this on a 32-bit MIPS big endian (lantiq) device.
Fixes: 0cfe75c5b0 ("f2fs: enhance sanity_check_raw_super() to avoid potential overflows")
Cc: stable@vger.kernel.org
Signed-off-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If user change inode's i_flags via ioctl, let's add it into global
dirty list, so that checkpoint can guarantee its persistence before
fsync, it can make checkpoint keeping strong consistency.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The union in struct extent_node wass only to indicate below fields
struct rb_node rb_node;
union {
struct {
unsigned int fofs;
unsigned int len;
...
...
can be parsed as fields in struct rb_entry, but they were never be
used explicitly before, so let's remove them for cleanup.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Should use lstart (logical start address) instead of start (in dev) here.
This fixes a bug in multi-device scenarios.
Signed-off-by: Qiuyang Sun <sunqiuyang@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
When there is a failure in f2fs_fill_super() after/during
the recovery of fsync'd nodes, it frees the current sbi and
retries again. This time the mount is successful, but the files
that got recovered before retry, still holds the extent tree,
whose extent nodes list is corrupted since sbi and sbi->extent_list
is freed up. The list_del corruption issue is observed when the
file system is getting unmounted and when those recoverd files extent
node is being freed up in the below context.
list_del corruption. prev->next should be fffffff1e1ef5480, but was (null)
<...>
kernel BUG at kernel/msm-4.14/lib/list_debug.c:53!
lr : __list_del_entry_valid+0x94/0xb4
pc : __list_del_entry_valid+0x94/0xb4
<...>
Call trace:
__list_del_entry_valid+0x94/0xb4
__release_extent_node+0xb0/0x114
__free_extent_tree+0x58/0x7c
f2fs_shrink_extent_tree+0xdc/0x3b0
f2fs_leave_shrinker+0x28/0x7c
f2fs_put_super+0xfc/0x1e0
generic_shutdown_super+0x70/0xf4
kill_block_super+0x2c/0x5c
kill_f2fs_super+0x44/0x50
deactivate_locked_super+0x60/0x8c
deactivate_super+0x68/0x74
cleanup_mnt+0x40/0x78
__cleanup_mnt+0x1c/0x28
task_work_run+0x48/0xd0
do_notify_resume+0x678/0xe98
work_pending+0x8/0x14
Fix this by not creating extents for those recovered files if shrinker is
not registered yet. Once mount is successful and shrinker is registered,
those files can have extents again.
Signed-off-by: Sahitya Tummala <stummala@codeaurora.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch cleans up checkpoint flow a bit:
- remove unneeded circulation of flushing meta pages.
- don't flush nat_bits pages in prior to other checkpoint pages.
- add bug_on to check remained meta pages after flushing.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Sometimes, I could observe # of issuing_discard to be 1 which blocks background
jobs due to is_idle()=false.
The only way to get out of it was to trigger gc_urgent. This patch avoids that
by checking any candidates as done in the list.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
One report says memalloc failure during mount.
(unwind_backtrace) from [<c010cd4c>] (show_stack+0x10/0x14)
(show_stack) from [<c049c6b8>] (dump_stack+0x8c/0xa0)
(dump_stack) from [<c024fcf0>] (warn_alloc+0xc4/0x160)
(warn_alloc) from [<c0250218>] (__alloc_pages_nodemask+0x3f4/0x10d0)
(__alloc_pages_nodemask) from [<c0270450>] (kmalloc_order_trace+0x2c/0x120)
(kmalloc_order_trace) from [<c03fa748>] (build_node_manager+0x35c/0x688)
(build_node_manager) from [<c03de494>] (f2fs_fill_super+0xf0c/0x16cc)
(f2fs_fill_super) from [<c02a5864>] (mount_bdev+0x15c/0x188)
(mount_bdev) from [<c03da624>] (f2fs_mount+0x18/0x20)
(f2fs_mount) from [<c02a68b8>] (mount_fs+0x158/0x19c)
(mount_fs) from [<c02c3c9c>] (vfs_kern_mount+0x78/0x134)
(vfs_kern_mount) from [<c02c76ac>] (do_mount+0x474/0xca4)
(do_mount) from [<c02c8264>] (SyS_mount+0x94/0xbc)
(SyS_mount) from [<c0108180>] (ret_fast_syscall+0x0/0x48)
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Commit 089842de ("f2fs: remove codes of unused wio_mutex") removes codes
of unused wio_mutex, but missing the comment, so delete it.
Signed-off-by: Yunlong Song <yunlong.song@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Pull RCU updates from Ingo Molnar:
"The biggest RCU changes in this cycle were:
- Convert RCU's BUG_ON() and similar calls to WARN_ON() and similar.
- Replace calls of RCU-bh and RCU-sched update-side functions to
their vanilla RCU counterparts. This series is a step towards
complete removal of the RCU-bh and RCU-sched update-side functions.
( Note that some of these conversions are going upstream via their
respective maintainers. )
- Documentation updates, including a number of flavor-consolidation
updates from Joel Fernandes.
- Miscellaneous fixes.
- Automate generation of the initrd filesystem used for rcutorture
testing.
- Convert spin_is_locked() assertions to instead use lockdep.
( Note that some of these conversions are going upstream via their
respective maintainers. )
- SRCU updates, especially including a fix from Dennis Krein for a
bag-on-head-class bug.
- RCU torture-test updates"
* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (112 commits)
rcutorture: Don't do busted forward-progress testing
rcutorture: Use 100ms buckets for forward-progress callback histograms
rcutorture: Recover from OOM during forward-progress tests
rcutorture: Print forward-progress test age upon failure
rcutorture: Print time since GP end upon forward-progress failure
rcutorture: Print histogram of CB invocation at OOM time
rcutorture: Print GP age upon forward-progress failure
rcu: Print per-CPU callback counts for forward-progress failures
rcu: Account for nocb-CPU callback counts in RCU CPU stall warnings
rcutorture: Dump grace-period diagnostics upon forward-progress OOM
rcutorture: Prepare for asynchronous access to rcu_fwd_startat
torture: Remove unnecessary "ret" variables
rcutorture: Affinity forward-progress test to avoid housekeeping CPUs
rcutorture: Break up too-long rcu_torture_fwd_prog() function
rcutorture: Remove cbflood facility
torture: Bring any extra CPUs online during kernel startup
rcutorture: Add call_rcu() flooding forward-progress tests
rcutorture/formal: Replace synchronize_sched() with synchronize_rcu()
tools/kernel.h: Replace synchronize_sched() with synchronize_rcu()
net/decnet: Replace rcu_barrier_bh() with rcu_barrier()
...
Pull sparc updates from David Miller:
- Automatic system call table generation, from Firoz Khan.
- Clean up accesses to the OF device names by using full_name instead
of path_component_name.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc-next:
ALSA: sparc: Use of_node_name_eq for node name comparisons
sbus: Use of_node_name_eq for node name comparisons
sparc: generate uapi header and system call table files
sparc: add system call table generation support
sparc: add __NR_syscalls along with NR_syscalls
sparc: move __IGNORE* entries to non uapi header
sparc: Use DT node full_name instead of name for resources
sparc: Remove unused leon_trans_init
sparc: Use device_type helpers to access the node type
sparc: Use of_node_name_eq for node name comparisons
sparc: Convert to using %pOFn instead of device_node.name
sparc: prom: use property "name" directly to construct node names
of: Drop full path from full_name for PDT systems
sparc: Convert to using %pOF instead of full_name
fs/openpromfs: Use of_node_name_eq for node name comparisons
fs/openpromfs: use full_name instead of path_component_name
mds contains an optimization, it does not re-issue stale caps if
client does not want any cap.
A special case of the optimization is that client wants some caps,
but skipped updating 'wanted'. For this case, client needs to update
'wanted' when stale session get renewed.
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
When reading cached inode that already has Fscr caps, this can avoid
two cap messages (one updats 'wanted' caps, one clears 'wanted' caps).
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Updating mseq makes client think importer mds has accepted all prior
cap messages and importer mds knows what caps client wants. Actually
some cap messages may have been dropped because of mseq mismatch.
If mseq is left untouched, importing cap's mds_wanted later will get
reset by cap import message.
Cc: stable@vger.kernel.org
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
There is redundant assighment of variable i in
ceph_mdsmap_get_random_mds(), just remvoe it.
Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
splice_dentry() may drop the original dentry and return other
dentry. It relies on its caller to update pointer that points
to the dropped dentry. This is error-prone.
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Core changes:
- Parse the 4BAIT SFDP section
- Add a bunch of SPI NOR entries to the flash_info table
- Add the concept of SFDP fixups and use it to fix a bug on MX25L25635F
- A bunch of minor cleanups/comestic changes
NAND changes:
NAND core changes:
- kernel-doc miscellaneous fixes.
- Third batch of fixes/cleanup to the raw NAND core impacting various
controller drivers (ams-delta, marvell, fsmc, denali, tegra, vf610):
* Stopping to pass mtd_info objects to internal functions
* Reorganizing code to avoid forward declarations
* Dropping useless test in nand_legacy_set_defaults()
* Moving nand_exec_op() to internal.h
* Adding nand_[de]select_target() helpers
* Passing the CS line to be selected in struct nand_operation
* Making ->select_chip() optional when ->exec_op() is implemented
* Deprecating the ->select_chip() hook
* Moving the ->exec_op() method to nand_controller_ops
* Moving ->setup_data_interface() to nand_controller_ops
* Deprecating the dummy_controller field
* Fixing JEDEC detection
* Providing a helper for polling GPIO R/B pin
Raw NAND chip drivers changes:
- Macronix:
* Flagging 1.8V AC chips with a broken GET_FEATURES(TIMINGS)
Raw NAND controllers drivers changes:
- Ams-delta:
* Fixing the error path
* SPDX tag added
* May be compiled with COMPILE_TEST=y
* Conversion to ->exec_op() interface
* Dropping .IOADDR_R/W use
* Use GPIO API for data I/O
- Denali:
* Removing denali_reset_banks()
* Removing ->dev_ready() hook
* Including <linux/bits.h> instead of <linux/bitops.h>
* Changes to comply with the above fixes/cleanup done in the core.
- FSMC:
* Adding an SPDX tag to replace the license text
* Making conversion from chip to fsmc consistent
* Fixing unchecked return value in fsmc_read_page_hwecc
* Changes to comply with the above fixes/cleanup done in the core.
- Marvell:
* Preventing timeouts on a loaded machine (fix)
* Changes to comply with the above fixes/cleanup done in the core.
- OMAP2:
* Pass the parent of pdev to dma_request_chan() (fix)
- R852:
* Use generic DMA API
- sh_flctl:
* Converting to SPDX identifiers
- Sunxi:
* Write pageprog related opcodes to the right register: WCMD_SET (fix)
- Tegra:
* Stop implementing ->select_chip()
- VF610:
* Adding an SPDX tag to replace the license text
* Changes to comply with the above fixes/cleanup done in the core.
- Various trivial/spelling/coding style fixes.
SPI-NAND drivers changes:
- Removing the depreacated mt29f_spinand driver from staging.
- Adding support for:
* Toshiba TC58CVG2S0H
* GigaDevice GD5FxGQ4xA
* Winbond W25N01GV
JFFS2 changes:
- Fix a lockdep issue
MTD changes:
- Rework the physmap driver to merge gpio-addr-flash and physmap_of
in it
- Add a new compatible for RedBoot partitions
- Make sub-partitions RW if the parent partition was RO because of a
mis-alignment
- Add pinctrl support to the
- Addition of /* fall-through */ comments where appropriate
- Various minor fixes and cleanups
Other changes:
- Update my email address
-----BEGIN PGP SIGNATURE-----
iQJQBAABCgA6FiEEKmCqpbOU668PNA69Ze02AX4ItwAFAlwZRoocHGJvcmlzLmJy
ZXppbGxvbkBib290bGluLmNvbQAKCRBl7TYBfgi3ALhmEACGoyJLVxRsE0yXOex7
JdPhZxnOkwnzaxQPQBQ7pFrWvRinV1Th8JenCCLiyM6DBcIDRo/87etC6cnjfnH/
eJ8e7xPLiESCsAB8xixP1YvLaZQzjBKOH/+qNR6anp505ZNKhgsuUDATLSFfaEvz
bZv3f1V70dPXdEK/3QXDZakqVtfvVBMSdyJMWdKSdVn70yA72wCPP4igcGdfYfoZ
KKaanhS1EdxA++WFrRqocQay9rtjnFGONHLfrefop9YKSevLp1UDZSAYSk4CHcEf
yaMFD+qlxc0wHlOk4XyDY3vREcdO50r2vXfN/Hxf0D6JeC/6RT0Hrm0YyY8X/u7l
xLhSv+DKGyft3SnQ4MDdwg57bvDSOkPryI3cxBul3S6I00pCdo8l5zQskezDbk0E
CkUuB+f7Wn3lmV7W8ZNbYHx0ljVJEXMxEQ6m/6ZLizIr/aC7m5ncgv8a3mL7QQQA
KXuLamw9pPlf/tAOQxB1PTiE1n8ECYqcInJXzoxaRzUdn8TlIYjxUb7GIDoOFpzm
6a3oCx6tUZq4j/GAoyYgo12NhH0aYW3X/V8N7SfPjmIofvFiUL1y1VQ2ghNDLCqH
2cz6WsFGoA0oSVnxl/TSSM6/ZXbvnLgM4zq6+wZITyCFNwyC956MddZrjL6lVkMT
s6kvh/h3wBWN+GxwlDeY+KZMlQ==
=myqK
-----END PGP SIGNATURE-----
Merge tag 'mtd/for-4.21' of git://git.infradead.org/linux-mtd
Pull mtd updates from Boris Brezillon:
"SPI NOR Core changes:
- Parse the 4BAIT SFDP section
- Add a bunch of SPI NOR entries to the flash_info table
- Add the concept of SFDP fixups and use it to fix a bug on MX25L25635F
- A bunch of minor cleanups/comestic changes
NAND core changes:
- kernel-doc miscellaneous fixes.
- Third batch of fixes/cleanup to the raw NAND core impacting various
controller drivers (ams-delta, marvell, fsmc, denali, tegra,
vf610):
* Stop to pass mtd_info objects to internal functions
* Reorganize code to avoid forward declarations
* Drop useless test in nand_legacy_set_defaults()
* Move nand_exec_op() to internal.h
* Add nand_[de]select_target() helpers
* Pass the CS line to be selected in struct nand_operation
* Make ->select_chip() optional when ->exec_op() is implemented
* Deprecate the ->select_chip() hook
* Move the ->exec_op() method to nand_controller_ops
* Move ->setup_data_interface() to nand_controller_ops
* Deprecate the dummy_controller field
* Fix JEDEC detection
* Provide a helper for polling GPIO R/B pin
Raw NAND chip drivers changes:
- Macronix:
* Flag 1.8V AC chips with a broken GET_FEATURES(TIMINGS)
Raw NAND controllers drivers changes:
- Ams-delta:
* Fix the error path
* SPDX tag added
* May be compiled with COMPILE_TEST=y
* Conversion to ->exec_op() interface
* Drop .IOADDR_R/W use
* Use GPIO API for data I/O
- Denali:
* Remove denali_reset_banks()
* Remove ->dev_ready() hook
* Include <linux/bits.h> instead of <linux/bitops.h>
* Changes to comply with the above fixes/cleanup done in the core.
- FSMC:
* Add an SPDX tag to replace the license text
* Make conversion from chip to fsmc consistent
* Fix unchecked return value in fsmc_read_page_hwecc
* Changes to comply with the above fixes/cleanup done in the core.
- Marvell:
* Prevent timeouts on a loaded machine (fix)
* Changes to comply with the above fixes/cleanup done in the core.
- OMAP2:
* Pass the parent of pdev to dma_request_chan() (fix)
- R852:
* Use generic DMA API
- sh_flctl:
* Convert to SPDX identifiers
- Sunxi:
* Write pageprog related opcodes to the right register: WCMD_SET (fix)
- Tegra:
* Stop implementing ->select_chip()
- VF610:
* Add an SPDX tag to replace the license text
* Changes to comply with the above fixes/cleanup done in the core.
- Various trivial/spelling/coding style fixes.
SPI-NAND drivers changes:
- Remove the depreacated mt29f_spinand driver from staging.
- Add support for:
* Toshiba TC58CVG2S0H
* GigaDevice GD5FxGQ4xA
* Winbond W25N01GV
JFFS2 changes:
- Fix a lockdep issue
MTD changes:
- Rework the physmap driver to merge gpio-addr-flash and physmap_of
in it
- Add a new compatible for RedBoot partitions
- Make sub-partitions RW if the parent partition was RO because of a
mis-alignment
- Add pinctrl support to the
- Addition of /* fall-through */ comments where appropriate
- Various minor fixes and cleanups
Other changes:
- Update my email address"
* tag 'mtd/for-4.21' of git://git.infradead.org/linux-mtd: (108 commits)
mtd: rawnand: sunxi: Write pageprog related opcodes to WCMD_SET
MAINTAINERS: Update my email address
mtd: rawnand: marvell: prevent timeouts on a loaded machine
mtd: rawnand: omap2: Pass the parent of pdev to dma_request_chan()
mtd: rawnand: Fix JEDEC detection
mtd: spi-nor: Add support for is25lp016d
mtd: spi-nor: parse SFDP 4-byte Address Instruction Table
mtd: spi-nor: Add 4B_OPCODES flag to is25lp256
mtd: spi-nor: Add an SPDX tag to spi-nor.{c,h}
mtd: spi-nor: Make the enable argument passed to set_byte() a bool
mtd: spi-nor: Stop passing flash_info around
mtd: spi-nor: Avoid forward declaration of internal functions
mtd: spi-nor: Drop inline on all internal helpers
mtd: spi-nor: Add a post BFPT fixup for MX25L25635E
mtd: spi-nor: Add a post BFPT parsing fixup hook
mtd: spi-nor: Add the SNOR_F_4B_OPCODES flag
mtd: spi-nor: cast to u64 to avoid uint overflows
mtd: spi-nor: Add support for IS25LP032/064
mtd: spi-nor: add entry for mt35xu512aba flash
mtd: spi-nor: add macros related to MICRON flash
...
The ext4_inline_data_fiemap() function calls fiemap_fill_next_extent()
while still holding the xattr semaphore. This is not necessary and it
triggers a circular lockdep warning. This is because
fiemap_fill_next_extent() could trigger a page fault when it writes
into page which triggers a page fault. If that page is mmaped from
the inline file in question, this could very well result in a
deadlock.
This problem can be reproduced using generic/519 with a file system
configuration which has the inline_data feature enabled.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
There are enough credits reserved for most dioread_nolock writes;
however, if the extent tree is sufficiently deep, and/or quota is
enabled, the code was not allowing for all eventualities when
reserving journal credits for the unwritten extent conversion.
This problem can be seen using xfstests ext4/034:
WARNING: CPU: 1 PID: 257 at fs/ext4/ext4_jbd2.c:271 __ext4_handle_dirty_metadata+0x10c/0x180
Workqueue: ext4-rsv-conversion ext4_end_io_rsv_work
RIP: 0010:__ext4_handle_dirty_metadata+0x10c/0x180
...
EXT4-fs: ext4_free_blocks:4938: aborting transaction: error 28 in __ext4_handle_dirty_metadata
EXT4: jbd2_journal_dirty_metadata failed: handle type 11 started at line 4921, credits 4/0, errcode -28
EXT4-fs error (device dm-1) in ext4_free_blocks:4950: error 28
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
This will be needed by DFS cache.
Signed-off-by: Paulo Alcantara <palcantara@suse.de>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Different servers have different set of file ids.
After failover, unique IDs will be different so we can't validate
them.
Signed-off-by: Aurelien Aptel <aaptel@suse.com>
Reviewed-by: Paulo Alcantara <palcantara@suse.de>
Signed-off-by: Steve French <stfrench@microsoft.com>
If we only want to get the mount options strings, do not return the
devname.
For DFS failover, we'll be passing the DFS full path down to
cifs_mount() rather than the devname.
Signed-off-by: Paulo Alcantara <palcantara@suse.de>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
When extracting hostname from UNC, check for leading backslashes
before trying to remove them.
Signed-off-by: Paulo Alcantara <palcantara@suse.de>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
* Split and refactor the very large function cifs_mount() in multiple
functions:
- tcp, ses and tcon setup to mount_get_conns()
- tcp, ses and tcon cleanup in mount_put_conns()
- tcon tlink setup to mount_setup_tlink()
- remote path checking to is_path_remote()
* Implement 2 version of cifs_mount() for DFS-enabled builds and
non-DFS-enabled builds (CONFIG_CIFS_DFS_UPCALL).
In preparation for DFS failover support.
Signed-off-by: Paulo Alcantara <palcantara@suse.de>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
While resolving a bug with locks on samba shares found a strange behavior.
When a file locked by one node and we trying to lock it from another node
it fail with errno 5 (EIO) but in that case errno must be set to
(EACCES | EAGAIN).
This isn't happening when we try to lock file second time on same node.
In this case it returns EACCES as expected.
Also this issue not reproduces when we use SMB1 protocol (vers=1.0 in
mount options).
Further investigation showed that the mapping from status_to_posix_error
is different for SMB1 and SMB2+ implementations.
For SMB1 mapping is [NT_STATUS_LOCK_NOT_GRANTED to ERRlock]
(See fs/cifs/netmisc.c line 66)
but for SMB2+ mapping is [STATUS_LOCK_NOT_GRANTED to -EIO]
(see fs/cifs/smb2maperror.c line 383)
Quick changes in SMB2+ mapping from EIO to EACCES has fixed issue.
BUG: https://bugzilla.kernel.org/show_bug.cgi?id=201971
Signed-off-by: Georgy A Bystrenin <gkot@altlinux.org>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
CC: Stable <stable@vger.kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
When pinning memory failed, we should return the correct error code and
rewind the SMB credits.
Reported-by: Murphy Zhou <jencce.kernel@gmail.com>
Signed-off-by: Long Li <longli@microsoft.com>
Cc: stable@vger.kernel.org
Cc: Murphy Zhou <jencce.kernel@gmail.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
The current code attempts to pin memory using the largest possible wsize
based on the currect SMB credits. This doesn't cause kernel oops but this
is not optimal as we may pin more pages then actually needed.
Fix this by only pinning what are needed for doing this write I/O.
Signed-off-by: Long Li <longli@microsoft.com>
Cc: stable@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Joey Pabalinas <joeypabalinas@gmail.com>
RHBZ: 1021460
There is an issue where when multiple threads open/close the same directory
ntwrk_buf_start might end up being NULL, causing the call to smbCalcSize
later to oops with a NULL deref.
The real bug is why this happens and why this can become NULL for an
open cfile, which should not be allowed.
This patch tries to avoid a oops until the time when we fix the underlying
issue.
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
password_with_pad is a fixed size buffer of 16 bytes, it contains a
password string, to be padded with \0 if shorter than 16 bytes
but is just truncated if longer.
It is not, and we do not depend on it to be, nul terminated.
As such, do not use strncpy() to populate this buffer since
the str* prefix suggests that this is a string, which it is not,
and it also confuses coverity causing a false warning.
Detected by CoverityScan CID#113743 ("Buffer not null terminated")
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Fixes gcc '-Wunused-but-set-variable' warning:
fs/cifs/sess.c: In function '_sess_auth_rawntlmssp_assemble_req':
fs/cifs/sess.c:1157:18: warning:
variable 'smb_buf' set but not used [-Wunused-but-set-variable]
It never used since commit cc87c47d9d ("cifs: Separate rawntlmssp auth
from CIFS_SessSetup()")
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
To avoid the warning:
warning: this statement may fall through [-Wimplicit-fallthrough=]
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Reviewed-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reducing the number of network roundtrips improves the performance
of query xattrs
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Technically 3.02 is not the dialect name although that is more familiar to
many, so we should also accept the official dialect name (3.0.2 vs. 3.02)
in vers=
Signed-off-by: Kenneth D'souza <kdsouza@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
This is not actually a bug but as Coverity points out we shouldn't
be doing an "|=" on a value which hasn't been set (although technically
it was memset to zero so isn't a bug) and so might as well change
"|=" to "=" in this line
Detected by CoverityScan, CID#728535 ("Unitialized scalar variable")
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
As Coverity points out le16_to_cpu(midEntry->Command) can not be
less than zero.
Detected by CoverityScan, CID#1438650 ("Macro compares unsigned to 0")
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Improve performance by reducing number of network round trips
for set xattr.
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Trivial fix to clean up indentation, replace spaces with tab
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Pull vfs fixes from Al Viro:
"A couple of fixes - no common topic ;-)"
[ The aio spectre patch also came in from Jens, so now we have that
doubly fixed .. ]
* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
proc/sysctl: don't return ENOMEM on lookup when a table is unregistering
aio: fix spectre gadget in lookup_ioctx
This reverts commit 55956b59df.
commit 55956b59df ("vfs: Allow userns root to call mknod on owned filesystems.")
enabled mknod() in user namespaces for userns root if CAP_MKNOD is
available. However, these device nodes are useless since any filesystem
mounted from a non-initial user namespace will set the SB_I_NODEV flag on
the filesystem. Now, when a device node s created in a non-initial user
namespace a call to open() on said device node will fail due to:
bool may_open_dev(const struct path *path)
{
return !(path->mnt->mnt_flags & MNT_NODEV) &&
!(path->mnt->mnt_sb->s_iflags & SB_I_NODEV);
}
The problem with this is that as of the aforementioned commit mknod()
creates partially functional device nodes in non-initial user namespaces.
In particular, it has the consequence that as of the aforementioned commit
open() will be more privileged with respect to device nodes than mknod().
Before it was the other way around. Specifically, if mknod() succeeded
then it was transparent for any userspace application that a fatal error
must have occured when open() failed.
All of this breaks multiple userspace workloads and a widespread assumption
about how to handle mknod(). Basically, all container runtimes and systemd
live by the slogan "ask for forgiveness not permission" when running user
namespace workloads. For mknod() the assumption is that if the syscall
succeeds the device nodes are useable irrespective of whether it succeeds
in a non-initial user namespace or not. This logic was chosen explicitly
to allow for the glorious day when mknod() will actually be able to create
fully functional device nodes in user namespaces.
A specific problem people are already running into when running 4.18 rc
kernels are failing systemd services. For any distro that is run in a
container systemd services started with the PrivateDevices= property set
will fail to start since the device nodes in question cannot be
opened (cf. the arguments in [1]).
Full disclosure, Seth made the very sound argument that it is already
possible to end up with partially functional device nodes. Any filesystem
mounted with MS_NODEV set will allow mknod() to succeed but will not allow
open() to succeed. The difference to the case here is that the MS_NODEV
case is transparent to userspace since it is an explicitly set mount option
while the SB_I_NODEV case is an implicit property enforced by the kernel
and hence opaque to userspace.
[1]: https://github.com/systemd/systemd/pull/9483
Signed-off-by: Christian Brauner <christian@brauner.io>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Seth Forshee <seth.forshee@canonical.com>
Cc: Serge Hallyn <serge@hallyn.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
At mount time, we allocate m_rsum_cache with the number of realtime
bitmap blocks. However, xfs_growfs_rt() can increase the number of
realtime bitmap blocks. Using the cache after this happens may access
out of the bounds of the cache. Fix it by reallocating the cache in this
case.
Fixes: 355e353213 ("xfs: cache minimum realtime summary level")
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
get_unlocked_entry() uses an exclusive wait because it is guaranteed to
eventually obtain the lock and follow on with an unlock+wakeup cycle.
The wait_entry_unlocked() path does not have the same guarantee. Rather
than open-code an extra wakeup, just switch to a non-exclusive wait.
Cc: Jan Kara <jack@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
This patch removes the check from nfs_compare_mount_options to see if a
`sec' option was passed for the current mount before comparing auth
flavors and instead just always compares auth flavors.
Consider the following scenario:
You have a server with the address 192.168.1.1 and two exports /export/a
and /export/b. The first export supports `sys' and `krb5' security, the
second just `sys'.
Assume you start with no mounts from the server.
The following results in EIOs being returned as the kernel nfs client
incorrectly thinks it can share the underlying `struct nfs_server's:
$ mkdir /tmp/{a,b}
$ sudo mount -t nfs -o vers=3,sec=krb5 192.168.1.1:/export/a /tmp/a
$ sudo mount -t nfs -o vers=3 192.168.1.1:/export/b /tmp/b
$ df >/dev/null
df: ‘/tmp/b’: Input/output error
Signed-off-by: Chris Perl <cperl@janestreet.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAlwcW2IACgkQiiy9cAdy
T1FaLAv/Vs0QhYATaJHkvmjk7EdFoXTZST2wYxPcPSrOfWbCaMjsKREwPVUjOCiW
W//zL/Qbi0FSVMWdoskJP0KQC5rAhUSxlvtDxrpzqszfeJprIs0oL5IRlOwxNdtT
F9/i8+s/AuYq0TTn15UtqZHS6wZdWcerdttWV1V/97hEwcO5Xg0pEtyCmLPf7k8W
wNdxCAQFZ9j2pDVyuJO3a0+Tas34dc2t/cac12h0qeXbrE6e88bQWa6bpEKSvCKr
cMU94pkajCKeelZOhq+ga7cCmlBJs6gt4sgsKEsoDn72tQCsWVH6p1N4+AxmLsZU
bR65XodusR1WHMesSth7QraUk0pIQ4ZzMRPZJCkh9bSjaa+fxX1Up/sjn74q1prf
DHJ/52rQrWK3hvETUZD2B6N9AEDN0swbqeCJRlUYlzG5OEdfit9qSgfTaRzYxVnX
+tct7j+8mjzk+rsGuTXrQupPXUndPTcpUrKFp5db9Sejcx/Gw/atKhW6mVgL3LiR
8kVIraSV
=+8yd
-----END PGP SIGNATURE-----
Merge tag '4.20-rc7-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull smb3 fix from Steve French:
"An important smb3 fix for an regression to some servers introduced by
compounding optimization to rmdir.
This fix has been tested by multiple developers (including me) with
the usual private xfstesting, but also by the new cifs/smb3 "buildbot"
xfstest VMs (thank you Ronnie and Aurelien for good work on this
automation). The automated testing has been updated so that it will
catch problems like this in the future.
Note that Pavel discovered (very recently) some unrelated but
extremely important bugs in credit handling (smb3 flow control problem
that can lead to disconnects/reconnects) when compounding, that I
would have liked to send in ASAP but the complete testing of those two
fixes may not be done in time and have to wait for 4.21"
* tag '4.20-rc7-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
smb3: Fix rmdir compounding regression to strict servers
Adding options to growing mnt_opts. NFS kludge with passing
context= down into non-text-options mount switched to it, and
with that the last use of ->sb_parse_opts_str() is gone.
Reviewed-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Keep void * instead, allocate on demand (in parse_str_opts, at the
moment). Eventually both selinux and smack will be better off
with private structures with several strings in those, rather than
this "counter and two pointers to dynamically allocated arrays"
ugliness. This commit allows to do that at leisure, without
disrupting anything outside of given module.
Changes:
* instead of struct security_mnt_opt use an opaque pointer
initialized to NULL.
* security_sb_eat_lsm_opts(), security_sb_parse_opts_str() and
security_free_mnt_opts() take it as var argument (i.e. as void **);
call sites are unchanged.
* security_sb_set_mnt_opts() and security_sb_remount() take
it by value (i.e. as void *).
* new method: ->sb_free_mnt_opts(). Takes void *, does
whatever freeing that needs to be done.
* ->sb_set_mnt_opts() and ->sb_remount() might get NULL as
mnt_opts argument, meaning "empty".
Reviewed-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* if mount(2) passes something like "context=foo" with MS_REMOUNT
in flags (/sbin/mount.nfs will _not_ do that - you need to issue
the syscall manually), you'll get leaked copies for LSM options.
The reason is that instead of nfs_{alloc,free}_parsed_mount_data()
nfs_remount() uses kzalloc/kfree, which lacks the needed cleanup.
* selinux options are not changed on remount (as for any other
fs), but in case of NFS the failure is quiet - they are not compared
to what we used to have, with complaint in case of attempted changes.
Trivially fixed by converting to use of security_sb_remount().
Reviewed-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
1) keeping a copy in btrfs_fs_info is completely pointless - we never
use it for anything. Getting rid of that allows for simpler calling
conventions for setup_security_options() (caller is responsible for
freeing mnt_opts in all cases).
2) on remount we want to use ->sb_remount(), not ->sb_set_mnt_opts(),
same as we would if not for FS_BINARY_MOUNTDATA. Behaviours *are*
close (in fact, selinux sb_set_mnt_opts() ought to punt to
sb_remount() in "already initialized" case), but let's handle
that uniformly. And the only reason why the original btrfs changes
didn't go for security_sb_remount() in btrfs_remount() case is that
it hadn't been exported. Let's export it for a while - it'll be
going away soon anyway.
Reviewed-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
... leaving the "is it kernel-internal" logics in the caller.
Reviewed-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
combination of alloc_secdata(), security_sb_copy_data(),
security_sb_parse_opt_str() and free_secdata().
Reviewed-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This paves the way for retaining the LSM options from a common filesystem
mount context during a mount parameter parsing phase to be instituted prior
to actual mount/reconfiguration actions.
Reviewed-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This paves the way for retaining the LSM options from a common filesystem
mount context during a mount parameter parsing phase to be instituted prior
to actual mount/reconfiguration actions.
Reviewed-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
iomap_is_partially_uptodate() is intended to check wither blocks within
the selected range of a not-uptodate page are uptodate; if the range we
care about is up to date, it's an optimization.
However, the iomap implementation continues to check all blocks up to
from+count, which is beyond the page, and can even be well beyond the
iop->uptodate bitmap.
I think the worst that will happen is that we may eventually find a zero
bit and return "not partially uptodate" when it would have otherwise
returned true, and skip the optimization. Still, it's clearly an invalid
memory access that must be fixed.
So: fix this by limiting the search to within the page as is done in the
non-iomap variant, block_is_partially_uptodate().
Zorro noticed thiswhen KASAN went off for 512 byte blocks on a 64k
page system:
BUG: KASAN: slab-out-of-bounds in iomap_is_partially_uptodate+0x1a0/0x1e0
Read of size 8 at addr ffff800120c3a318 by task fsstress/22337
Reported-by: Zorro Lang <zlang@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
- Kconfig dependency fixes for our new auth feature
- Fix for selecting the right compressor when creating a fs
- Bugfix for a bug in UBIFS's O_TMPFILE implementation
- Refcounting fixes for UBI
-----BEGIN PGP SIGNATURE-----
iQJKBAABCAA0FiEEdgfidid8lnn52cLTZvlZhesYu8EFAlwb+MAWHHJpY2hhcmRA
c2lnbWEtc3Rhci5hdAAKCRBm+VmF6xi7wSIjD/9zyAQUaMXJzI/ct2SXXikyTgRi
hmLbUhuIEZD/c3tDQMgb3XqL7yYkOt/QnfeMXpi/Fy9wAdKcOU9QCcL6TPfR9B+B
v22VdDcsU5pp2Hcinm2Yz6v2OyYwchb2YShfMkiZSf0CKAhcsKZhaeqhug/Zq+w5
xZ6fMPvu8HhinBavq9OgSpPx892vlF1ypXHNeQNxwslz/gWKso47jtmXpz4LDX7+
s4+p+lagVJ0MHQ3fVhBvEplWr9zPKHvX8uGohADLsDp5bAqRyGNxBwqgGm6J6H7B
UHuqUjUf19hUZEauwKugmPTM3Y0TkZjV472qsp9omx8w/XozVqfb4YgMBBE3Qwuy
ZS7GcaGxf54WJbOJFQv7pS3DUmfOjQBnU7YAR7phMBDeOUfxRXWkuzIB+RkED+0d
7JjTGWQItdbTfXl2QWfSr/1rj45LNKmVPMAI2pKUBnC9dD9pktCcB90vTcrg2vjE
3U9GbqOVORZvHh2bcR0Y8uDUBOp4v7ybCc+OOTNVWuWQpckp79E/7dkVLbpoI12R
xEgMEaCSXlwPUxQ/FrxKBWhOZprPXNMo4xw547SDBxMq0IKUniA+YoUxtN3FxhQ8
OiLkfcQ+5bove9qYfARaRC/hk5lpgMWebAGou/CkfTIPJsw3ZdlmL6HH6ag5i6fS
fBHsVBdtv0ADSLXylg==
=nZML
-----END PGP SIGNATURE-----
Merge tag 'upstream-4.20-rc7' of git://git.infradead.org/linux-ubifs
Pull UBI/UBIFS fixes from Richard Weinberger:
- Kconfig dependency fixes for our new auth feature
- Fix for selecting the right compressor when creating a fs
- Bugfix for a bug in UBIFS's O_TMPFILE implementation
- Refcounting fixes for UBI
* tag 'upstream-4.20-rc7' of git://git.infradead.org/linux-ubifs:
ubifs: Handle re-linking of inodes correctly while recovery
ubi: Do not drop UBI device reference before using
ubi: Put MTD device after it is not used
ubifs: Fix default compression selection in ubifs
ubifs: Fix memory leak on error condition
ubifs: auth: Add CONFIG_KEYS dependency
ubifs: CONFIG_UBIFS_FS_AUTHENTICATION should depend on UBIFS_FS
ubifs: replay: Fix high stack usage
Separate just the changing of mount flags (MS_REMOUNT|MS_BIND) from full
remount because the mount data will get parsed with the new fs_context
stuff prior to doing a remount - and this causes the syscall to fail under
some circumstances.
To quote Eric's explanation:
[...] mount(..., MS_REMOUNT|MS_BIND, ...) now validates the mount options
string, which breaks systemd unit files with ProtectControlGroups=yes
(e.g. systemd-networkd.service) when systemd does the following to
change a cgroup (v1) mount to read-only:
mount(NULL, "/run/systemd/unit-root/sys/fs/cgroup/systemd", NULL,
MS_RDONLY|MS_NOSUID|MS_NODEV|MS_NOEXEC|MS_REMOUNT|MS_BIND, NULL)
... when the kernel has CONFIG_CGROUPS=y but no cgroup subsystems
enabled, since in that case the error "cgroup1: Need name or subsystem
set" is hit when the mount options string is empty.
Probably it doesn't make sense to validate the mount options string at
all in the MS_REMOUNT|MS_BIND case, though maybe you had something else
in mind.
This is also worthwhile doing because we will need to add a mount_setattr()
syscall to take over the remount-bind function.
Reported-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Reviewed-by: David Howells <dhowells@redhat.com>
Only the mount namespace code that implements mount(2) should be using the
MS_* flags. Suppress them inside the kernel unless uapi/linux/mount.h is
included.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Reviewed-by: David Howells <dhowells@redhat.com>
This reverts commit 61c6de6672.
The reverted commit added page reference counting to iomap page
structures that are used to track block size < page size state. This
was supposed to align the code with page migration page accounting
assumptions, but what it has done instead is break XFS filesystems.
Every fstests run I've done on sub-page block size XFS filesystems
has since picking up this commit 2 days ago has failed with bad page
state errors such as:
# ./run_check.sh "-m rmapbt=1,reflink=1 -i sparse=1 -b size=1k" "generic/038"
....
SECTION -- xfs
FSTYP -- xfs (debug)
PLATFORM -- Linux/x86_64 test1 4.20.0-rc6-dgc+
MKFS_OPTIONS -- -f -m rmapbt=1,reflink=1 -i sparse=1 -b size=1k /dev/sdc
MOUNT_OPTIONS -- /dev/sdc /mnt/scratch
generic/038 454s ...
run fstests generic/038 at 2018-12-20 18:43:05
XFS (sdc): Unmounting Filesystem
XFS (sdc): Mounting V5 Filesystem
XFS (sdc): Ending clean mount
BUG: Bad page state in process kswapd0 pfn:3a7fa
page:ffffea0000ccbeb0 count:0 mapcount:0 mapping:ffff88800d9b6360 index:0x1
flags: 0xfffffc0000000()
raw: 000fffffc0000000 dead000000000100 dead000000000200 ffff88800d9b6360
raw: 0000000000000001 0000000000000000 00000000ffffffff
page dumped because: non-NULL mapping
CPU: 0 PID: 676 Comm: kswapd0 Not tainted 4.20.0-rc6-dgc+ #915
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.1-1 04/01/2014
Call Trace:
dump_stack+0x67/0x90
bad_page.cold.116+0x8a/0xbd
free_pcppages_bulk+0x4bf/0x6a0
free_unref_page_list+0x10f/0x1f0
shrink_page_list+0x49d/0xf50
shrink_inactive_list+0x19d/0x3b0
shrink_node_memcg.constprop.77+0x398/0x690
? shrink_slab.constprop.81+0x278/0x3f0
shrink_node+0x7a/0x2f0
kswapd+0x34b/0x6d0
? node_reclaim+0x240/0x240
kthread+0x11f/0x140
? __kthread_bind_mask+0x60/0x60
ret_from_fork+0x24/0x30
Disabling lock debugging due to kernel taint
....
The failures are from anyway that frees pages and empties the
per-cpu page magazines, so it's not a predictable failure or an easy
to debug failure.
generic/038 is a reliable reproducer of this problem - it has a 9 in
10 failure rate on one of my test machines. Failure on other
machines have been at random points in fstests runs but every run
has ended up tripping this problem. Hence generic/038 was used to
bisect the failure because it was the most reliable failure.
It is too close to the 4.20 release (not to mention holidays) to
try to diagnose, fix and test the underlying cause of the problem,
so reverting the commit is the only option we have right now. The
revert has been tested against a current tot 4.20-rc7+ kernel across
multiple machines running sub-page block size XFs filesystems and
none of the bad page state failures have been seen.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Cc: Piotr Jaroszynski <pjaroszynski@nvidia.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: William Kucharski <william.kucharski@oracle.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Brian Foster <bfoster@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use __print_symbolic to print the scrub type in ftrace output.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Use __print_symbolic to print the btree type in ftrace output.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Move XFS_INODE_FORMAT_STR to libxfs so that we don't forget to keep it
updated, and add necessary TRACE_DEFINE_ENUM.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Move XFS_AG_BTREE_CMP_FORMAT_STR to libxfs so that we don't forget to
keep it updated, and TRACE_DEFINE_ENUM the values while we're at it.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
ftrace's __print_symbolic() has a (very poorly documented) requirement
that any enum values used in the symbol to string translation table be
wrapped in a TRACE_DEFINE_ENUM so that the enum value can be encoded in
the ftrace ring buffer. Fix this unsatisfied requirement.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Use %pS instead of %pF in ftrace strings so that we record the actual
function address instead of the function descriptor.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
If the file system has been shut down or is read-only, then
ext4_write_inode() needs to bail out early.
Also use jbd2_complete_transaction() instead of ext4_force_commit() so
we only force a commit if it is needed.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Some time back, nfsd switched from calling vfs_fsync() to using a new
commit_metadata() hook in export_operations(). If the file system did
not provide a commit_metadata() hook, it fell back to using
sync_inode_metadata(). Unfortunately doesn't work on all file
systems. In particular, it doesn't work on ext4 due to how the inode
gets journalled --- the VFS writeback code will not always call
ext4_write_inode().
So we need to provide our own ext4_nfs_commit_metdata() method which
calls ext4_write_inode() directly.
Google-Bug-Id: 121195940
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
SUNRPC has two sorts of credentials, both of which appear as
"struct rpc_cred".
There are "generic credentials" which are supplied by clients
such as NFS and passed in 'struct rpc_message' to indicate
which user should be used to authorize the request, and there
are low-level credentials such as AUTH_NULL, AUTH_UNIX, AUTH_GSS
which describe the credential to be sent over the wires.
This patch replaces all the generic credentials by 'struct cred'
pointers - the credential structure used throughout Linux.
For machine credentials, there is a special 'struct cred *' pointer
which is statically allocated and recognized where needed as
having a special meaning. A look-up of a low-level cred will
map this to a machine credential.
Signed-off-by: NeilBrown <neilb@suse.com>
Acked-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Use the common 'struct cred' to pass credentials for readdir.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Rather than keying the access cache with 'struct rpc_cred',
use 'struct cred'. Then use cred_fscmp() to compare
credentials rather than comparing the raw pointer.
A benefit of this approach is that in the common case we avoid the
rpc_lookup_cred_nonblock() call which can be slow when the cred cache is large.
This also keeps many fewer items pinned in the rpc cred cache, so the
cred cache is less likely to get large.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
NFS needs to know when a credential is about to expire so that
it can modify write-back behaviour to finish the write inside the
expiry time.
It currently uses functions in SUNRPC code which make use of a
fairly complex callback scheme and flags in the generic credientials.
As I am working to discard the generic credentials, this has to change.
This patch moves the logic into NFS, in part by finding and caching
the low-level credential in the open_context. We then make direct
cred-api calls on that.
This makes the code much simpler and removes a dependency on generic
rpc credentials.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
When NFS creates a machine credential, it is a "generic" credential,
not tied to any auth protocol, and is really just a container for
the princpal name.
This doesn't get linked to a genuine credential until rpcauth_bindcred()
is called.
The lookup always succeeds, so various places that test if the machine
credential is NULL, are pointless.
As a step towards getting rid of generic credentials, this patch gets
rid of generic machine credentials. The nfs_client and rpc_client
just hold a pointer to a constant principal name.
When a machine credential is wanted, a special static 'struct rpc_cred'
pointer is used. rpcauth_bindcred() recognizes this, finds the
principal from the client, and binds the correct credential.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
This lock is no longer necessary.
If nfs4_get_renew_cred() needs to hunt through the open-state
creds for a user cred, it still takes the lock to stablize
the rbtree, but otherwise there are no races.
Note that this completely removes the lock from nfs4_renew_state().
It appears that the original need for the locking here was removed
long ago, and there is no longer anything to protect.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
NFSv4 state management tries a root credential when no machine
credential is available, as can happen with kerberos.
It does this by replacing the cl_machine_cred with a root credential.
This means that any user of the machine credential needs to take
a lock while getting a reference to the machine credential, which is
a little cumbersome.
So introduce an explicit cl_root_cred, and never free either
credential until client shutdown. This means that no locking
is needed to reference these credentials. Future patches
will make use of this.
This is only a temporary addition. both cl_machine_cred and
cl_root_cred will disappear later in the series.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
We can use cred->groupinfo (from the 'struct cred') instead.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
The SUNRPC credential framework was put together before
Linux has 'struct cred'. Now that we have it, it makes sense to
use it.
This first step just includes a suitable 'struct cred *' pointer
in every 'struct auth_cred' and almost every 'struct rpc_cred'.
The rpc_cred used for auth_null has a NULL 'struct cred *' as nothing
else really makes sense.
For rpc_cred, the pointer is reference counted.
For auth_cred it isn't. struct auth_cred are either allocated on
the stack, in which case the thread owns a reference to the auth,
or are part of 'struct generic_cred' in which case gc_base owns the
reference, and "acred" shares it.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Please see comment to filelayout_pg_test for reference.
To: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Anna Schumaker <anna.schumaker@netapp.com>
Cc: linux-nfs@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
commit e8f25e6d6d "NFS: Remove the NFS v4 xdev mount function"
removed the last use of this.
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
If we receive a file handle, either from NFS or open_by_handle_at(2),
and it points at an inode which has not been initialized, and the file
system has metadata checksums enabled, we shouldn't try to get the
inode, discover the checksum is invalid, and then declare the file
system as being inconsistent.
This can be reproduced by creating a test file system via "mke2fs -t
ext4 -O metadata_csum /tmp/foo.img 8M", mounting it, cd'ing into that
directory, and then running the following program.
#define _GNU_SOURCE
#include <fcntl.h>
struct handle {
struct file_handle fh;
unsigned char fid[MAX_HANDLE_SZ];
};
int main(int argc, char **argv)
{
struct handle h = {{8, 1 }, { 12, }};
open_by_handle_at(AT_FDCWD, &h.fh, O_RDONLY);
return 0;
}
Google-Bug-Id: 120690101
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
In ext4_expand_extra_isize_ea(), we calculate the total size of the
xattr header, plus the xattr entries so we know how much of the
beginning part of the xattrs to move when expanding the inode extra
size. We need to include the terminating u32 at the end of the xattr
entries, or else if there is uninitialized, non-zero bytes after the
xattr entries and before the xattr values, the list of xattr entries
won't be properly terminated.
Reported-by: Steve Graham <stgraham2000@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Some servers require that the setinfo matches the exact size,
and in this case compounding changes introduced by
commit c2e0fe3f5a ("cifs: make rmdir() use compounding")
caused us to send 8 bytes (padded length) instead of 1 byte
(the size of the structure). See MS-FSCC section 2.4.11.
Fixing this when we send a SET_INFO command for delete file
disposition, then ends up as an iov of a single byte but this
causes problems with SMB3 and encryption.
To avoid this, instead of creating a one byte iov for the disposition value
and then appending an additional iov with a 7 byte padding we now handle
this as a single 8 byte iov containing both the disposition byte as well as
the padding in one single buffer.
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Acked-by: Paulo Alcantara <palcantara@suse.de>
44d8047f1d ("binder: use standard functions to allocate fds")
exposed a pre-existing issue in the binder driver.
fdget() is used in ksys_ioctl() as a performance optimization.
One of the rules associated with fdget() is that ksys_close() must
not be called between the fdget() and the fdput(). There is a case
where this requirement is not met in the binder driver which results
in the reference count dropping to 0 when the device is still in
use. This can result in use-after-free or other issues.
If userpace has passed a file-descriptor for the binder driver using
a BINDER_TYPE_FDA object, then kys_close() is called on it when
handling a binder_ioctl(BC_FREE_BUFFER) command. This violates
the assumptions for using fdget().
The problem is fixed by deferring the close using task_work_add(). A
new variant of __close_fd() was created that returns a struct file
with a reference. The fput() is deferred instead of using ksys_close().
Fixes: 44d8047f1d ("binder: use standard functions to allocate fds")
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Todd Kjos <tkjos@google.com>
Cc: stable <stable@vger.kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
- Parse the 4BAIT SFDP section
- Add a bunch of SPI NOR entries to the flash_info table
- Add the concept of SFDP fixups and use it to fix a bug on MX25L25635F
- A bunch of minor cleanups/comestic changes
-----BEGIN PGP SIGNATURE-----
iQJQBAABCgA6FiEEKmCqpbOU668PNA69Ze02AX4ItwAFAlwVQUYcHGJvcmlzLmJy
ZXppbGxvbkBib290bGluLmNvbQAKCRBl7TYBfgi3ACIYD/sG0+vRIKK8+NgNUHYy
nzKICKvdnBrm2RWi+6va5n2pYggyNy1VhWEjmqWLupjxn7NGkjZiBilhfj8Iv6YN
HScNy7FLHM6pxTKpsZKQLLGKvaUXODgvZwiw3L6T5T3JaJ5nlpE5g8jQy8sCzfjK
pwKdrOw17caZgoY0bMe2ppCObIDLd+mY+WSHbo6tb4/fohpTX1l9QZYHjfgHU9vP
CG0z3sU0JCNGXsbQMngfeuyXFjJ4OKdnklbVTeZl673AYtQMBhQEIGNVkVefuBP3
p8hU0CWRn0Yikc1HGTENvYCnQ7ju3z+16rnLxy3A5CPHhCDrTgUmM8HabYbh+0Si
0Y1wXpEOZ0OZQ7uMs2Q8SK0GLyxqdxkE0ibHgb7K/aLb+yg8oB7DB4Uenb06CiaQ
KAZWGgWZlSondX+/GI7YcQozslFFCfixAw6H0kCpQW0/2piXNqsN5BOROEqjueXW
xeMG0DNnzrQ6/vB9ukLESSB/YvVwfUvt6GSjqMSDdXwx6zyKSvHJ+chCxlK46+Hm
zIVcvNT3mpwcuVnQOZZeCaiIrDUAySEq/8Ztp9O5/CfkzdQyMWxDPoY9A0HjL2p3
FmRN7aAB9jJcdHc2tLwcKPRepjliIUMLf0NXdTSizzQz8WJqULZRBN0VWW4sCbLc
+tTisYjX8fYUz9+kHUcQ4XY16g==
=JXhP
-----END PGP SIGNATURE-----
Merge tag 'spi-nor/for-4.21' of git://git.infradead.org/linux-mtd into mtd/next
Core changes:
- Parse the 4BAIT SFDP section
- Add a bunch of SPI NOR entries to the flash_info table
- Add the concept of SFDP fixups and use it to fix a bug on MX25L25635F
- A bunch of minor cleanups/comestic changes
NAND core changes:
- kernel-doc miscellaneous fixes.
- Third batch of fixes/cleanup to the raw NAND core impacting various
controller drivers (ams-delta, marvell, fsmc, denali, tegra, vf610):
* Stopping to pass mtd_info objects to internal functions
* Reorganizing code to avoid forward declarations
* Dropping useless test in nand_legacy_set_defaults()
* Moving nand_exec_op() to internal.h
* Adding nand_[de]select_target() helpers
* Passing the CS line to be selected in struct nand_operation
* Making ->select_chip() optional when ->exec_op() is implemented
* Deprecating the ->select_chip() hook
* Moving the ->exec_op() method to nand_controller_ops
* Moving ->setup_data_interface() to nand_controller_ops
* Deprecating the dummy_controller field
* Fixing JEDEC detection
* Providing a helper for polling GPIO R/B pin
Raw NAND chip drivers changes:
- Macronix:
* Flagging 1.8V AC chips with a broken GET_FEATURES(TIMINGS)
Raw NAND controllers drivers changes:
- Ams-delta:
* Fixing the error path
* SPDX tag added
* May be compiled with COMPILE_TEST=y
* Conversion to ->exec_op() interface
* Dropping .IOADDR_R/W use
* Use GPIO API for data I/O
- Denali:
* Removing denali_reset_banks()
* Removing ->dev_ready() hook
* Including <linux/bits.h> instead of <linux/bitops.h>
* Changes to comply with the above fixes/cleanup done in the core.
- FSMC:
* Adding an SPDX tag to replace the license text
* Making conversion from chip to fsmc consistent
* Fixing unchecked return value in fsmc_read_page_hwecc
* Changes to comply with the above fixes/cleanup done in the core.
- Marvell:
* Preventing timeouts on a loaded machine (fix)
* Changes to comply with the above fixes/cleanup done in the core.
- OMAP2:
* Pass the parent of pdev to dma_request_chan() (fix)
- R852:
* Use generic DMA API
- sh_flctl:
* Converting to SPDX identifiers
- Sunxi:
* Write pageprog related opcodes to the right register: WCMD_SET (fix)
- Tegra:
* Stop implementing ->select_chip()
- VF610:
* Adding an SPDX tag to replace the license text
* Changes to comply with the above fixes/cleanup done in the core.
- Various trivial/spelling/coding style fixes.
SPI-NAND drivers changes:
- Removing the depreacated mt29f_spinand driver from staging.
- Adding support for:
* Toshiba TC58CVG2S0H
* GigaDevice GD5FxGQ4xA
* Winbond W25N01GV
Several ioctl structs change size between native 32-bit (ia32) and x32
applications, because x32 follows the native 64-bit (amd64) integer
alignment rules and uses 64-bit time_t. In these instances, the ioctl
number changes so userspace simply gets -ENOTTY. This scenario can be
handled by simply adding more cases.
Looking at the different ioctls implemented here:
- All the ones marked 'No size or alignment issue on any arch' should
presumably all be fine.
- All the ones under BROKEN_X86_ALIGNMENT are different under integer
alignment rules. Since x32 matches amd64 here, we just need both
sets of cases handled.
- XFS_IOC_SWAPEXT has both integer alignment differences and time_t
differences. Since x32 matches amd64 here, we need to add a case
which calls the native implementation.
- The remaining ioctls have neither 64-bit integers nor time_t, so
x32 matches ia32 here and no change is required at this level. The
bulkstat ioctl implementations have some pointer chasing which is
handled separately.
Signed-off-by: Nick Bowler <nbowler@draconx.ca>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The bulkstat family of ioctls are problematic on x32, because there is
a mixup of native 32-bit and 64-bit conventions. The xfs_fsop_bulkreq
struct contains pointers and 32-bit integers so that matches the native
32-bit layout, and that means the ioctl implementation goes into the
regular compat path on x32.
However, the 'ubuffer' member of that struct in turn refers to either
struct xfs_inogrp or xfs_bstat (or an array of these). On x32, those
structures match the native 64-bit layout. The compat implementation
writes out the 32-bit version of these structures. This is not the
expected format for x32 userspace, causing problems.
Fortunately the functions which actually output these xfs_inogrp and
xfs_bstat structures have an easy way to select which output format
is required, so we just need a little tweak to select the right format
on x32.
Signed-off-by: Nick Bowler <nbowler@draconx.ca>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
While inspecting the ioctl implementations, I noticed that the compat
implementation of XFS_IOC_ATTRLIST_BY_HANDLE does not do exactly the
same thing as the native implementation. Specifically, the "cursor"
does not appear to be written out to userspace on the compat path,
like it is on the native path.
This adjusts the compat implementation to copy out the cursor just
like the native implementation does. The attrlist cursor does not
require any special compat handling. This fixes xfstests xfs/269
on both IA-32 and x32 userspace, when running on an amd64 kernel.
Signed-off-by: Nick Bowler <nbowler@draconx.ca>
Fixes: 0facef7fb0 ("xfs: in _attrlist_by_handle, copy the cursor back to userspace")
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Before this patch, function do_grow would not reserve enough journal
blocks in the transaction to unstuff jdata files while growing them.
This patch adds the logic to add one more block if the file to grow
is jdata.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
In preparation of handing in iocbs in a different fashion as well. Also
make it clear that the iocb being passed in isn't modified, by marking
it const throughout.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Replace the percpu_ref_put() + kmem_cache_free() with a call to
iocb_put() instead.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Plugging is meant to optimize submission of a string of IOs, if we don't
have more than 2 being submitted, don't bother setting up a plug.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
It's 192 bytes, fairly substantial. Most items don't need to be cleared,
especially not upfront. Clear the ones we do need to clear, and leave
the other ones for setup when the iocb is prepared and submitted.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This is in preparation for certain types of IO not needing a ring
reserveration.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We know this is a read/write request, but in preparation for
having different kinds of those, ensure that we call the assigned
handler instead of assuming it's aio_complete_rq().
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
* for-4.21/block: (351 commits)
blk-mq: enable IO poll if .nr_queues of type poll > 0
blk-mq: change blk_mq_queue_busy() to blk_mq_queue_inflight()
blk-mq: skip zero-queue maps in blk_mq_map_swqueue
block: fix blk-iolatency accounting underflow
blk-mq: fix dispatch from sw queue
block: mq-deadline: Fix write completion handling
nvme-pci: don't share queue maps
blk-mq: only dispatch to non-defauly queue maps if they have queues
blk-mq: export hctx->type in debugfs instead of sysfs
blk-mq: fix allocation for queue mapping table
blk-wbt: export internal state via debugfs
blk-mq-debugfs: support rq_qos
block: update sysfs documentation
block: loop: check error using IS_ERR instead of IS_ERR_OR_NULL in loop_add()
aoe: add __exit annotation
block: clear REQ_HIPRI if polling is not supported
blk-mq: replace and kill blk_mq_request_issue_directly
blk-mq: issue directly with bypass 'false' in blk_mq_sched_insert_requests
blk-mq: refactor the code of issue request directly
block: remove the bio_integrity_advance export
...
current_time is the last remaining caller of current_kernel_time64(),
which is a wrapper around ktime_get_coarse_real_ts64(). This calls the
latter directly for consistency with the rest of the kernel that is moving
to the ktime_get_ family of time accessors, as now documented in
Documentation/core-api/timekeeping.rst.
An open questions is whether we may want to actually call the more
accurate ktime_get_real_ts64() for file systems that save high-resolution
timestamps in their on-disk format. This would add a small overhead to
each update of the inode stamps but lead to inode timestamps to actually
have a usable resolution better than one jiffy (1 to 10 milliseconds
normally). Experiments on a variety of hardware platforms show a typical
time of around 100 CPU cycles to read the cycle counter and calculate the
accurate time from that. On old platforms without a cycle counter, this
can be signiciantly higher, up to several microseconds to access a
hardware clock, but those have become very rare by now.
I traced the original addition of the current_kernel_time() call to set
the nanosecond fields back to linux-2.5.48, where Andi Kleen added a patch
with subject "nanosecond stat timefields". Andi explains that the
motivation was to introduce as little overhead as possible back then. At
this time, reading the clock hardware was also more expensive when most
architectures did not have a cycle counter.
One side effect of having more accurate inode timestamp would be having to
write out the inode every time that mtime/ctime/atime get touched on most
systems, whereas many file systems today only write it when the timestamps
have changed, i.e. at most once per jiffy unless something else changes
as well. That change would certainly be noticed in some workloads, which
is enough reason to not do it without a good reason, regardless of the
cost of reading the time.
One thing we could still consider however would be to round the timestamps
from current_time() to multiples of NSEC_PER_JIFFY, e.g. full
milliseconds rather than having six or seven meaningless but confusing
digits at the end of the timestamp.
Link: http://lkml.kernel.org/r/20180726130820.4174359-1-arnd@arndb.de
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
... and don't abuse mount_nodev(), while we are at it.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Reviewed-by: David Howells <dhowells@redhat.com>
The typos accumulate over time so once in a while time they get fixed in
a large patch.
Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In the error handling block, err holds the return value of either
btrfs_del_root_ref() or btrfs_del_inode_ref() but it hasn't been checked
since it's introduction with commit fe66a05a06 (Btrfs: improve error
handling for btrfs_insert_dir_item callers) in 2012.
If the error handling in the error handling fails, there's not much left
to do and the abort either happened earlier in the callees or is
necessary here.
So if one of btrfs_del_root_ref() or btrfs_del_inode_ref() failed, abort
the transaction, but still return the original code of the failure
stored in 'ret' as this will be reported to the user.
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since cloning and deduplication are no longer Btrfs specific operations, we
now have generic code to handle parameter validation, compare file ranges
used for deduplication, clear capabilities when cloning, etc. This change
makes Btrfs use it, eliminating a lot of code in Btrfs and also fixing a
few bugs, such as:
1) When cloning, the destination file's capabilities were not dropped
(the fstest generic/513 tests this);
2) We were not checking if the destination file is immutable;
3) Not checking if either the source or destination files are swap
files (swap file support is coming soon for Btrfs);
4) System limits were not checked (resource limits and O_LARGEFILE).
Note that the generic helper generic_remap_file_range_prep() does start
and waits for writeback by calling filemap_write_and_wait_range(), however
that is not enough for Btrfs for two reasons:
1) With compression, we need to start writeback twice in order to get the
pages marked for writeback and ordered extents created;
2) filemap_write_and_wait_range() (and all its other variants) only waits
for the IO to complete, but we need to wait for the ordered extents to
finish, so that when we do the actual reflinking operations the file
extent items are in the fs tree. This is also important due to the fact
that the generic helper, for the deduplication case, compares the
contents of the pages in the requested range, which might require
reading extents from disk in the very unlikely case that pages get
invalidated after writeback finishes (so the file extent items must be
up to date in the fs tree).
Since these reasons are specific to Btrfs we have to do it in the Btrfs
code before calling generic_remap_file_range_prep(). This also results
in a simpler way of dealing with existing delalloc in the source/target
ranges, specially for the deduplication case where we used to lock all
the pages first and then if we found any dealloc for the range, or
ordered extent, we would unlock the pages trigger writeback and wait for
ordered extents to complete, then lock all the pages again and check if
deduplication can be done. So now we get a simpler approach: lock the
inodes, then trigger writeback and then wait for ordered extents to
complete.
So make btrfs use generic_remap_file_range_prep() (XFS and OCFS2 use it)
to eliminate duplicated code, fix a few bugs and benefit from future bug
fixes done there - for example the recent clone and dedupe bugs involving
reflinking a partial EOF block got a counterpart fix in the generic
helper, since it affected all filesystems supporting these operations,
so we no longer need special checks in Btrfs for them.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
extent_readpages processes all pages in the readlist in batches of 16,
this is implemented by a single for loop but thanks to an if condition
the loop does 2 things based on whether we've filled the batch or not.
Additionally due to the structure of the code there is an additional
check which deals with partial batches.
Streamline all of this by explicitly using two loops. The outter one is
used to process all pages while the inner one just fills in the batch
of 16 (currently). Due to this new structure the code guarantees that
all pages are processed in the loop hence the code to deal with any
leftovers is eliminated.
This also enable the compiler to inline __extent_readpages:
./scripts/bloat-o-meter fs/btrfs/extent_io.o extent_io.for
add/remove: 0/1 grow/shrink: 1/0 up/down: 660/-820 (-160)
Function old new delta
extent_readpages 476 1136 +660
__extent_readpages 820 - -820
Total: Before=44315, After=44155, chg -0.36%
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The first step of the rebalance process ensures there is 1MiB free on
each device. This number seems rather small. And in fact when talking to
the original authors their opinions were:
"man that's a little bonkers"
"i don't think we even need that code anymore"
"I think it was there to make sure we had room for the blank 1M at the
beginning. I bet it goes all the way back to v0"
"we just don't need any of that tho, i say we just delete it"
Clearly, this piece of code has lost its original intent throughout the
years. It doesn't really bring any real practical benefits to the
relocation process.
Additionally, this patch makes the balance process more lightweight by
removing a pair of shrink/grow operations which are rather expensive for
heavily populated filesystems. This is mainly due to shrink requiring
relocating block groups, involving heavy use of the btree.
The intermediate shrink/grow can fail and leave the filesystem in a
middle state that would need to be changed back by the user.
Suggested-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
If we create a snapshot of a snapshot currently being used by a send
operation, we can end up with send failing unexpectedly (returning
-ENOENT error to user space for example). The following diagram shows
how this happens.
CPU 1 CPU2 CPU3
btrfs_ioctl_send()
(...)
create_snapshot()
-> creates snapshot of a
root used by the send
task
btrfs_commit_transaction()
create_pending_snapshot()
__get_inode_info()
btrfs_search_slot()
btrfs_search_slot_get_root()
down_read commit_root_sem
get reference on eb of the
commit root
-> eb with bytenr == X
up_read commit_root_sem
btrfs_cow_block(root node)
btrfs_free_tree_block()
-> creates delayed ref to
free the extent
btrfs_run_delayed_refs()
-> runs the delayed ref,
adds extent to
fs_info->pinned_extents
btrfs_finish_extent_commit()
unpin_extent_range()
-> marks extent as free
in the free space cache
transaction commit finishes
btrfs_start_transaction()
(...)
btrfs_cow_block()
btrfs_alloc_tree_block()
btrfs_reserve_extent()
-> allocates extent at
bytenr == X
btrfs_init_new_buffer(bytenr X)
btrfs_find_create_tree_block()
alloc_extent_buffer(bytenr X)
find_extent_buffer(bytenr X)
-> returns existing eb,
which the send task got
(...)
-> modifies content of the
eb with bytenr == X
-> uses an eb that now
belongs to some other
tree and no more matches
the commit root of the
snapshot, resuts will be
unpredictable
The consequences of this race can be various, and can lead to searches in
the commit root performed by the send task failing unexpectedly (unable to
find inode items, returning -ENOENT to user space, for example) or not
failing because an inode item with the same number was added to the tree
that reused the metadata extent, in which case send can behave incorrectly
in the worst case or just fail later for some reason.
Fix this by performing a copy of the commit root's extent buffer when doing
a search in the context of a send operation.
CC: stable@vger.kernel.org # 4.4.x: 1fc28d8e2e: Btrfs: move get root out of btrfs_search_slot to a helper
CC: stable@vger.kernel.org # 4.4.x: f9ddfd0592: Btrfs: remove unused check of skip_locking
CC: stable@vger.kernel.org # 4.4.x
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When initializing the security xattrs, we are holding a transaction handle
therefore we need to use a GFP_NOFS context in order to avoid a deadlock
with reclaim in case it's triggered.
Fixes: 39a27ec100 ("btrfs: use GFP_KERNEL for xattr and acl allocations")
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
With my delayed refs patches in place we started seeing a large amount
of aborts in __btrfs_free_extent:
BTRFS error (device sdb1): unable to find ref byte nr 91947008 parent 0 root 35964 owner 1 offset 0
Call Trace:
? btrfs_merge_delayed_refs+0xaf/0x340
__btrfs_run_delayed_refs+0x6ea/0xfc0
? btrfs_set_path_blocking+0x31/0x60
btrfs_run_delayed_refs+0xeb/0x180
btrfs_commit_transaction+0x179/0x7f0
? btrfs_check_space_for_delayed_refs+0x30/0x50
? should_end_transaction.isra.19+0xe/0x40
btrfs_drop_snapshot+0x41c/0x7c0
btrfs_clean_one_deleted_snapshot+0xb5/0xd0
cleaner_kthread+0xf6/0x120
kthread+0xf8/0x130
? btree_invalidatepage+0x90/0x90
? kthread_bind+0x10/0x10
ret_from_fork+0x35/0x40
This was because btrfs_drop_snapshot depends on the root not being
modified while it's dropping the snapshot. It will unlock the root node
(and really every node) as it walks down the tree, only to re-lock it
when it needs to do something. This is a problem because if we modify
the tree we could cow a block in our path, which frees our reference to
that block. Then once we get back to that shared block we'll free our
reference to it again, and get ENOENT when trying to lookup our extent
reference to that block in __btrfs_free_extent.
This is ultimately happening because we have delayed items left to be
processed for our deleted snapshot _after_ all of the inodes are closed
for the snapshot. We only run the delayed inode item if we're deleting
the inode, and even then we do not run the delayed insertions or delayed
removals. These can be run at any point after our final inode does its
last iput, which is what triggers the snapshot deletion. We can end up
with the snapshot deletion happening and then have the delayed items run
on that file system, resulting in the above problem.
This problem has existed forever, however my patches made it much easier
to hit as I wake up the cleaner much more often to deal with delayed
iputs, which made us more likely to start the snapshot dropping work
before the transaction commits, which is when the delayed items would
generally be run. Before, generally speaking, we would run the delayed
items, commit the transaction, and wakeup the cleaner thread to start
deleting snapshots, which means we were less likely to hit this problem.
You could still hit it if you had multiple snapshots to be deleted and
ended up with lots of delayed items, but it was definitely harder.
Fix for now by simply running all the delayed items before starting to
drop the snapshot. We could make this smarter in the future by making
the delayed items per-root, and then simply drop any delayed items for
roots that we are going to delete. But for now just a quick and easy
solution is the safest.
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When debugging some weird extent reference bug I suspected that we were
changing a snapshot while we were deleting it, which could explain my
bug. This was indeed what was happening, and this patch helped me
verify my theory. It is never correct to modify the snapshot once it's
being deleted, so mark the root when we are deleting it and make sure we
complain about it when it happens.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
@blocksize variable in do_walk_down() is only used once, really no need
to declare it.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since scrub workers only do memory allocation with GFP_KERNEL when they
need to perform repair, we can move the recent setup of the nofs context
up to scrub_handle_errored_block() instead of setting it up down the call
chain at insert_full_stripe_lock() and scrub_add_page_to_wr_bio(),
removing some duplicate code and comment. So the only paths for which a
scrub worker can do memory allocations using GFP_KERNEL are the following:
scrub_bio_end_io_worker()
scrub_block_complete()
scrub_handle_errored_block()
lock_full_stripe()
insert_full_stripe_lock()
-> kmalloc with GFP_KERNEL
scrub_bio_end_io_worker()
scrub_block_complete()
scrub_handle_errored_block()
scrub_write_page_to_dev_replace()
scrub_add_page_to_wr_bio()
-> kzalloc with GFP_KERNEL
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The scrub context is allocated with GFP_KERNEL and called from
btrfs_scrub_dev under the fs_info::device_list_mutex. This is not safe
regarding reclaim that could try to flush filesystem data in order to
get the memory. And the device_list_mutex is held during superblock
commit, so this would cause a lockup.
Move the alocation and initialization before any changes that require
the mutex.
Signed-off-by: David Sterba <dsterba@suse.com>
We can pass fs_info directly as this is the only member of btrfs_device
that's bing used inside scrub_setup_ctx.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have a bunch of magic to make sure we're throttling delayed refs when
truncating a file. Now that we have a delayed refs rsv and a mechanism
for refilling that reserve simply use that instead of all of this magic.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Over the years we have built up a lot of infrastructure to keep delayed
refs in check, mostly by running them at btrfs_end_transaction() time.
We have a lot of different maths we do to figure out how much, if we
should do it inline or async, etc. This existed because we had no
feedback mechanism to force the flushing of delayed refs when they
became a problem. However with the enospc flushing infrastructure in
place for flushing delayed refs when they put too much pressure on the
enospc system we have this problem solved. Rip out all of this code as
it is no longer needed.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now with the delayed_refs_rsv we can now know exactly how much pending
delayed refs space we need. This means we can drastically simplify
btrfs_check_space_for_delayed_refs by simply checking how much space we
have reserved for the global rsv (which acts as a spill over buffer) and
the delayed refs rsv. If our total size is beyond that amount then we
know it's time to commit the transaction and stop any more delayed refs
from being generated.
With the introduction of dealyed_refs_rsv infrastructure, namely
btrfs_update_delayed_refs_rsv we now know exactly how much pending
delayed refs space is required.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
A nice thing we gain with the delayed refs rsv is the ability to flush
the delayed refs on demand to deal with enospc pressure. Add states to
flush delayed refs on demand, and this will allow us to remove a lot of
ad-hoc work around checking to see if we should commit the transaction
to run our delayed refs.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Any space used in the delayed_refs_rsv will be freed up by a transaction
commit, so instead of just counting the pinned space we also need to
account for any space in the delayed_refs_rsv when deciding if it will
make a different to commit the transaction to satisfy our space
reservation. If we have enough bytes to satisfy our reservation ticket
then we are good to go, otherwise subtract out what space we would gain
back by committing the transaction and compare that against the pinned
space to make our decision.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Traditionally we've had voodoo in btrfs to account for the space that
delayed refs may take up by having a global_block_rsv. This works most
of the time, except when it doesn't. We've had issues reported and seen
in production where sometimes the global reserve is exhausted during
transaction commit before we can run all of our delayed refs, resulting
in an aborted transaction. Because of this voodoo we have equally
dubious flushing semantics around throttling delayed refs which we often
get wrong.
So instead give them their own block_rsv. This way we can always know
exactly how much outstanding space we need for delayed refs. This
allows us to make sure we are constantly filling that reservation up
with space, and allows us to put more precise pressure on the enospc
system. Instead of doing math to see if its a good time to throttle,
the normal enospc code will be invoked if we have a lot of delayed refs
pending, and they will be run via the normal flushing mechanism.
For now the delayed_refs_rsv will hold the reservations for the delayed
refs, the block group updates, and deleting csums. We could have a
separate rsv for the block group updates, but the csum deletion stuff is
still handled via the delayed_refs so that will stay there.
Historical background:
The global reserve has grown to cover everything we don't reserve space
explicitly for, and we've grown a lot of weird ad-hoc heuristics to know
if we're running short on space and when it's time to force a commit. A
failure rate of 20-40 file systems when we run hundreds of thousands of
them isn't super high, but cleaning up this code will make things less
ugly and more predictible.
Thus the delayed refs rsv. We always know how many delayed refs we have
outstanding, and although running them generates more we can use the
global reserve for that spill over, which fits better into it's desired
use than a full blown reservation. This first approach is to simply
take how many times we're reserving space for and multiply that by 2 in
order to save enough space for the delayed refs that could be generated.
This is a niave approach and will probably evolve, but for now it works.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com> # high-level review
[ added background notes from the cover letter ]
Signed-off-by: David Sterba <dsterba@suse.com>
We use this number to figure out how many delayed refs to run, but
__btrfs_run_delayed_refs really only checks every time we need a new
delayed ref head, so we always run at least one ref head completely no
matter what the number of items on it. Fix the accounting to only be
adjusted when we add/remove a ref head.
In addition to using this number to limit the number of delayed refs
run, a future patch is also going to use it to calculate the amount of
space required for delayed refs space reservation.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The cleanup_extent_op function actually would run the extent_op if it
needed running, which made the name sort of a misnomer. Change it to
run_and_cleanup_extent_op, and move the actual cleanup work to
cleanup_extent_op so it can be used by check_ref_cleanup() in order to
unify the extent op handling.
Reviewed-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We were missing some quota cleanups in check_ref_cleanup, so break the
ref head accounting cleanup into a helper and call that from both
check_ref_cleanup and cleanup_ref_head. This will hopefully ensure that
we don't screw up accounting in the future for other things that we add.
Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We do this dance in cleanup_ref_head and check_ref_cleanup, unify it
into a helper and cleanup the calling functions.
Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When using a 'var & (PAGE_SIZE - 1)' construct one is checking for a page
alignment and thus should use the PAGE_ALIGNED() macro instead of
open-coding it.
Convert all open-coded occurrences of PAGE_ALIGNED().
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Constructs like 'var & (PAGE_SIZE - 1)' or 'var & ~PAGE_MASK' can denote an
offset into a page.
So replace them by the offset_in_page() macro instead of open-coding it if
they're not used as an alignment check.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The dev-replace locking functions are now trivial wrappers around rw
semaphore that can be used directly everywhere. No functional change.
Signed-off-by: David Sterba <dsterba@suse.com>
After the rw semaphore has been added, the custom blocking using
::blocking_readers and ::read_lock_wq is redundant.
The blocking logic in __btrfs_map_block is replaced by extending the
time the semaphore is held, that has the same blocking effect on writes
as the previous custom scheme that waited until ::blocking_readers was
zero.
Signed-off-by: David Sterba <dsterba@suse.com>
This is the first part of removing the custom locking and waiting scheme
used for device replace. It was probably copied from extent buffer
locking, but there's nothing that would require more than is provided by
the common locking primitives.
The rw spinlock protects waiting tasks counter in case of incompatible
locks and the waitqueue. Same as rw semaphore.
This patch only switches the locking primitive, for better
bisectability. There should be no functional change other than the
overhead of the locking and potential sleeping instead of spinning when
the lock is contended.
Signed-off-by: David Sterba <dsterba@suse.com>
The device-replace read lock is going to use rw semaphore in followup
commits. The semaphore might sleep which is not possible in the radix
tree preload section. The lock nesting is now:
* device replace
* radix tree preload
* readahead spinlock
Signed-off-by: David Sterba <dsterba@suse.com>
Running btrfs/124 in a loop hung up on me sporadically with the
following call trace:
btrfs D 0 5760 5324 0x00000000
Call Trace:
? __schedule+0x243/0x800
schedule+0x33/0x90
btrfs_start_ordered_extent+0x10c/0x1b0 [btrfs]
? wait_woken+0xa0/0xa0
btrfs_wait_ordered_range+0xbb/0x100 [btrfs]
btrfs_relocate_block_group+0x1ff/0x230 [btrfs]
btrfs_relocate_chunk+0x49/0x100 [btrfs]
btrfs_balance+0xbeb/0x1740 [btrfs]
btrfs_ioctl_balance+0x2ee/0x380 [btrfs]
btrfs_ioctl+0x1691/0x3110 [btrfs]
? lockdep_hardirqs_on+0xed/0x180
? __handle_mm_fault+0x8e7/0xfb0
? _raw_spin_unlock+0x24/0x30
? __handle_mm_fault+0x8e7/0xfb0
? do_vfs_ioctl+0xa5/0x6e0
? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
do_vfs_ioctl+0xa5/0x6e0
? entry_SYSCALL_64_after_hwframe+0x3e/0xbe
ksys_ioctl+0x3a/0x70
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x60/0x1b0
entry_SYSCALL_64_after_hwframe+0x49/0xbe
This happens because during page writeback it's valid for
writepage_delalloc to instantiate a delalloc range which doesn't belong
to the page currently being written back.
The reason this case is valid is due to find_lock_delalloc_range
returning any available range after the passed delalloc_start and
ignoring whether the page under writeback is within that range.
In turn ordered extents (OE) are always created for the returned range
from find_lock_delalloc_range. If, however, a failure occurs while OE
are being created then the clean up code in btrfs_cleanup_ordered_extents
will be called.
Unfortunately the code in btrfs_cleanup_ordered_extents doesn't consider
the case of such 'foreign' range being processed and instead it always
assumes that the range OE are created for belongs to the page. This
leads to the first page of such foregin range to not be cleaned up since
it's deliberately missed and skipped by the current cleaning up code.
Fix this by correctly checking whether the current page belongs to the
range being instantiated and if so adjsut the range parameters passed
for cleaning up. If it doesn't, then just clean the whole OE range
directly.
Fixes: 524272607e ("btrfs: Handle delalloc error correctly to avoid ordered extent hang")
CC: stable@vger.kernel.org # 4.14+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The @found is always false when it comes to the if branch. Besides, the
bool type is more suitable for @found. Change the return value of the
function and its caller to bool as well.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The test case btrfs/001 with inode_cache mount option will encounter the
following warning:
WARNING: CPU: 1 PID: 23700 at fs/btrfs/inode.c:956 cow_file_range.isra.19+0x32b/0x430 [btrfs]
CPU: 1 PID: 23700 Comm: btrfs Kdump: loaded Tainted: G W O 4.20.0-rc4-custom+ #30
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
RIP: 0010:cow_file_range.isra.19+0x32b/0x430 [btrfs]
Call Trace:
? free_extent_buffer+0x46/0x90 [btrfs]
run_delalloc_nocow+0x455/0x900 [btrfs]
btrfs_run_delalloc_range+0x1a7/0x360 [btrfs]
writepage_delalloc+0xf9/0x150 [btrfs]
__extent_writepage+0x125/0x3e0 [btrfs]
extent_write_cache_pages+0x1b6/0x3e0 [btrfs]
? __wake_up_common_lock+0x63/0xc0
extent_writepages+0x50/0x80 [btrfs]
do_writepages+0x41/0xd0
? __filemap_fdatawrite_range+0x9e/0xf0
__filemap_fdatawrite_range+0xbe/0xf0
btrfs_fdatawrite_range+0x1b/0x50 [btrfs]
__btrfs_write_out_cache+0x42c/0x480 [btrfs]
btrfs_write_out_ino_cache+0x84/0xd0 [btrfs]
btrfs_save_ino_cache+0x551/0x660 [btrfs]
commit_fs_roots+0xc5/0x190 [btrfs]
btrfs_commit_transaction+0x2bf/0x8d0 [btrfs]
btrfs_mksubvol+0x48d/0x4d0 [btrfs]
btrfs_ioctl_snap_create_transid+0x170/0x180 [btrfs]
btrfs_ioctl_snap_create_v2+0x124/0x180 [btrfs]
btrfs_ioctl+0x123f/0x3030 [btrfs]
The file extent generation of the free space inode is equal to the last
snapshot of the file root, so the inode will be passed to cow_file_rage.
But the inode was created and its extents were preallocated in
btrfs_save_ino_cache, there are no cow copies on disk.
The preallocated extent is not yet in the extent tree, and
btrfs_cross_ref_exist will ignore the -ENOENT returned by
check_committed_ref, so we can directly write the inode to the disk.
Fixes: 78d4295b1e ("btrfs: lift some btrfs_cross_ref_exist checks in nocow path")
CC: stable@vger.kernel.org # 4.18+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The log tree has a long standing problem that when a file is fsync'ed we
only check for new ancestors, created in the current transaction, by
following only the hard link for which the fsync was issued. We follow the
ancestors using the VFS' dget_parent() API. This means that if we create a
new link for a file in a directory that is new (or in an any other new
ancestor directory) and then fsync the file using an old hard link, we end
up not logging the new ancestor, and on log replay that new hard link and
ancestor do not exist. In some cases, involving renames, the file will not
exist at all.
Example:
mkfs.btrfs -f /dev/sdb
mount /dev/sdb /mnt
mkdir /mnt/A
touch /mnt/foo
ln /mnt/foo /mnt/A/bar
xfs_io -c fsync /mnt/foo
<power failure>
In this example after log replay only the hard link named 'foo' exists
and directory A does not exist, which is unexpected. In other major linux
filesystems, such as ext4, xfs and f2fs for example, both hard links exist
and so does directory A after mounting again the filesystem.
Checking if any new ancestors are new and need to be logged was added in
2009 by commit 12fcfd22fe ("Btrfs: tree logging unlink/rename fixes"),
however only for the ancestors of the hard link (dentry) for which the
fsync was issued, instead of checking for all ancestors for all of the
inode's hard links.
So fix this by tracking the id of the last transaction where a hard link
was created for an inode and then on fsync fallback to a full transaction
commit when an inode has more than one hard link and at least one new hard
link was created in the current transaction. This is the simplest solution
since this is not a common use case (adding frequently hard links for
which there's an ancestor created in the current transaction and then
fsync the file). In case it ever becomes a common use case, a solution
that consists of iterating the fs/subvol btree for each hard link and
check if any ancestor is new, could be implemented.
This solves many unexpected scenarios reported by Jayashree Mohan and
Vijay Chidambaram, and for which there is a new test case for fstests
under review.
Fixes: 12fcfd22fe ("Btrfs: tree logging unlink/rename fixes")
CC: stable@vger.kernel.org # 4.4+
Reported-by: Vijay Chidambaram <vvijay03@gmail.com>
Reported-by: Jayashree Mohan <jayashree2912@gmail.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The first auto-assigned value to enum is 0, we can use that and not
initialize all members where the auto-increment does the same. This is
used for values that are not part of on-disk format.
Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
We can use simple enum for values that are not part of on-disk format:
ordered extent flags.
Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
We can use simple enum for values that are not part of on-disk format:
extent map flags.
Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
We can use simple enum for values that are not part of on-disk format:
extent buffer flags;
Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
We can use simple enum for values that are not part of on-disk format:
root tree flags.
Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
We can use simple enum for values that are not part of on-disk format:
internal filesystem states.
Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We can use simple enum for values that are not part of on-disk format:
block reserve types.
Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
We can use simple enum for values that are not part of on-disk format:
global filesystem states.
Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
This function really checks whether adding more data to the bio will
straddle a stripe/chunk. So first let's give it a more appropraite name
- btrfs_bio_fits_in_stripe. Secondly, the offset parameter was never
used to just remove it. Thirdly, pages are submitted to either btree or
data inodes so it's guaranteed that tree->ops is set so replace the
check with an ASSERT. Finally, document the parameters of the function.
No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When it was introduced in commit f094ac32ab ("Btrfs: fix NULL pointer
after aborting a transaction"), it was not used.
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
Document why map_private_extent_buffer() cannot return '1' (i.e. the map
spans two pages) for the csum_tree_block() case.
The current algorithm for detecting a page boundary crossing in
map_private_extent_buffer() will return a '1' *IFF* the extent buffer's
offset in the page + the offset passed in by csum_tree_block() and the
minimal length passed in by csum_tree_block() - 1 are bigger than
PAGE_SIZE.
We always pass BTRFS_CSUM_SIZE (32) as offset and a minimal length of 32
and the current extent buffer allocator always guarantees page aligned
extends, so the above condition can't be true.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
In map_private_extent_buffer() the 'offset' variable is initialized to a
page aligned version of the 'start' parameter.
But later on it is overwritten with either the offset from the extent
buffer's start or 0.
So get rid of the initial initialization.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
When a transaction commit starts, it attempts to pause scrub and it blocks
until the scrub is paused. So while the transaction is blocked waiting for
scrub to pause, we can not do memory allocation with GFP_KERNEL from scrub,
otherwise we risk getting into a deadlock with reclaim.
Checking for scrub pause requests is done early at the beginning of the
while loop of scrub_stripe() and later in the loop, scrub_extent() and
scrub_raid56_parity() are called, which in turn call scrub_pages() and
scrub_pages_for_parity() respectively. These last two functions do memory
allocations using GFP_KERNEL. Same problem could happen while scrubbing
the super blocks, since it calls scrub_pages().
We also can not have any of the worker tasks, created by the scrub task,
doing GFP_KERNEL allocations, because before pausing, the scrub task waits
for all the worker tasks to complete (also done at scrub_stripe()).
So make sure GFP_NOFS is used for the memory allocations because at any
time a scrub pause request can happen from another task that started to
commit a transaction.
Fixes: 58c4e17384 ("btrfs: scrub: use GFP_KERNEL on the submission path")
CC: stable@vger.kernel.org # 4.6+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For data inodes this hook does nothing but to return -EAGAIN which is
used to signal to the endio routines that this bio belongs to a data
inode. If this is the case the actual retrying is handled by
bio_readpage_error. Alternatively, if this bio belongs to the btree
inode then btree_io_failed_hook just does some cleanup and doesn't retry
anything.
This patch simplifies the code flow by eliminating
readpage_io_failed_hook and instead open-coding btree_io_failed_hook in
end_bio_extent_readpage. Also eliminate some needless checks since IO is
always performed on either data inode or btree inode, both of which are
guaranteed to have their extent_io_tree::ops set.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The btrfs_bio_end_io_t typedef was introduced with commit
a1d3c4786a ("btrfs: btrfs_multi_bio replaced with btrfs_bio")
but never used anywhere. This commit also introduced a forward declaration
of 'struct btrfs_bio' which is only needed for btrfs_bio_end_io_t.
Remove both as they're not needed anywhere.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
The end_io callback implemented as btrfs_io_bio_endio_readpage only
calls kfree. Also the callback is set only in case the csum buffer is
allocated and not pointing to the inline buffer. We can use that
information to drop the indirection and call a helper that will free the
csums only in the right case.
This shrinks struct btrfs_io_bio by 8 bytes.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
The io_bio tracks checksums and has an inline buffer or an allocated
one. And there's a third member that points to the right one, but we
don't need to use an extra pointer for that. Let btrfs_io_bio::csum
point to the right buffer and check that the inline buffer is not
accidentally freed.
This shrinks struct btrfs_io_bio by 8 bytes.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
The async_cow::root is used to propagate fs_info to async_cow_submit.
We can't use inode to reach it because it could become NULL after
write without compression in async_cow_start.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
There's one caller and its code is simple, we can open code it in
run_one_async_done. The errors are passed through bio.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
Print a kernel log message when the balance ends, either for cancel or
completed or if it is paused.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The information about balance arguments is important for system audit,
this patch prints the textual representation when balance starts or is
resumed.
Example command:
$ btrfs balance start -f -mprofiles=raid1,convert=single,soft -dlimit=10..20,usage=50 /btrfs
Example kernel log output:
BTRFS info (device sdb): balance: start -f -dusage=50,limit=10..20 -mconvert=single,soft,profiles=raid1 -sconvert=single,soft,profiles=raid1
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog, simplify code ]
Signed-off-by: David Sterba <dsterba@suse.com>
Factor out helper that describes block group flags from
describe_relocation. The result will not be longer than the given size.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add comments ]
Signed-off-by: David Sterba <dsterba@suse.com>
If the quota enable and snapshot creation ioctls are called concurrently
we can get into a deadlock where the task enabling quotas will deadlock
on the fs_info->qgroup_ioctl_lock mutex because it attempts to lock it
twice, or the task creating a snapshot tries to commit the transaction
while the task enabling quota waits for the former task to commit the
transaction while holding the mutex. The following time diagrams show how
both cases happen.
First scenario:
CPU 0 CPU 1
btrfs_ioctl()
btrfs_ioctl_quota_ctl()
btrfs_quota_enable()
mutex_lock(fs_info->qgroup_ioctl_lock)
btrfs_start_transaction()
btrfs_ioctl()
btrfs_ioctl_snap_create_v2
create_snapshot()
--> adds snapshot to the
list pending_snapshots
of the current
transaction
btrfs_commit_transaction()
create_pending_snapshots()
create_pending_snapshot()
qgroup_account_snapshot()
btrfs_qgroup_inherit()
mutex_lock(fs_info->qgroup_ioctl_lock)
--> deadlock, mutex already locked
by this task at
btrfs_quota_enable()
Second scenario:
CPU 0 CPU 1
btrfs_ioctl()
btrfs_ioctl_quota_ctl()
btrfs_quota_enable()
mutex_lock(fs_info->qgroup_ioctl_lock)
btrfs_start_transaction()
btrfs_ioctl()
btrfs_ioctl_snap_create_v2
create_snapshot()
--> adds snapshot to the
list pending_snapshots
of the current
transaction
btrfs_commit_transaction()
--> waits for task at
CPU 0 to release
its transaction
handle
btrfs_commit_transaction()
--> sees another task started
the transaction commit first
--> releases its transaction
handle
--> waits for the transaction
commit to be completed by
the task at CPU 1
create_pending_snapshot()
qgroup_account_snapshot()
btrfs_qgroup_inherit()
mutex_lock(fs_info->qgroup_ioctl_lock)
--> deadlock, task at CPU 0
has the mutex locked but
it is waiting for us to
finish the transaction
commit
So fix this by setting the quota enabled flag in fs_info after committing
the transaction at btrfs_quota_enable(). This ends up serializing quota
enable and snapshot creation as if the snapshot creation happened just
before the quota enable request. The quota rescan task, scheduled after
committing the transaction in btrfs_quote_enable(), will do the accounting.
Fixes: 6426c7ad69 ("btrfs: qgroup: Fix qgroup accounting when creating snapshot")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The available allocation bits members from struct btrfs_fs_info are
protected by a sequence lock, and when starting balance we access them
incorrectly in two different ways:
1) In the read sequence lock loop at btrfs_balance() we use the values we
read from fs_info->avail_*_alloc_bits and we can immediately do actions
that have side effects and can not be undone (printing a message and
jumping to a label). This is wrong because a retry might be needed, so
our actions must not have side effects and must be repeatable as long
as read_seqretry() returns a non-zero value. In other words, we were
essentially ignoring the sequence lock;
2) Right below the read sequence lock loop, we were reading the values
from avail_metadata_alloc_bits and avail_data_alloc_bits without any
protection from concurrent writers, that is, reading them outside of
the read sequence lock critical section.
So fix this by making sure we only read the available allocation bits
while in a read sequence lock critical section and that what we do in the
critical section is repeatable (has nothing that can not be undone) so
that any eventual retry that is needed is handled properly.
Fixes: de98ced9e7 ("Btrfs: use seqlock to protect fs_info->avail_{data, metadata, system}_alloc_bits")
Fixes: 1450612797 ("btrfs: fix a bogus warning when converting only data or metadata")
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We can have a lot freed extents during the life span of transaction, so
the red black tree that keeps track of the ranges of each freed extent
(fs_info->freed_extents[]) can get quite big. When finishing a
transaction commit we find each range, process it (discard the extents,
unpin them) and then remove it from the red black tree.
We can use an extent state record as a cache when searching for a range,
so that when we clean the range we can use the cached extent state we
passed to the search function instead of iterating the red black tree
again. Doing things as fast as possible when finishing a transaction (in
state TRANS_STATE_UNBLOCKED) is convenient as it reduces the time we
block another task that wants to commit the next transaction.
So change clear_extent_dirty() to allow an optional extent state record to
be passed as an argument, which will be passed down to __clear_extent_bit.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This patch lands the last case which needs to be handled by the fsid
change code. Namely, this is the case where a multidisk filesystem has
already undergone at least one successful fsid change i.e all disks
have the METADATA_UUID incompat bit and power failure occurs as another
fsid change is in progress. When such an event occurs, disks could be
split in 2 groups. One of the groups will have both METADATA_UUID and
CHANGING_FSID_V2 flags set coupled with old fsid/metadata_uuid pairs.
The other group of disks will have only METADATA_UUID bit set and their
fsid will be different than the one in disks in the first group. Here
we look at the following cases:
a) A disk from the first group is scanned first, so fs_devices is
created with stale fsid/metdata_uuid. Then when a disk from the
second group is scanned it needs to first check whether there exists
such an fs_devices that has fsid_change set to true (because it was
created with a disk having the CHANGING_FSID_V2 flag), the
metadata_uuid and fsid of the fs_devices will be different (since it was
created by a disk which already has had at least 1 successful fsid change)
and finally the metadata_uuid of the fs_devices will equal that of the
currently scanned disk (because metadata_uuid never really changes).
When the correct fs_devices is found the information from the scanned
disk will replace the current one in fs_devices since the scanned disk
will have higher generation number.
b) A disk from the second group is scanned so fs_devices is created
as usual with differing fsid/metdata_uid. Then when a disk from the
first group is scanned the code detects that it has both
CHANGING_FSID_V2 and METADATA_UUID flags set and will search for
fs_devices that has differing metadata_uuid/fsid and whose
metadata_uuid is the same as that of the scanned device.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This commit continues hardening the scanning code to handle cases where
power loss could have caused disks in a multi-disk filesystem to be
in inconsistent state. Namely handle the situation that can occur when
some of the disks in multi-disk fs have completed their fsid change i.e
they have METADATA_UUID incompat flag set, have cleared the
CHANGING_FSID_V2 flag and their fsid/metadata_uuid are different. At
the same time the other half of the disks will have their
fsid/metadata_uuid unchanged and will only have CHANGING_FSID_V2 flag.
This is handled by introducing code in the scan path which:
a) Handles the case when a device with CHANGING_FSID_V2 flag is
scanned and as a result btrfs_fs_devices is created with matching
fsid/metdata_uuid. Subsequently, when a device with completed fsid
change is scanned it will detect this via the new code in find_fsid
i.e that such an fs_devices exist that fsid_change flag is set to true,
it's metadata_uuid/fsid match and the metadata_uuid of the scanned
device matches that of the fs_devices. In this case, it's important to
note that the devices which has its fsid change completed will have a
higher generation number than the device with FSID_CHANGING_V2 flag
set, so its superblock block will be used during mount. To prevent an
assertion triggering because the sb used for mounting will have
differing fsid/metadata_uuid than the ones in the fs_devices struct
also add code in device_list_add which overwrites the values in
fs_devices.
b) Alternatively we can end up with a device that completed its
fsid change be scanned first which will create the respective
btrfs_fs_devices struct with differing fsid/metadata_uuid. In this
case when a device with FSID_CHANGING_V2 flag set is scanned it will
call the newly added find_fsid_inprogress function which will return
the correct fs_devices.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In order to gracefully handle split-brain scenario during fsid change
(which are very unlikely, yet possible), two more pieces of information
will be necessary:
1. The highest generation number among all devices registered to a
particular btrfs_fs_devices
2. A boolean flag whether a given btrfs_fs_devices was created by a
device which had the FSID_CHANGING_V2 flag set.
This is a preparatory patch and just introduces the variables as well
as code which sets them, their actual use is going to happen in a later
patch.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Even though fsid change without rewrite is a very quick operation it's
still possible to experience a split-brain scenario if power loss occurs
at the most inconvenient time. This patch handles the case where power
failure occurs while the first transaction (the one setting
CHANGING_FSID_V2) flag is being persisted on disk. This can cause the
btrfs_fs_devices of this filesystem to be created by a device which:
a) has the CHANGING_FSID_V2 flag set but its fsid value is intact
b) or a device which doesn't have CHANGING_FSID_V2 flag set and its
fsid value is intact
This situation is trivially handled by the current find_fsid code since
in both cases the devices are going to be treated like ordinary devices.
Since btrfs is always mounted using the superblock of the latest
device (the one with highest generation number), meaning it will have
the CHANGING_FSID_V2 flag set, ensure it's being cleared on mount. On
the first transaction commit following mount all disks will have it
cleared.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently btrfs_fs_info structure contains a copy of the
fsid/metadata_uuid fields. Same values are also contained in the
btrfs_fs_devices structure which fs_info has a reference to. Let's
reduce duplication by removing the fields from fs_info and always refer
to the ones in fs_devices. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since the metadata_uuid is a new incompat feature it requires the
respective sysfs hooks. This patch adds the 'metdata_uuid' feature to
be shown if it supported by the kernel. Additionally it adds
/sys/fs/btrfs/UUID/metadata_uuid attribute which allows one to read
the current metadata_uuid.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This field is going to be used when the user wants to change the UUID
of the filesystem without having to rewrite all metadata blocks. This
field adds another level of indirection such that when the FSID is
changed what really happens is the current UUID (the one with which the
fs was created) is copied to the 'metadata_uuid' field in the superblock
as well as a new incompat flag is set METADATA_UUID. When the kernel
detects this flag is set it knows that the superblock in fact has 2
UUIDs:
1. Is the UUID which is user-visible, currently known as FSID.
2. Metadata UUID - this is the UUID which is stamped into all on-disk
datastructures belonging to this file system.
When the new incompat flag is present device scanning checks whether
both fsid/metadata_uuid of the scanned device match any of the
registered filesystems. When the flag is not set then both UUIDs are
equal and only the FSID is retained on disk, metadata_uuid is set only
in-memory during mount.
Additionally a new metadata_uuid field is also added to the fs_info
struct. It's initialised either with the FSID in case METADATA_UUID
incompat flag is not set or with the metdata_uuid of the superblock
otherwise.
This commit introduces the new fields as well as the new incompat flag
and switches all users of the fsid to the new logic.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minor updates in comments ]
Signed-off-by: David Sterba <dsterba@suse.com>
Several functions in BTRFS are only used inside the source file they are
declared if CONFIG_BTRFS_FS_RUN_SANITY_TESTS is not defined. However if
CONFIG_BTRFS_FS_RUN_SANITY_TESTS is defined these functions are shared
with the unit tests code.
Before the introduction of the EXPORT_FOR_TESTS macro, these functions
could not be declared as static and the compiler had a harder task when
optimizing and inlining them.
As we have EXPORT_FOR_TESTS now, use it where appropriate to support the
compiler.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Depending on whether CONFIG_BTRFS_FS_RUN_SANITY_TESTS is set, some BTRFS
functions are either local to the file they are implemented in and thus
should be declared static or are called from within the test
implementation defined in a different file.
Introduce an EXPORT_FOR_TESTS macro which depending on
CONFIG_BTRFS_FS_RUN_SANITY_TESTS either adds the 'static' keyword to a
function or not.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
Up to commit 32955c5422 ("btrfs: switch to discard_new_inode()") the
drop_on_err variable in btrfs_mkdir() was used to check whether the
inode had to be dropped via iput().
After commit 32955c5422 ("btrfs: switch to discard_new_inode()")
discard_new_inode() is called when err is set and inode is non NULL.
Therefore drop_on_err is not used anymore and thus causes a warning when
building with -Wunused-but-set-variable.
Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
lock_delalloc_pages should only return 2 values - 0 in case of success
and -EAGAIN if the range of pages to be locked should be shrunk due to
some of gone. Manual inspections confirms that this is indeed the case
since __process_pages_contig is where lock_delalloc_pages gets its
return value. The latter always returns 0 or -EAGAIN so the invariant
holds. No functional changes.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All callers of this function pass BTRFS_MAX_EXTENT_SIZE (128M) so let's
reduce the argument count and make that a local variable. No functional
changes.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It's unnecessary to check map->stripes[i].dev for NULL given its value
is already set and dereferenced above the the check. No functional
changes.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
As of now only user requested replace cancel can cancel the
replace-scrub so no need to log the error.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When we successfully cancel the device replace, its scrub worker returns
-ECANCELED, which is then passed to btrfs_dev_replace_finishing.
It cleans up based on the returned status and propagates the same
-ECANCELED back the parent function. As of now only user can cancel the
replace-scrub, so its ok to silence the warning here.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We recast the replace return status
BTRFS_IOCTL_DEV_REPLACE_RESULT_SCRUB_INPROGRESS to 0, to indicate no
error.
And since BTRFS_IOCTL_DEV_REPLACE_RESULT_NO_ERROR should also return 0,
which is also declared as 0, so we just return. Instead add it to the if
statement so that there is enough clarity while reading the code.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When the replace state is in the suspended state, btrfs_scrub_cancel()
should fail with -ENOTCONN as there is no scrub running. As a safety
catch check if btrfs_scrub_cancel() returns -ENOTCONN and assert if it
doesn't.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The device-replace needs to check the result code of the scrub workers
in btrfs_dev_replace_cancel and distinguish if successful cancel
operation and when the there was no operation running.
If btrfs_scrub_cancel() fails, return
BTRFS_IOCTL_DEV_REPLACE_RESULT_NOT_STARTED so that user can try
to cancel the replace again.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
The device replace cancel thread can race with the replace start thread
and if fs_info::scrubs_running is not yet set, btrfs_scrub_cancel() will
fail to stop the scrub thread.
The scrub thread continues with the scrub for replace which then will
try to write to the target device and which is already freed by the
cancel thread.
scrub_setup_ctx() warns as tgtdev is NULL.
struct scrub_ctx *scrub_setup_ctx(struct btrfs_device *dev, int is_dev_replace)
{
...
if (is_dev_replace) {
WARN_ON(!fs_info->dev_replace.tgtdev); <===
sctx->pages_per_wr_bio = SCRUB_PAGES_PER_WR_BIO;
sctx->wr_tgtdev = fs_info->dev_replace.tgtdev;
sctx->flush_all_writes = false;
}
[ 6724.497655] BTRFS info (device sdb): dev_replace from /dev/sdb (devid 1) to /dev/sdc started
[ 6753.945017] BTRFS info (device sdb): dev_replace from /dev/sdb (devid 1) to /dev/sdc canceled
[ 6852.426700] WARNING: CPU: 0 PID: 4494 at fs/btrfs/scrub.c:622 scrub_setup_ctx.isra.19+0x220/0x230 [btrfs]
...
[ 6852.428928] RIP: 0010:scrub_setup_ctx.isra.19+0x220/0x230 [btrfs]
...
[ 6852.432970] Call Trace:
[ 6852.433202] btrfs_scrub_dev+0x19b/0x5c0 [btrfs]
[ 6852.433471] btrfs_dev_replace_start+0x48c/0x6a0 [btrfs]
[ 6852.433800] btrfs_dev_replace_by_ioctl+0x3a/0x60 [btrfs]
[ 6852.434097] btrfs_ioctl+0x2476/0x2d20 [btrfs]
[ 6852.434365] ? do_sigaction+0x7d/0x1e0
[ 6852.434623] do_vfs_ioctl+0xa9/0x6c0
[ 6852.434865] ? syscall_trace_enter+0x1c8/0x310
[ 6852.435124] ? syscall_trace_enter+0x1c8/0x310
[ 6852.435387] ksys_ioctl+0x60/0x90
[ 6852.435663] __x64_sys_ioctl+0x16/0x20
[ 6852.435907] do_syscall_64+0x50/0x180
[ 6852.436150] entry_SYSCALL_64_after_hwframe+0x49/0xbe
Further, as the replace thread enters scrub_write_page_to_dev_replace()
without the target device it panics:
static int scrub_add_page_to_wr_bio(struct scrub_ctx *sctx,
struct scrub_page *spage)
{
...
bio_set_dev(bio, sbio->dev->bdev); <======
[ 6929.715145] BUG: unable to handle kernel NULL pointer dereference at 00000000000000a0
..
[ 6929.717106] Workqueue: btrfs-scrub btrfs_scrub_helper [btrfs]
[ 6929.717420] RIP: 0010:scrub_write_page_to_dev_replace+0xb4/0x260
[btrfs]
..
[ 6929.721430] Call Trace:
[ 6929.721663] scrub_write_block_to_dev_replace+0x3f/0x60 [btrfs]
[ 6929.721975] scrub_bio_end_io_worker+0x1af/0x490 [btrfs]
[ 6929.722277] normal_work_helper+0xf0/0x4c0 [btrfs]
[ 6929.722552] process_one_work+0x1f4/0x520
[ 6929.722805] ? process_one_work+0x16e/0x520
[ 6929.723063] worker_thread+0x46/0x3d0
[ 6929.723313] kthread+0xf8/0x130
[ 6929.723544] ? process_one_work+0x520/0x520
[ 6929.723800] ? kthread_delayed_work_timer_fn+0x80/0x80
[ 6929.724081] ret_from_fork+0x3a/0x50
Fix this by letting the btrfs_dev_replace_finishing() to do the job of
cleaning after the cancel, including freeing of the target device.
btrfs_dev_replace_finishing() is called when btrfs_scub_dev() returns
along with the scrub return status.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In a secnario where balance and replace co-exists as below,
- start balance
- pause balance
- start replace
- reboot
and when system restarts, balance resumes first. Then the replace is
attempted to restart but will fail as the EXCL_OP lock is already held
by the balance. If so place the replace state back to
BTRFS_IOCTL_DEV_REPLACE_STATE_SUSPENDED state.
Fixes: 010a47bde9 ("btrfs: add proper safety check before resuming dev-replace")
CC: stable@vger.kernel.org # 4.18+
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At the time of forced unmount we place the running replace to
BTRFS_IOCTL_DEV_REPLACE_STATE_SUSPENDED state, so when the system comes
back and expect the target device is missing.
Then let the replace state continue to be in
BTRFS_IOCTL_DEV_REPLACE_STATE_SUSPENDED state instead of
BTRFS_IOCTL_DEV_REPLACE_STATE_STARTED as there isn't any matching scrub
running as part of replace.
Fixes: e93c89c1aa ("Btrfs: add new sources for device replace code")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There isn't any other consumer other than in its own file dev-replace.c.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It's not that impossible to imagine that a device OR a btrfs image is
copied just by using the dd or the cp command. Which in case both the
copies of the btrfs will have the same fsid. If on the system with
automount enabled, the copied FS gets scanned.
We have a known bug in btrfs, that we let the device path be changed
after the device has been mounted. So using this loop hole the new
copied device would appears as if its mounted immediately after it's
been copied.
For example:
Initially.. /dev/mmcblk0p4 is mounted as /
$ lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
mmcblk0 179:0 0 29.2G 0 disk
|-mmcblk0p4 179:4 0 4G 0 part /
|-mmcblk0p2 179:2 0 500M 0 part /boot
|-mmcblk0p3 179:3 0 256M 0 part [SWAP]
`-mmcblk0p1 179:1 0 256M 0 part /boot/efi
$ btrfs fi show
Label: none uuid: 07892354-ddaa-4443-90ea-f76a06accaba
Total devices 1 FS bytes used 1.40GiB
devid 1 size 4.00GiB used 3.00GiB path /dev/mmcblk0p4
Copy mmcblk0 to sda
$ dd if=/dev/mmcblk0 of=/dev/sda
And immediately after the copy completes the change in the device
superblock is notified which the automount scans using btrfs device scan
and the new device sda becomes the mounted root device.
$ lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 1 14.9G 0 disk
|-sda4 8:4 1 4G 0 part /
|-sda2 8:2 1 500M 0 part
|-sda3 8:3 1 256M 0 part
`-sda1 8:1 1 256M 0 part
mmcblk0 179:0 0 29.2G 0 disk
|-mmcblk0p4 179:4 0 4G 0 part
|-mmcblk0p2 179:2 0 500M 0 part /boot
|-mmcblk0p3 179:3 0 256M 0 part [SWAP]
`-mmcblk0p1 179:1 0 256M 0 part /boot/efi
$ btrfs fi show /
Label: none uuid: 07892354-ddaa-4443-90ea-f76a06accaba
Total devices 1 FS bytes used 1.40GiB
devid 1 size 4.00GiB used 3.00GiB path /dev/sda4
The bug is quite nasty that you can't either unmount /dev/sda4 or
/dev/mmcblk0p4. And the problem does not get solved until you take sda
out of the system on to another system to change its fsid using the
'btrfstune -u' command.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of hardcoding exceptions for RAID5 and RAID6 in the code, use an
nparity field in raid_attr.
Signed-off-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com>
Signed-off-by: David Sterba <dsterba@suse.com>
RAID5 and RAID6 profile store one copy of the data, not 2 or 3. These
values are not yet used anywhere so there's no change.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit 92e222df7b "btrfs: alloc_chunk: fix DUP stripe size handling"
fixed calculating the stripe_size for a new DUP chunk.
However, the same calculation reappears a bit later, and that one was
not changed yet. The resulting bug that is exposed is that the newly
allocated device extents ('stripes') can have a few MiB overlap with the
next thing stored after them, which is another device extent or the end
of the disk.
The scenario in which this can happen is:
* The block device for the filesystem is less than 10GiB in size.
* The amount of contiguous free unallocated disk space chosen to use for
chunk allocation is 20% of the total device size, or a few MiB more or
less.
An example:
- The filesystem device is 7880MiB (max_chunk_size gets set to 788MiB)
- There's 1578MiB unallocated raw disk space left in one contiguous
piece.
In this case stripe_size is first calculated as 789MiB, (half of
1578MiB).
Since 789MiB (stripe_size * data_stripes) > 788MiB (max_chunk_size), we
enter the if block. Now stripe_size value is immediately overwritten
while calculating an adjusted value based on max_chunk_size, which ends
up as 788MiB.
Next, the value is rounded up to a 16MiB boundary, 800MiB, which is
actually more than the value we had before. However, the last comparison
fails to detect this, because it's comparing the value with the total
amount of free space, which is about twice the size of stripe_size.
In the example above, this means that the resulting raw disk space being
allocated is 1600MiB, while only a gap of 1578MiB has been found. The
second device extent object for this DUP chunk will overlap for 22MiB
with whatever comes next.
The underlying problem here is that the stripe_size is reused all the
time for different things. So, when entering the code in the if block,
stripe_size is immediately overwritten with something else. If later we
decide we want to have the previous value back, then the logic to
compute it was copy pasted in again.
With this change, the value in stripe_size is not unnecessarily
destroyed, so the duplicated calculation is not needed any more.
Signed-off-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The variable num_bytes is really a way too generic name for a variable
in this function. There are a dozen other variables that hold a number
of bytes as value.
Give it a name that actually describes what it does, which is holding
the size of the chunk that we're allocating.
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The variable num_bytes is used to store the chunk length of the chunk
that we're allocating. Do not reuse it for something really different in
the same function.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Snapshot is expected to be fast. But if there are writers steadily
creating dirty pages in our subvolume, the snapshot may take a very long
time to complete. To fix the problem, we use tagged writepage for
snapshot flusher as we do in the generic write_cache_pages(), so we can
omit pages dirtied after the snapshot command.
This does not change the semantics regarding which data get to the
snapshot, if there are pages being dirtied during the snapshotting
operation. There's a sync called before snapshot is taken in old/new
case, any IO in flight just after that may be in the snapshot but this
depends on other system effects that might still sync the IO.
We do a simple snapshot speed test on a Intel D-1531 box:
fio --ioengine=libaio --iodepth=32 --bs=4k --rw=write --size=64G
--direct=0 --thread=1 --numjobs=1 --time_based --runtime=120
--filename=/mnt/sub/testfile --name=job1 --group_reporting & sleep 5;
time btrfs sub snap -r /mnt/sub /mnt/snap; killall fio
original: 1m58sec
patched: 6.54sec
This is the best case for this patch since for a sequential write case,
we omit nearly all pages dirtied after the snapshot command.
For a multi writers, random write test:
fio --ioengine=libaio --iodepth=32 --bs=4k --rw=randwrite --size=64G
--direct=0 --thread=1 --numjobs=4 --time_based --runtime=120
--filename=/mnt/sub/testfile --name=job1 --group_reporting & sleep 5;
time btrfs sub snap -r /mnt/sub /mnt/snap; killall fio
original: 15.83sec
patched: 10.35sec
The improvement is smaller compared to the sequential write case,
since we omit only half of the pages dirtied after snapshot command.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Ethan Lien <ethanlien@synology.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This parameter was never used, yet was part of the interface of the
function ever since its introduction as extent_io_ops::writepage_end_io_hook
in e6dcd2dc9c ("Btrfs: New data=ordered implementation"). Now that
NULL is passed everywhere as a value for this parameter let's remove it
for good. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The only remaining use of the 'epd' argument in writepage_delalloc is
to reference the extent_io_tree which was set in extent_writepages. Since
it is guaranteed that page->mapping of any page passed to
writepage_delalloc (and __extent_writepage as the sole caller) to be
equal to that passed in extent_writepages we can directly get the
io_tree via the already passed inode (which is also taken from
page->mapping->host). No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If epd::extent_locked is set then writepage_delalloc terminates. Make
this a bit more apparent in the caller by simply bubbling the check up.
This enables to remove epd as an argument to writepage_delalloc in a
future patch. No functional change.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Before btrfs_map_bio submits all stripe bios it does a number of checks
to ensure the device for every stripe is present. However, it doesn't do
a DEV_STATE_MISSING check, instead this is relegated to the lower level
btrfs_schedule_bio (in the async submission case, sync submission
doesn't check DEV_STATE_MISSING at all). Additionally
btrfs_schedule_bios does the duplicate device->bdev check which has
already been performed in btrfs_map_bio.
This patch moves the DEV_STATE_MISSING check in btrfs_map_bio and
removes the duplicate device->bdev check. Doing so ensures that no bio
cloning/submission happens for both async/sync requests in the face of
missing device. This makes the async io submission path slightly shorter
in terms of instruction count. No functional changes.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
dev_replace::replace_state has been set to
BTRFS_DEV_REPLACE_ITEM_STATE_NEVER_STARTED (0) in the same function,
So delete the line which sets replace_state = 0;
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The io_err field of struct btrfs_log_ctx is no longer used after the
recent simplification of the fast fsync path, where we now wait for
ordered extents to complete before logging the inode. We did this in
commit b5e6c3e170 ("btrfs: always wait on ordered extents at fsync
time") and commit a2120a473a ("btrfs: clean up the left over
logged_list usage") removed its last use.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We currently are in a loop finding each range (corresponding to a btree
node/leaf) in a log root's extent io tree and then clean it up. This is a
waste of time since we are traversing the extent io tree's rb_tree more
times then needed (one for a range lookup and another for cleaning it up)
without any good reason.
We free the log trees when we are in the critical section of a transaction
commit (the transaction state is set to TRANS_STATE_COMMIT_DOING), so it's
of great convenience to do everything as fast as possible in order to
reduce the time we block other tasks from starting a new transaction.
So fix this by traversing the extent io tree once and cleaning up all its
records in one go while traversing it.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The loop construct in free_extent_buffer was added in
242e18c7c1 ("Btrfs: reduce lock contention on extent buffer locks")
as means of reducing the times the eb lock is taken, the non-last ref
count is decremented and lock is released. As the special handling
of UNMAPPED extent buffers was removed now there is only one decrement
op which is happening for EXTENT_BUFFER_UNMAPPED case.
This commit modifies the loop condition so that in case of UNMAPPED
buffers the eb's lock is taken only if we are 100% sure the eb is going
to be freed by the current executor of the code. Additionally, remove
superfluous ref count ops in btrfs test.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that the whole of btrfs code has been audited for eb reference count
management it's time to remove the hunk in free_extent_buffer that
essentially considered the condition
"eb->ref == 2 && EXTENT_BUFFER_DUMMY"
to equal "eb->ref = 1". Also remove the last location
which takes an extra reference count in alloc_test_extent_buffer.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In qgroup_rescan_leaf a copy is made of the target leaf by calling
btrfs_clone_extent_buffer. The latter allocates a new buffer and
attaches a new set of pages and copies the content of the source buffer.
The new scratch buffer is only used to iterate it's items, it's not
published anywhere and cannot be accessed by a third party.
Hence, it's not necessary to perform any locking on it whatsoever.
Furthermore, remove the extra extent_buffer_get call since the new
buffer is always allocated with a reference count of 1 which is
sufficient here. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When the 2 comparison trees roots are initialised they are private to
the function and already have reference counts of 1 each. There is no
need to further increment the reference count since the cloned buffers
are already accessed via struct btrfs_path. Eventually the 2 paths used
for comparison are going to be released, effectively disposing of the
cloned buffers.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When a rewound buffer is created it already has a ref count of 1 and the
dummy flag set. Then another ref is taken bumping the count to 2.
Finally when this buffer is released from btrfs_release_path the extra
reference is decremented by the special handling code in
free_extent_buffer.
However, this special code is in fact redundant sinca ref count of 1 is
still correct since the buffer is only accessed via btrfs_path struct.
This paves the way forward of removing the special handling in
free_extent_buffer.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
get_old_root used used only by btrfs_search_old_slot to initialise the
path structure. The old root is always a cloned buffer (either via alloc
dummy or via btrfs_clone_extent_buffer) and its reference count is 2: 1
from allocation, 1 from extent_buffer_get call in get_old_root.
This latter explicit ref count acquire operation is in fact unnecessary
since the semantic is such that the newly allocated buffer is handed
over to the btrfs_path for lifetime management. Considering this just
remove the extra extent_buffer_get in get_old_root.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In iterate_inode_exrefs the eb is cloned via btrfs_clone_extent_buffer
which creates a private extent buffer with the dummy flag set and ref
count of 1. Then this buffer is locked for reading and its ref count is
incremented by 1. Finally it's fed to the passed iterate_irefs_t
function. The actual iterate call back is inode_to_path (coming from
paths_from_inode) which feeds the eb to btrfs_ref_to_path. In this final
function the passed eb is only read by first assigning it to the local
eb variable. This variable is only modified in the case another eb was
referenced from the passed path that is eb != eb_in check triggers.
Considering this there is no point in locking the cloned eb in
iterate_inode_refs since it's never being modified and is not published
anywhere. Furthermore the cloned eb is completely fine having its ref
count be 1.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In iterate_inode_refs the eb is cloned via btrfs_clone_extent_buffer
which creates a private extent buffer with the dummy flag set and ref
count of 1. Then this buffer is locked for reading and its ref count is
incremented by 1. Finally it's fed to the passed iterate_irefs_t
function. The actual iterate call back is inode_to_path (coming from
paths_from_inode) which feeds the eb to btrfs_ref_to_path. In this final
function the passed eb is only read by first assigning it to the local
eb variable. This variable is only modified in the case another eb was
referenced from the passed path that is eb != eb_in check triggers.
Considering this there is no point in locking the cloned eb in
iterate_inode_refs since it's never being modified and is not published
anywhere. Furthermore the cloned eb is completely fine having its ref
count be 1.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In extent-io self test, we need 2 ordered extents at its maximum size to
do the test.
Instead of using the intermediate numbers, use BTRFS_MAX_EXTENT_SIZE for
@max_bytes, and twice @max_bytes for @total_dirty. This should explain
why we need all these magic numbers and prevent people to modify them by
accident.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Btrfs has not allowed swap files since commit 35054394c4 ("Btrfs: stop
providing a bmap operation to avoid swapfile corruptions"). However, now
that the proper restrictions are in place, Btrfs can support swap files
through the swap file a_ops, similar to iomap in commit 67482129cd
("iomap: add a swapfile activation function").
For Btrfs, activation needs to make sure that the file can be used as a
swap file, which currently means that it must be fully allocated as
NOCOW with no compression on one device. It must also do the proper
tracking so that ioctls will not interfere with the swap file.
Deactivation clears this tracking.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The Btrfs swap code is going to need it, so give it a btrfs_ prefix and
make it non-static.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
A later patch will implement swap file support for Btrfs, but before we
do that, we need to make sure that the various Btrfs ioctls cannot
change a swap file.
When a swap file is active, we must make sure that the extents of the
file are not moved and that they don't become shared. That means that
the following are not safe:
- chattr +c (enable compression)
- reflink
- dedupe
- snapshot
- defrag
Don't allow those to happen on an active swap file.
Additionally, balance, resize, device remove, and device replace are
also unsafe if they affect an active swapfile. Add a red-black tree of
block groups and devices which contain an active swapfile. Relocation
checks each block group against this tree and skips it or errors out for
balance or resize, respectively. Device remove and device replace check
the tree for the device they will operate on.
Note that we don't have to worry about chattr -C (disable nocow), which
we ignore for non-empty files, because an active swapfile must be
non-empty and can't be truncated. We also don't have to worry about
autodefrag because it's only done on COW files. Truncate and fallocate
are already taken care of by the generic code. Device add doesn't do
relocation so it's not an issue, either.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is the counterpart to merge_extent_hook, similarly, it's used only
for data/freespace inodes so let's remove it, rename it and call it
directly where necessary. No functional changes.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This callback is used only for data and free space inodes. Such inodes
are guaranteed to have their extent_io_tree::private_data set to the
inode struct. Exploit this fact to directly call the function. Also give
it a more descriptive name. No functional changes.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is the counterpart to ex-set_bit_hook (now btrfs_set_delalloc_extent),
similar to what was done before remove clear_bit_hook and rename the
function. No functional changes.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This callback is used to properly account delalloc extents for data
inodes (ordinary file inodes and freespace v1 inodes). Those can be
easily identified since they have their extent_io trees ->private_data
member point to the inode. Let's exploit this fact to remove the
needless indirection through extent_io_hooks and directly call the
function. Also give the function a name which reflects its purpose -
btrfs_set_delalloc_extent.
This patch also modified test_find_delalloc so that the extent_io_tree
used for testing doesn't have its ->private_data set which would have
caused a crash in btrfs_set_delalloc_extent due to the btrfs_inode->root
member not being initialised. The old version of the code also didn't
call set_bit_hook since the extent_io ops weren't set for the inode. No
functional changes.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This callback was only used in debug builds by btrfs_leak_debug_check.
A better approach is to move its implementation in
btrfs_leak_debug_check and ensure the latter is only executed for extent
tree which have ->private_data set i.e. relate to a data node and not
the btree one. No functional changes.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This callback is ony ever called for data page writeout so there is no
need to actually abstract it via extent_io_ops. Lets just export it,
remove the definition of the callback and call it directly in the
functions that invoke the callback. Also rename the function to
btrfs_writepage_endio_finish_ordered since what it really does is
account finished io in the ordered extent data structures. No
functional changes.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This hook is called only from __extent_writepage_io which is already
called only from the data page writeout path. So there is no need to
make an indirect call via extent_io_ops. This patch just removes the
callback definition, exports the callback function and calls it directly
at the only call site. Also give the function a more descriptive name.
No functional changes.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This callback is called only from writepage_delalloc which in turn is
guaranteed to be called from the data page writeout path. In the end
there is no reason to have the call to this function to be indrected via
the extent_io_ops structure. This patch removes the callback definition,
exports the function and calls it directly. No functional changes.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ rename to btrfs_run_delalloc_range ]
Signed-off-by: David Sterba <dsterba@suse.com>
This will be used in future patches that remove the optional
extent_io_ops callbacks.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add extra dev extent end check against device boundary.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Enhance btrfs_verify_dev_extents() to remember previous checked dev
extents, so it can verify no dev extents can overlap.
Analysis from Hans:
"Imagine allocating a DATA|DUP chunk.
In the chunk allocator, we first set...
max_stripe_size = SZ_1G;
max_chunk_size = BTRFS_MAX_DATA_CHUNK_SIZE
... which is 10GiB.
Then...
/* we don't want a chunk larger than 10% of writeable space */
max_chunk_size = min(div_factor(fs_devices->total_rw_bytes, 1),
max_chunk_size);
Imagine we only have one 7880MiB block device in this filesystem. Now
max_chunk_size is down to 788MiB.
The next step in the code is to search for max_stripe_size * dev_stripes
amount of free space on the device, which is in our example 1GiB * 2 =
2GiB. Imagine the device has exactly 1578MiB free in one contiguous
piece. This amount of bytes will be put in devices_info[ndevs - 1].max_avail
Next we recalculate the stripe_size (which is actually the device extent
length), based on the actual maximum amount of available raw disk space:
stripe_size = div_u64(devices_info[ndevs - 1].max_avail, dev_stripes);
stripe_size is now 789MiB
Next we do...
data_stripes = num_stripes / ncopies
...where data_stripes ends up as 1, because num_stripes is 2 (the amount
of device extents we're going to have), and DUP has ncopies 2.
Next there's a check...
if (stripe_size * data_stripes > max_chunk_size)
...which matches because 789MiB * 1 > 788MiB.
We go into the if code, and next is...
stripe_size = div_u64(max_chunk_size, data_stripes);
...which resets stripe_size to max_chunk_size: 788MiB
Next is a fun one...
/* bump the answer up to a 16MB boundary */
stripe_size = round_up(stripe_size, SZ_16M);
...which changes stripe_size from 788MiB to 800MiB.
We're not done changing stripe_size yet...
/* But don't go higher than the limits we found while searching
* for free extents
*/
stripe_size = min(devices_info[ndevs - 1].max_avail,
stripe_size);
This is bad. max_avail is twice the stripe_size (we need to fit 2 device
extents on the same device for DUP).
The result here is that 800MiB < 1578MiB, so it's unchanged. However,
the resulting DUP chunk will need 1600MiB disk space, which isn't there,
and the second dev_extent might extend into the next thing (next
dev_extent? end of device?) for 22MiB.
The last shown line of code relies on a situation where there's twice
the value of stripe_size present as value for the variable stripe_size
when it's DUP. This was actually the case before commit 92e222df7b
"btrfs: alloc_chunk: fix DUP stripe size handling", from which I quote:
"[...] in the meantime there's a check to see if the stripe_size does
not exceed max_chunk_size. Since during this check stripe_size is twice
the amount as intended, the check will reduce the stripe_size to
max_chunk_size if the actual correct to be used stripe_size is more than
half the amount of max_chunk_size."
In the previous version of the code, the 16MiB alignment (why is this
done, by the way?) would result in a 50% chance that it would actually
do an 8MiB alignment for the individual dev_extents, since it was
operating on double the size. Does this matter?
Does it matter that stripe_size can be set to anything which is not
16MiB aligned because of the amount of remaining available disk space
which is just taken?
What is the main purpose of this round_up?
The most straightforward thing to do seems something like...
stripe_size = min(
div_u64(devices_info[ndevs - 1].max_avail, dev_stripes),
stripe_size
)
..just putting half of the max_avail into stripe_size."
Link: https://lore.kernel.org/linux-btrfs/b3461a38-e5f8-f41d-c67c-2efac8129054@mendix.com/
Reported-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
[ add analysis from report ]
Signed-off-by: David Sterba <dsterba@suse.com>
We have a complex loop design for find_free_extent(), that has different
behavior for each loop, some even includes new chunk allocation.
Instead of putting such a long code into find_free_extent() and makes it
harder to read, just extract them into find_free_extent_update_loop().
With all the cleanups, the main find_free_extent() should be pretty
barebone:
find_free_extent()
|- Iterate through all block groups
| |- Get a valid block group
| |- Try to do clustered allocation in that block group
| |- Try to do unclustered allocation in that block group
| |- Check if the result is valid
| | |- If valid, then exit
| |- Jump to next block group
|
|- Push harder to find free extents
|- If not found, re-iterate all block groups
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Su Yue <suy.fnst@cn.fujitsu.com>
[ copy callchain from changelog to function comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
This patch will extract unclsutered extent allocation code into
find_free_extent_unclustered().
And this helper function will use return value to indicate what to do
next.
This should make find_free_extent() a little easier to read.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Su Yue <suy.fnst@cn.fujitsu.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
[Update merge conflict with fb5c39d7a8 ("btrfs: don't use ctl->free_space for max_extent_size")]
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have two main methods to find free extents inside a block group:
1) clustered allocation
2) unclustered allocation
This patch will extract the clustered allocation into
find_free_extent_clustered() to make it a little easier to read.
Instead of jumping between different labels in find_free_extent(), the
helper function will use return value to indicate different behavior.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Su Yue <suy.fnst@cn.fujitsu.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of tons of different local variables in find_free_extent(),
extract them into find_free_extent_ctl structure, and add better
explanation for them.
Some modification may looks redundant, but will later greatly simplify
function parameter list during find_free_extent() refactor.
Also add two comments to co-operate with fb5c39d7a8 ("btrfs: don't use
ctl->free_space for max_extent_size"), to make ffe_ctl->max_extent_size
update more reader-friendly.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Su Yue <suy.fnst@cn.fujitsu.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Introduce a new wrapper update_bytes_pinned to replace open coded
bytes_pinned modifiers. Now the underflows of space_info::bytes_pinned
get detected and reported.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Although we have space_info::bytes_may_use underflow detection in
btrfs_free_reserved_data_space_noquota(), we have more callers who are
subtracting number from space_info::bytes_may_use.
So instead of doing underflow detection for every caller, introduce a
new wrapper update_bytes_may_use() to replace open coded bytes_may_use
modifiers.
This also introduce a macro to declare more wrappers, but currently
space_info::bytes_may_use is the mostly interesting one.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Tracking pending ordered extents per transaction was introduced in commit
50d9aa99bd ("Btrfs: make sure logged extents complete in the current
transaction V3") and later updated in commit 161c3549b4 ("Btrfs: change
how we wait for pending ordered extents").
However now that on fsync we always wait for ordered extents to complete
before logging, done in commit 5636cf7d6d ("btrfs: remove the logged
extents infrastructure"), we no longer need the stuff to track for pending
ordered extents, which was not completely removed in the mentioned commit.
So remove the remaining of the pending ordered extents infrastructure.
Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The logged_start and logged_end variables, at btrfs_log_changed_extents,
were added in commit 8c6c592831 ("btrfs: log csums for all modified
extents"). However since the recent simplification for fsync, which makes
us wait for all ordered extents to complete before logging extents, we
no longer need those variables. Commit a2120a473a ("btrfs: clean up the
left over logged_list usage") forgot to remove them.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Use the aptly named function rather than open coding it. No functional
changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Merge misc fixes from Andrew Morton:
"11 fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
scripts/spdxcheck.py: always open files in binary mode
checkstack.pl: fix for aarch64
userfaultfd: check VM_MAYWRITE was set after verifying the uffd is registered
fs/iomap.c: get/put the page in iomap_page_create/release()
hugetlbfs: call VM_BUG_ON_PAGE earlier in free_huge_page()
memblock: annotate memblock_is_reserved() with __init_memblock
psi: fix reference to kernel commandline enable
arch/sh/include/asm/io.h: provide prototypes for PCI I/O mapping in asm/io.h
mm/sparse: add common helper to mark all memblocks present
mm: introduce common STRUCT_PAGE_MAX_SHIFT define
alpha: fix hang caused by the bootmem removal
Calling UFFDIO_UNREGISTER on virtual ranges not yet registered in uffd
could trigger an harmless false positive WARN_ON. Check the vma is
already registered before checking VM_MAYWRITE to shut off the false
positive warning.
Link: http://lkml.kernel.org/r/20181206212028.18726-2-aarcange@redhat.com
Cc: <stable@vger.kernel.org>
Fixes: 29ec90660d ("userfaultfd: shmem/hugetlbfs: only allow to register VM_MAYWRITE vmas")
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: syzbot+06c7092e7d71218a2c16@syzkaller.appspotmail.com
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Hugh Dickins <hughd@google.com>
Acked-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
migrate_page_move_mapping() expects pages with private data set to have
a page_count elevated by 1. This is what used to happen for xfs through
the buffer_heads code before the switch to iomap in commit 82cb14175e
("xfs: add support for sub-pagesize writeback without buffer_heads").
Not having the count elevated causes move_pages() to fail on memory
mapped files coming from xfs.
Make iomap compatible with the migrate_page_move_mapping() assumption by
elevating the page count as part of iomap_page_create() and lowering it
in iomap_page_release().
It causes the move_pages() syscall to misbehave on memory mapped files
from xfs. It does not not move any pages, which I suppose is "just" a
perf issue, but it also ends up returning a positive number which is out
of spec for the syscall. Talking to Michal Hocko, it sounds like
returning positive numbers might be a necessary update to move_pages()
anyway though
(https://lkml.kernel.org/r/20181116114955.GJ14706@dhcp22.suse.cz).
I only hit this in tests that verify that move_pages() actually moved
the pages. The test also got confused by the positive return from
move_pages() (it got treated as a success as positive numbers were not
expected and not handled) making it a bit harder to track down what's
going on.
Link: http://lkml.kernel.org/r/20181115184140.1388751-1-pjaroszynski@nvidia.com
Fixes: 82cb14175e ("xfs: add support for sub-pagesize writeback without buffer_heads")
Signed-off-by: Piotr Jaroszynski <pjaroszynski@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: William Kucharski <william.kucharski@oracle.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Brian Foster <bfoster@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAlwUCecQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpsAvD/9Uwn/Mthw/w3BBHL1eVjFnCGHJ8/zVDSSu
/XaP3nXugHicL7Je2EyHs0l4teUC42nIU7SQ+9B/UtyXA+usNvV7YEYqTD0U4ZYy
gLuBL70f6L17uDq5821jR4e64Q3sDy/OgcTpKu3VwnJPYUhJpQuU7fKqN4zi7Asn
5sI4KD99txmFMRNu33uusy98kJ4w5aeg6W1ACP7AxXYjyCY6Chi/7go8fEEK1Szb
9jx0BhbscuZplvvzHw3WAHAaVGGUIjLlmUdqJOisj/GVnCck2IqKqhXND9H/WNWk
FmzJeNTIq9Ag6kKN2WJ6xweNF6U+0Kb1Nn3F4jhoaVBhLxhpsjZ6BdHGWiT5wu4T
xd+hIC4pykA7Y7CqPfpbNZzriX4SJd508XerVjfBmgWIumJLG2D2IfquHmT0udRw
Qlb8pa3na5/dWXYbCOSjQdeBC5vCKCUuGnw15F5wA293khBNGiXVlztwCNgrkclQ
4IpAEUYpRZiK138hDx/BL++yanALYdPHHUS4HtwVMhpi89Bel3FioNzGaaBlElZc
mdSCyHc0RC7uMPMEbWl4bR+/cE3yjf43Gg3ftLaKLo5IQmZtu1pCEVXMatplZ1U/
jbbwwPtx44RYV3gjPirr+YgF0smS9ZgMkkk3oLwysWSjAV5v+wW2MxjTBae3xaSJ
VNoLhtefkw==
=wHit
-----END PGP SIGNATURE-----
Merge tag 'for-linus-20181214' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
"Three small fixes for this week. contains:
- spectre indexing fix for aio (Jeff)
- fix for the previous zeroing bio fix, we don't need it for user
mapped pages, and in fact it breaks some applications if we do
(Keith)
- allocation failure fix for null_blk with zoned (Shin'ichiro)"
* tag 'for-linus-20181214' of git://git.kernel.dk/linux-block:
block: Fix null_blk_zoned creation failure with small number of zones
aio: fix spectre gadget in lookup_ioctx
block/bio: Do not zero user pages
Commit 9d5b86ac13 ("fs/locks: Remove fl_nspid and use fs-specific l_pid
for remote locks") specified that the l_pid returned for F_GETLK on a local
file that has a remote lock should be the pid of the lock manager process.
That commit, while updating other filesystems, failed to update lockd, such
that locks created by lockd had their fl_pid set to that of the remote
process holding the lock. Fix that here to be the pid of lockd.
Also, fix the client case so that the returned lock pid is negative, which
indicates a remote lock on a remote file.
Fixes: 9d5b86ac13 ("fs/locks: Remove fl_nspid and use fs-specific...")
Cc: stable@vger.kernel.org
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
side. Disable it for now.
-----BEGIN PGP SIGNATURE-----
iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAlwTzaUTHGlkcnlvbW92
QGdtYWlsLmNvbQAKCRBKf944AhHzi6eyB/97jq3Wzipr9I+Y3LwJeNKUu/8+LZ1d
EDSbL1mgskzv0iX9fCds+wTzMksvm4V+n7I/DvoadbSEgTz2dUXYqoKBf3+4PEd1
f7K69XHe8X/+ME3wGoOc4oLgNws0yGE5kcmiP5wB9Tlo8EkfqFuu6753wWi8xaUf
so3TX9C+prqYmmhqRI8sesU84rMKjTkkK5dvIgVW4WLBJKsVAhmooyuRSwdVGZOv
m7+8m7PWUJm8stMRDqsJ/0j4COiMeteQGQx6T67gkCUF6pBdJjO2IRNx2VAhybsx
k/sMM/VfrQth/V4zsHZQD75BDQ+wu3Ncz8axIT5QYsAtuh9x0zr2av2e
=kQS2
-----END PGP SIGNATURE-----
Merge tag 'ceph-for-4.20-rc7' of https://github.com/ceph/ceph-client
Pull ceph fix from Ilya Dryomov:
"Luis discovered a problem with the new copyfrom offload on the server
side. Disable it for now"
* tag 'ceph-for-4.20-rc7' of https://github.com/ceph/ceph-client:
ceph: make 'nocopyfrom' a default mount option
This patch reorders flow from
- update page
- set_page_dirty
- wait_on_page_writeback
to
- wait_on_page_writeback
- update page
- set_page_dirty
The reason is:
- set_page_dirty will increase reference of dirty page, the reference
should be cleared before wait_on_page_writeback to keep its consistency.
- some devices need stable page during page writebacking, so we
should not change page's data.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If IPU failed, nothing is commited, we should end page writeback.
Signed-off-by: Sheng Yong <shengyong1@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This adds an option in ioctl(F2FS_IOC_SHUTDOWN) in order to trigger fsck by
setting a NEED_FSCK flag.
Generally, shutdown is used for the test to validate filesystem consistency, and
setting NEED_FSCK flag can be used for Android to trigger fsck.f2fs at boot time
explicitly so that we could measure the elapsed time as well as force filesystem
check.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
proc_sys_lookup can fail with ENOMEM instead of ENOENT when the
corresponding sysctl table is being unregistered. In our case we see
this upon opening /proc/sys/net/*/conf files while network interfaces
are being deleted, which confuses our configuration daemon.
The problem was successfully reproduced and this fix tested on v4.9.122
and v4.20-rc6.
v2: return ERR_PTRs in all cases when proc_sys_make_inode fails instead
of mixing them with NULL. Thanks Al Viro for the feedback.
Fixes: ace0c791e6 ("proc/sysctl: Don't grab i_lock under sysctl_lock.")
Cc: stable@vger.kernel.org
Signed-off-by: Ivan Delalande <colona@arista.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
UBIFS's recovery code strictly assumes that a deleted inode will never
come back, therefore it removes all data which belongs to that inode
as soon it faces an inode with link count 0 in the replay list.
Before O_TMPFILE this assumption was perfectly fine. With O_TMPFILE
it can lead to data loss upon a power-cut.
Consider a journal with entries like:
0: inode X (nlink = 0) /* O_TMPFILE was created */
1: data for inode X /* Someone writes to the temp file */
2: inode X (nlink = 0) /* inode was changed, xattr, chmod, … */
3: inode X (nlink = 1) /* inode was re-linked via linkat() */
Upon replay of entry #2 UBIFS will drop all data that belongs to inode X,
this will lead to an empty file after mounting.
As solution for this problem, scan the replay list for a re-link entry
before dropping data.
Fixes: 474b93704f ("ubifs: Implement O_TMPFILE")
Cc: stable@vger.kernel.org
Cc: Russell Senior <russell@personaltelco.net>
Cc: Rafał Miłecki <zajec5@gmail.com>
Reported-by: Russell Senior <russell@personaltelco.net>
Reported-by: Rafał Miłecki <zajec5@gmail.com>
Tested-by: Rafał Miłecki <rafal@milecki.pl>
Signed-off-by: Richard Weinberger <richard@nod.at>
When ubifs is build without the LZO compressor and no compressor is
given the creation of the default file system will fail. before
selection the LZO compressor check if it is present and if not fall back
to the zlib or none.
Signed-off-by: Gabor Juhos <juhosg@openwrt.org>
Signed-off-by: Hauke Mehrtens <hauke@hauke-m.de>
Signed-off-by: Richard Weinberger <richard@nod.at>
If the call to ubifs_read_nnode() fails in ubifs_lpt_calc_hash() an
error is returned without freeing the memory allocated to 'buf'.
Read and check the root node before allocating the buffer.
Detected by CoverityScan, CID 1441025 ("Resource leak")
Signed-off-by: Garry McNulty <garrmcnu@gmail.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
The new authentication support causes a build failure
when CONFIG_KEYS is disabled, so add a dependency.
fs/ubifs/auth.c: In function 'ubifs_init_authentication':
fs/ubifs/auth.c:249:16: error: implicit declaration of function 'request_key'; did you mean 'request_irq'? [-Werror=implicit-function-declaration]
keyring_key = request_key(&key_type_logon, c->auth_key_name, NULL);
Fixes: d8a22773a1 ("ubifs: Enable authentication support")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Richard Weinberger <richard@nod.at>
Instead of adding yet another dependency on UBIFS_FS, wrap the whole
block of ubifs config options in a single "if UBIFS_FS".
Fixes: d8a22773a1 ("ubifs: Enable authentication support")
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Sascha Hauer <s.hauer@pengutronix.de>
Signed-off-by: Richard Weinberger <richard@nod.at>
Having two shash descriptors on the stack cause a very significant kernel
stack usage that can cross the warning threshold:
fs/ubifs/replay.c: In function 'authenticate_sleb':
fs/ubifs/replay.c:633:1: error: the frame size of 1144 bytes is larger than 1024 bytes [-Werror=frame-larger-than=]
Normally, gcc optimizes the out, but with CONFIG_CC_OPTIMIZE_FOR_DEBUGGING,
it does not. Splitting the two stack allocations into separate functions
means that they will use the same memory again. In normal configurations
(optimizing for size or performance), those should get inlined and we get
the same behavior as before.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Richard Weinberger <richard@nod.at>
Since mkfs always formats the filesystem with the realtime bitmap and
summary inodes immediately after the root directory, we should expect
that both of them are present and loadable, even if there isn't a
realtime volume attached. There's no reason to skip this if rbmino ==
NULLFSINO; in fact, this causes an immediate crash if the there /is/ a
realtime volume and someone writes to it.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCXBDMLgAKCRDh3BK/laaZ
PPuRAP0X4zYWFh3mcGlcjjfzaP2W/3F8nVsXjo+YADi9nJ+wAwD+LIeL7zGr8Mw8
EixiC+OJyL31O5ZOyHGoPEhhDz4O+Ao=
=hWRh
-----END PGP SIGNATURE-----
Merge tag 'ovl-fixes-4.20-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs
Pull overlayfs fixes from Miklos Szeredi:
"Needed to revert a patch, because it possibly introduces a security
hole. Since the patch is basically a conceptual cleanup, not a bug
fix, it's safe to revert. I'm not giving up on this, and discussions
seemed to have reached an agreement over how to move forward, but that
can wait 'till the next release.
The other two patches are fixes for bugs introduced in recent
releases"
* tag 'ovl-fixes-4.20-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
Revert "ovl: relax permission checking on underlying layers"
ovl: fix decode of dir file handle with multi lower layers
ovl: fix missing override creds in link of a metacopy upper
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCXBDKkAAKCRDh3BK/laaZ
PCXdAPwOWqLXpkBL76YaIbgFVzS+S5btlhHwVSZ0w/r7HGA3uQD+IgsHbky1MdSv
rYyKcg+lVzA7GI7tcoQUhC2D9aZ8tAQ=
=I0eL
-----END PGP SIGNATURE-----
Merge tag 'fuse-fixes-4.20-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
Pull fuse fixes from Miklos Szeredi:
"There's one patch fixing a minor but long lived bug, the others are
fixing regressions introduced in this cycle"
* tag 'fuse-fixes-4.20-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
fuse: continue to send FUSE_RELEASEDIR when FUSE_OPEN returns ENOSYS
fuse: Fix memory leak in fuse_dev_free()
fuse: fix revalidation of attributes for permission check
fuse: fix fsync on directory
fuse: Add bad inode check in fuse_destroy_inode()
The realtime summary is a two-dimensional array on disk, effectively:
u32 rsum[log2(number of realtime extents) + 1][number of blocks in the bitmap]
rsum[log][bbno] is the number of extents of size 2**log which start in
bitmap block bbno.
xfs_rtallocate_extent_near() uses xfs_rtany_summary() to check whether
rsum[log][bbno] != 0 for any log level. However, the summary array is
stored in row-major order (i.e., like an array in C), so all of these
entries are not adjacent, but rather spread across the entire summary
file. In the worst case (a full bitmap block), xfs_rtany_summary() has
to check every level.
This means that on a moderately-used realtime device, an allocation will
waste a lot of time finding, reading, and releasing buffers for the
realtime summary. In particular, one of our storage services (which runs
on servers with 8 very slow CPUs and 15 8 TB XFS realtime filesystems)
spends almost 5% of its CPU cycles in xfs_rtbuf_get() and
xfs_trans_brelse() called from xfs_rtany_summary().
One solution would be to also store the summary with the dimensions
swapped. However, this would require a disk format change to a very old
component of XFS.
Instead, we can cache the minimum size which contains any extents. We do
so lazily; rather than guaranteeing that the cache contains the precise
minimum, it always contains a loose lower bound which we tighten when we
read or update a summary block. This only uses a few kilobytes of memory
and is already serialized via the realtime bitmap and summary inode
locks, so the cost is minimal. With this change, the same workload only
spends 0.2% of its CPU cycles in the realtime allocator.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
A big block filesystem might require more than one inobt record to cover
all the inodes in the block. In these cases it is not correct to round
the irec count up to the nearest block because this causes us to
overestimate the number of inode blocks we expect to find.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Store the inode cluster alignment information in units of inodes and
blocks in the mount data so that we don't have to keep recalculating
them.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Store the number of inodes and blocks per inode cluster in the mount
data so that we don't have to keep recalculating them.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Add new helpers to convert units of fs blocks into inodes, and AG blocks
into AG inodes, respectively. Convert all the open-coded conversions
and XFS_OFFBNO_TO_AGINO(, , 0) calls to use them, as appropriate. The
OFFBNO_TO_AGINO macro is retained for xfs_repair.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Owner information for static fs metadata can be defined readonly at
build time because it never changes across filesystems. This enables us
to reduce stack usage (particularly in scrub) because we can use the
statically defined oinfo structures.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Only certain functions actually change the contents of an
xfs_owner_info; the rest can accept a const struct pointer. This will
enable us to save stack space by hoisting static owner info types to
be const global variables.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
There's no need to bundle a pointer to the defer op type into the defer
op control structure. Instead, store the defer op type enum, which
enables us to shorten some of the lines.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Recently, we forgot to port a new defer op type to xfsprogs, which
caused us some userspace pain. Reorganize the way we make libxfs
clients supply defer op type information so that all type information
has to be provided at build time instead of risky runtime dynamic
configuration.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
A log recovery failure has been reproduced where a symlink inode has
a zero length in extent form. It was caused by a shutdown during a
combined fstress+fsmark workload.
The underlying problem is the issue in xfs_inactive_symlink(): the
inode is unlocked between the symlink inactivation/truncation and
the inode being freed. This opens a window for the inode to be
written to disk before it xfs_ifree() removes it from the unlinked
list, marks it free in the inobt and zeros the mode.
For shortform inodes, the fix is simple. xfs_ifree() clears the data
fork state, so there's no need to do it in xfs_inactive_symlink().
This means the shortform fork verifier will not see a zero length
data fork as it mirrors the inode size through to xfs_ifree()), and
hence if the inode gets written back and the fork verifiers are run
they will still see a fork that matches the on-disk inode size.
For extent form (remote) symlinks, it is a little more tricky. Here
we explicitly set the inode size to zero, so the above race can lead
to zero length symlinks on disk. Because the inode is unlinked at
this point (i.e. on the unlinked list) and unreferenced, it can
never be seen again by a user. Hence when we set the inode size to
zeor, also change the type to S_IFREG. xfs_ifree() expects S_IFREG
inodes to be of zero length, and so this avoids all the problems of
zero length symlinks ever hitting the disk. It also avoids the
problem of needing to handle zero length symlink inodes in log
recovery to replay the extent free intents and the remaining
deferops to free the extents the symlink used.
Also add a couple of asserts to warn us if zero length symlinks end
up in either the symlink create or inactivation paths.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
There is a statement that has an unwanted space in the indentation.
Remove it.
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The function xfs_alloc_get_freelist calls xfs_perag_put to drop the
reference. However, pag->pagf_btreeblks is read and written after the
put operation. This patch moves the put operation later.
Signed-off-by: Pan Bian <bianpan2016@163.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
[darrick: minor changelog edits]
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
In xfs_reflink_end_cow, we allocate a single transaction for the entire
end_cow operation and then loop the CoW fork mappings to move them to
the data fork. This design fails on a heavily fragmented filesystem
where an inode's data fork has exactly one more extent than would fit in
an extents-format fork, because the unmap can collapse the data fork
into extents format (freeing the bmbt block) but the remap can expand
the data fork back into a (newly allocated) bmbt block. If the number
of extents we end up remapping is large, we can overflow the block
reservation because we reserved blocks assuming that we were adding
mappings into an already-cleared area of the data fork.
Let's say we have 8 extents in the data fork, 8 extents in the CoW fork,
and the data fork can hold at most 7 extents before needing to convert
to btree format; and that blocks A-P are discontiguous single-block
extents:
0......7
D: ABCDEFGH
C: IJKLMNOP
When a write to file blocks 0-7 completes, we must remap I-P into the
data fork. We start by removing H from the btree-format data fork. Now
we have 7 extents, so we convert the fork to extents format, freeing the
bmbt block. We then move P into the data fork and it now has 8 extents
again. We must convert the data fork back to btree format, requiring a
block allocation. If we repeat this sequence for blocks 6-5-4-3-2-1-0,
we'll need a total of 8 block allocations to remap all 8 blocks. We
reserved only enough blocks to handle one btree split (5 blocks on a 4k
block filesystem), which means we overflow the block reservation.
To fix this issue, create a separate helper function to remap a single
extent, and change _reflink_end_cow to call it in a tight loop over the
entire range we're completing. As a side effect this also removes the
size restrictions on how many extents we can end_cow at a time, though
nobody ever hit that. It is not reasonable to reserve N blocks to remap
N blocks.
Note that this can be reproduced after ~320 million fsx ops while
running generic/938 (long soak directio fsx exerciser):
XFS: Assertion failed: tp->t_blk_res >= tp->t_blk_res_used, file: fs/xfs/xfs_trans.c, line: 116
<machine registers snipped>
Call Trace:
xfs_trans_dup+0x211/0x250 [xfs]
xfs_trans_roll+0x6d/0x180 [xfs]
xfs_defer_trans_roll+0x10c/0x3b0 [xfs]
xfs_defer_finish_noroll+0xdf/0x740 [xfs]
xfs_defer_finish+0x13/0x70 [xfs]
xfs_reflink_end_cow+0x2c6/0x680 [xfs]
xfs_dio_write_end_io+0x115/0x220 [xfs]
iomap_dio_complete+0x3f/0x130
iomap_dio_rw+0x3c3/0x420
xfs_file_dio_aio_write+0x132/0x3c0 [xfs]
xfs_file_write_iter+0x8b/0xc0 [xfs]
__vfs_write+0x193/0x1f0
vfs_write+0xba/0x1c0
ksys_write+0x52/0xc0
do_syscall_64+0x50/0x160
entry_SYSCALL_64_after_hwframe+0x49/0xbe
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
When inode is corrupted so that extent type is invalid, some functions
(such as udf_truncate_extents()) will just BUG. Check that extent type
is valid when loading the inode to memory.
Reported-by: Anatoly Trosinenko <anatoly.trosinenko@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Make all the required change to start use the ib_device_ops structure.
Signed-off-by: Kamal Heib <kamalheib1@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
This patch is based on an idea from Steve Whitehouse. The idea is
to dump the number of pages for inodes in the glock dumps.
The additional locking required me to drop const from quite a few
places.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Fix the resource group wrap-around logic in gfs2_rbm_find that commit
e579ed4f44 broke. The bug can lead to unnecessary repeated scanning of the
same bitmaps; there is a risk that future changes will turn this into an
endless loop.
Fixes: e579ed4f44 ("GFS2: Introduce rbm field bii")
Cc: stable@vger.kernel.org # v3.13+
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
When FUSE_OPEN returns ENOSYS, the no_open bit is set on the connection.
Because the FUSE_RELEASE and FUSE_RELEASEDIR paths share code, this
incorrectly caused the FUSE_RELEASEDIR request to be dropped and never sent
to userspace.
Pass an isdir bool to distinguish between FUSE_RELEASE and FUSE_RELEASEDIR
inside of fuse_file_put.
Fixes: 7678ac5061 ("fuse: support clients that don't implement 'open'")
Cc: <stable@vger.kernel.org> # v3.14
Signed-off-by: Chad Austin <chadaustin@fb.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
In gfs2_create_inode, after setting and releasing the acl / default_acl, the
acl / default_acl pointers are not set to NULL as they should be. In that
state, when the function reaches label fail_free_acls, gfs2_create_inode will
try to release the same acls again.
Fix that by setting the pointers to NULL after releasing the acls. Slightly
simplify the logic. Also, posix_acl_release checks for NULL already, so
there is no need to duplicate those checks here.
Fixes: e01580bf9e ("gfs2: use generic posix ACL infrastructure")
Reported-by: Pan Bian <bianpan2016@163.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: stable@vger.kernel.org # v4.9+
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Field bd_ops was set but never used, so I removed it, and all
code supporting it.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Matthew pointed out that the ioctx_table is susceptible to spectre v1,
because the index can be controlled by an attacker. The below patch
should mitigate the attack for all of the aio system calls.
Reported-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Matthew pointed out that the ioctx_table is susceptible to spectre v1,
because the index can be controlled by an attacker. The below patch
should mitigate the attack for all of the aio system calls.
Cc: stable@vger.kernel.org
Reported-by: Matthew Wilcox <willy@infradead.org>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Since we found a problem with the 'copy-from' operation after objects have
been truncated, offloading object copies to OSDs should be discouraged
until the issue is fixed.
Thus, this patch adds the 'nocopyfrom' mount option to the default mount
options which effectily means that remote copies won't be done in
copy_file_range unless they are explicitly enabled at mount time.
[ Adjust ceph_show_options() accordingly. ]
Link: https://tracker.ceph.com/issues/37378
Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Use bio(s) to read in the journal sequentially in large chunks and
locate the head of the journal.
This version addresses the issues Christoph pointed out w.r.t error handling
and using deprecated API.
Signed-off-by: Abhi Das <adas@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>
Move and re-order the error checks and hash/crc computations into another
function __get_log_header() so it can be used in scenarios where buffer_heads
are not being used for the log header.
Signed-off-by: Abhi Das <adas@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Change gfs2_log_XXX_bio family of functions so they can be used
with different bios, not just sdp->sd_log_bio.
This patch also contains some clean up suggested by Andreas.
Signed-off-by: Abhi Das <adas@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Tells you how many milliseconds map_journal_extents and find_jhead
take.
Signed-off-by: Abhi Das <adas@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
The gfs2_is_ordered and gfs2_is_writeback checks are weird in that they
implicitly check for !gfs2_is_jdata. This makes understanding how to
use those functions correctly a challenge. Clean this up by making
gfs2_is_ordered and gfs2_is_writeback take a super block instead of an
inode and by removing the implicit !gfs2_is_jdata checks. Update the
callers accordingly.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Use the aptly named function rather than opencoding i_writecount check.
No functional changes.
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Jan Kara <jack@suse.cz>
prepare_bprm_creds is not used outside exec.c, so there's no reason for it
to have external linkage.
Signed-off-by: Chanho Min <chanho.min@lge.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
If new mode is the same as old mode we don't have to reset
inode mode in the rest of the code, so compare old and new
mode before setting update_mode flag.
Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
-----BEGIN PGP SIGNATURE-----
iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAlwNpb0eHHRvcnZhbGRz
QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGwGwH/00UHnXfxww3ixxz
zwTVDzptA6SPm6s84yJOWatM5fXhPiAltZaHSYF9lzRzNU71NCq7Frhq3fQUIXKM
OxqDn9nfSTWcjWTk2q5n2keyRV/KIn67YX7UgqFc1bO/mqtVjEgNWaMyblhI+e9E
giu1ZXayHr43jK1cDOmGExZubXUq7Vsc9TOlrd+d2SwIqeEP7TCMrPhnHDwCNvX2
UU5dtANpVzGtHaBcr37wJj+L8kODCc0f+PQ3g2ar5jTHst5SLlHp2u0AMRnUmgdi
VkGx+mu/uk8mtwUqMIMqhplklVoqK6LTeLqsY5Xt32SKruw9UqyJGdphLjW2QP/g
MkmA1lI=
=7kaD
-----END PGP SIGNATURE-----
Merge tag 'v4.20-rc6' into for-4.21/block
Pull in v4.20-rc6 to resolve the conflict in NVMe, but also to get the
two corruption fixes. We're going to be overhauling the direct dispatch
path, and we need to do that on top of the changes we made for that
in mainline.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAlwMfwoACgkQiiy9cAdy
T1FkqAv/Tty/Od1u3r/il26vogLunG9JvmC62upgfpb684isEOnYn0oaT9FElunB
C43FXF8Zw73zKv/9IqmLExNGZJ0Jm+5eCjnGX+e05QWN8ais91V2IW+4JMLwCga3
KWBD0V1SBa355cObpvWe1WzrIzm1lwa3aCO/lpENo7Ph+kZXcFVIuOPRRxwLOUU9
dkZCtvs5Pb8LVu2k1/jUXh1hqibUGR3V46txDMH6oomsLNV97xTC9URBCbpeWb3f
x4mpUeWP3DnUyK8TTr3W1MWh1jlUHg3/8yDHfCuxuluAgRFQpI8Eru18XE8abnkz
loFu9eLVT6iwJAlzmkXFF4cGi8LXqz3x4Z+1DopH2S2DBwedw9Tc1dptwOuUbpfg
SO4jMOaNXOUSnN+0pCdEiO1z/Nc0YaNg++5AvGRM0BRohpNk6X2COPocf9HvcFI8
xWajm+nkSsEaQNtXRAj/ku7cLovNgHTJt0KXQyCgxBVol19Gf5davHTe5+pPnsfQ
e+jdUek4
=wWgw
-----END PGP SIGNATURE-----
Merge tag '4.20-rc5-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fixes from Steve French:
"Three small fixes: a fix for smb3 direct i/o, a fix for CIFS DFS for
stable and a minor cifs Kconfig fix"
* tag '4.20-rc5-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
CIFS: Avoid returning EBUSY to upper layer VFS
cifs: Fix separator when building path from dentry
cifs: In Kconfig CONFIG_CIFS_POSIX needs depends on legacy (insecure cifs)
* Fix the Xarray conversion of fsdax to properly handle
dax_lock_mapping_entry() in the presense of pmd entries.
* Fix inode destruction racing a new lock request.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJcDLC5AAoJEB7SkWpmfYgC0yAP/2NULGPapIoR2aw7GKPOStHR
tOMGE8+ZDtpd19oVoEiB3PuhMkiCj5Pkyt2ka9eUXSk3VR/N6EtGwBANFUO+0tEh
yU8nca2T2G1f+MsZbraWA9zj9aPnfpLt46r7p/AjIB2j1/vyXkYMkXmkXtCvMEUi
3Q9dVsT53fWncJBlTe/WW84wWQp77VsVafN4/UP75dv8cN9F+u91P1xvUXu2AKus
RBqjlp6G6xMZ9HcWCQmStdPCi+qbO4k3oTZcohd/DYLb70ZL1kLoEMOjK2/3GnaV
YMHZGWRAK5jroDZbdXXnJHE+4/opZpSWhW/J1WFEJUSSr7YmZsWhwegmrBlQ5cp3
V0fZ7LjBWjgtUKsjlmfmu4ccLY87LzVlfIEtJH2+QNiLy/Pl/O0P/Jzg9kHVLEgo
zph2yIQlx3yshej8EEBPylDJVtYVzOyjLNLx7r3bL6AzQv+6CKqjxtYOiIDwYQsA
Db1InfeDzAAoH6NYSbqN+yG01Bz7zeCYT7Ch4rYSKcA1i8vUMR6iObI32fC3xJQm
+TxYmqngrJaU8XOMikIem/qGn4fBpYFULrIdK7lj/WmIxQq8Hp0mqz1ikMIoErzN
ZXV5RcbbvSSZi5EOrrVKbr2IfIcx+HYo/bIAkRCR69o8bcIZnLPHDky/7gl7GS/5
sc9A/kysseeF45L2+6QG
=kUil
-----END PGP SIGNATURE-----
Merge tag 'dax-fixes-4.20-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull dax fixes from Dan Williams:
"The last of the known regression fixes and fallout from the Xarray
conversion of the filesystem-dax implementation.
On the path to debugging why the dax memory-failure injection test
started failing after the Xarray conversion a couple more fixes for
the dax_lock_mapping_entry(), now called dax_lock_page(), surfaced.
Those plus the bug that started the hunt are now addressed. These
patches have appeared in a -next release with no issues reported.
Note the touches to mm/memory-failure.c are just the conversion to the
new function signature for dax_lock_page().
Summary:
- Fix the Xarray conversion of fsdax to properly handle
dax_lock_mapping_entry() in the presense of pmd entries
- Fix inode destruction racing a new lock request"
* tag 'dax-fixes-4.20-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
dax: Fix unlock mismatch with updated API
dax: Don't access a freed inode
dax: Check page->mapping isn't NULL
- Fix broken project quota inode counts
- Fix incorrect PAGE_MASK/PAGE_SIZE usage
- Fix incorrect return value in btree verifier
- Fix WARN_ON remap flags false positive
- Fix splice read overflows
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAlwGu/sACgkQ+H93GTRK
tOum2g/5AWN2OMMqbCYru3BSrp4my26/CgGgsLdfR5go+fKvSI8txAm5Wx8r2doP
aBQ0LuJaKm16bufpFx+LQDKvE/AV8yd3tL58ozPSbpPjdQFpl4DR3N95pURQJVhP
Yt3tDuPk6ODJ/pKYSbM0GTTyJ5jr+tUlKU6kmAqNzg1W644TWZVV9zFCVU202sj9
xZ6Auan3yb5wXk2vhgxPgHrVh5ngRm2G+/V8KxCVoWyP/lHUjgEcWJEu+/Jh8xOQ
6dAYxdfngQqdlXI3dpJDOkFHbqcSrhl8+fQEnp+g0RyWcKPDS+L6wk4Xlkhaorjy
nDB4GcnjCPHdJ8x/9pUlNmBrlXpAz4oNYqlQcJKssd5q9p+eTBH2drRDc5VDM5KU
xSHsdIIxQ1YewJ3uHcIbaYBsK+XcrWCtypUTC73GeGkLqS1qKCuKCEnMQOR3czeE
/Hyq6/eTSt2uG62MAOZGdIW4uaJsLTnXnfDElZ1YsdwtMEzKbYg1Ll2s86vudrRu
Otyl3EVdXQjtWRxrelA5eexspoJNoZS69D6Haqt8MGc2HJoJGTuparV6YlsEeGoW
bZFO8JV/Q0WFCa2cWRXVe+D+kx8fyFoDsKY1mR4Vwp5s0jQXp9/rQ81+zVjQ4wB2
TU53a4+hMLKLW5aNH3ge3yB9ZlHZhEzex6hrlZH3kuqpAM3Yj00=
=18EP
-----END PGP SIGNATURE-----
Merge tag 'xfs-4.20-fixes-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs fixes from Darrick Wong:
"Here are hopefully the last set of fixes for 4.20.
There's a fix for a longstanding statfs reporting problem with project
quotas, a correction for page cache invalidation behaviors when
fallocating near EOF, and a fix for a broken metadata verifier return
code.
Finally, the most important fix is to the pipe splicing code (aka the
generic copy_file_range fallback) to avoid pointless short directio
reads by only asking the filesystem for as much data as there are
available pages in the pipe buffer. Our previous fix (simulated short
directio reads because the number of pages didn't match the length of
the read requested) caused subtle problems on overlayfs, so that part
is reverted.
Anyhow, this series passes fstests -g all on xfs and overlay+xfs, and
has passed 17 billion fsx operations problem-free since I started
testing
Summary:
- Fix broken project quota inode counts
- Fix incorrect PAGE_MASK/PAGE_SIZE usage
- Fix incorrect return value in btree verifier
- Fix WARN_ON remap flags false positive
- Fix splice read overflows"
* tag 'xfs-4.20-fixes-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
iomap: partially revert 4721a60109 (simulated directio short read on EFAULT)
splice: don't read more than available pipe space
vfs: allow some remap flags to be passed to vfs_clone_file_range
xfs: fix inverted return from xfs_btree_sblock_verify_crc
xfs: fix PAGE_MASK usage in xfs_free_file_space
fs/xfs: fix f_ffree value for statfs when project quota is set
One of the goals of this series is to remove a separate reference to
the css of the bio. This can and should be accessed via bio_blkcg(). In
this patch, wbc_init_bio() now requires a bio to have a device
associated with it.
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
- spaces before tabs,
- spaces at the end of lines,
- multiple blank lines,
- blank lines before EXPORT_SYMBOL,
can all go.
Signed-off-by: NeilBrown <neilb@suse.com>
Reviewed-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
posix_unblock_lock() is not specific to posix locks, and behaves
nearly identically to locks_delete_block() - the former returning a
status while the later doesn't.
So discard posix_unblock_lock() and use locks_delete_block() instead,
after giving that function an appropriate return value.
Signed-off-by: NeilBrown <neilb@suse.com>
Reviewed-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
When we find an existing lock which conflicts with a request,
and the request wants to wait, we currently add the request
to a list. When the lock is removed, the whole list is woken.
This can cause the thundering-herd problem.
To reduce the problem, we make use of the (new) fact that
a pending request can itself have a list of blocked requests.
When we find a conflict, we look through the existing blocked requests.
If any one of them blocks the new request, the new request is attached
below that request, otherwise it is added to the list of blocked
requests, which are now known to be mutually non-conflicting.
This way, when the lock is released, only a set of non-conflicting
locks will be woken, the rest can stay asleep.
If the lock request cannot be granted and the request needs to be
requeued, all the other requests it blocks will then be woken
To make this more concrete:
If you have a many-core machine, and have many threads all wanting to
briefly lock a give file (udev is known to do this), you can get quite
poor performance.
When one thread releases a lock, it wakes up all other threads that
are waiting (classic thundering-herd) - one will get the lock and the
others go to sleep.
When you have few cores, this is not very noticeable: by the time the
4th or 5th thread gets enough CPU time to try to claim the lock, the
earlier threads have claimed it, done what was needed, and released.
So with few cores, many of the threads don't end up contending.
With 50+ cores, lost of threads can get the CPU at the same time,
and the contention can easily be measured.
This patchset creates a tree of pending lock requests in which siblings
don't conflict and each lock request does conflict with its parent.
When a lock is released, only requests which don't conflict with each
other a woken.
Testing shows that lock-acquisitions-per-second is now fairly stable
even as the number of contending process goes to 1000. Without this
patch, locks-per-second drops off steeply after a few 10s of
processes.
There is a small cost to this extra complexity.
At 20 processes running a particular test on 72 cores, the lock
acquisitions per second drops from 1.8 million to 1.4 million with
this patch. For 100 processes, this patch still provides 1.4 million
while without this patch there are about 700,000.
Reported-and-tested-by: Martin Wilck <mwilck@suse.de>
Signed-off-by: NeilBrown <neilb@suse.com>
Reviewed-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
posix_locks_conflict() and flock_locks_conflict() both return int.
leases_conflict() returns bool.
This inconsistency will cause problems for the next patch if not
fixed.
So change posix_locks_conflict() and flock_locks_conflict() to return
bool.
Also change the locks_conflict() helper.
And convert some
return (foo);
to
return foo;
Signed-off-by: NeilBrown <neilb@suse.com>
Reviewed-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Now that requests can block other requests, we
need to be careful to always clean up those blocked
requests.
Any time that we wait for a request, we might have
other requests attached, and when we stop waiting,
we must clean them up.
If the lock was granted, the requests might have been
moved to the new lock, though when merged with a
pre-exiting lock, this might not happen.
In all cases we don't want blocked locks to remain
attached, so we remove them to be safe.
Signed-off-by: NeilBrown <neilb@suse.com>
Reviewed-by: J. Bruce Fields <bfields@redhat.com>
Tested-by: syzbot+a4a3d526b4157113ec6a@syzkaller.appspotmail.com
Tested-by: kernel test robot <rong.a.chen@intel.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
EBUSY is not handled by VFS, and will be passed to user-mode. This is not
correct as we need to wait for more credits.
This patch also fixes a bug where rsize or wsize is used uninitialized when
the call to server->ops->wait_mtu_credits() fails.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Long Li <longli@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Highlights include:
Stable fixes:
- Fix a page leak when using RPCSEC_GSS/krb5p to encrypt data.
Bugfixes:
- Fix a regression that causes the RPC receive code to hang
- Fix call_connect_status() so that it handles tasks that got transmitted
while queued waiting for the socket lock.
- Fix a memory leak in call_encode()
- Fix several other connect races.
- Fix receive code error handling.
- Use the discard iterator rather than MSG_TRUNC for compatibility with
AF_UNIX/AF_LOCAL sockets.
- nfs: don't dirty kernel pages read by direct-io
- pnfs/Flexfiles fix to enforce per-mirror stateid only for NFSv4 data
servers
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJcCWIOAAoJEA4mA3inWBJc3BsP/i/VXd0ZSxxL8i/++qCR1KGT
/p0+t2HbrhPzb3jKmuaBe/6T6bLMbpmkwbesA6cHENkaPiOqxPhxLsJlh4o2BHwg
NcjAbbov/hkakFAHlp69KqiL7DZe8YEqQE8GlUnn+3C3RM3i2TSRQ3AGXUH22P2a
MY5fqiub2PmEwe2UZR8BzIEQd5w60AzTNXzQb181/+SCTOPdJTKneh0Tw54lD4d6
vWKhi64cyQxQxshCvrX6IpcNWu9qwm7qDGQ3rDAg0whunve4YGtTz1suRUk888M4
VfNxA8skFZuaQS/UU6oek2xaeMlSzEoJQXimKLYTEJKoqf7sWxfNLAfqHwnfyo4T
Yab3cfVRs5KgEltVZyodb9oVQd6KI13hYeT+vXubz2kq1Ode4NJCnzgEefOP0hNV
ENDal0hqBrfjfVIkpg/wfgRJln/W4Y/U0oPPm50eJJxa0ZKTfftBWo4me5DwCFF9
0/XhPdFWTvZsYjmSGRC1RsaSrzUvO+wFo3tKQ2lQqf8QP3ix9ZtGQHN+h8RN9SxK
ti5OxTMsfM3jYg7+yu4yOAQkcCcoaDA37+JztpuUSlMRfNss8uM7cQKsQ4WQf6Nr
24At5Wr/ib7hVkAQ5oB98UWh5q1ZLzmmHhzsf8KacTSNcfjgu0H0DmKtm3CfThFK
xfTHotzM3IqbUXRZQ7++
=M/mt
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-4.20-5' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client bugfixes from Trond Myklebust:
"This is mainly fallout from the updates to the SUNRPC code that is
being triggered from less common combinations of NFS mount options.
Highlights include:
Stable fixes:
- Fix a page leak when using RPCSEC_GSS/krb5p to encrypt data.
Bugfixes:
- Fix a regression that causes the RPC receive code to hang
- Fix call_connect_status() so that it handles tasks that got
transmitted while queued waiting for the socket lock.
- Fix a memory leak in call_encode()
- Fix several other connect races.
- Fix receive code error handling.
- Use the discard iterator rather than MSG_TRUNC for compatibility
with AF_UNIX/AF_LOCAL sockets.
- nfs: don't dirty kernel pages read by direct-io
- pnfs/Flexfiles fix to enforce per-mirror stateid only for NFSv4
data servers"
* tag 'nfs-for-4.20-5' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
SUNRPC: Don't force a redundant disconnection in xs_read_stream()
SUNRPC: Fix up socket polling
SUNRPC: Use the discard iterator rather than MSG_TRUNC
SUNRPC: Treat EFAULT as a truncated message in xs_read_stream_request()
SUNRPC: Fix up handling of the XDRBUF_SPARSE_PAGES flag
SUNRPC: Fix RPC receive hangs
SUNRPC: Fix a potential race in xprt_connect()
SUNRPC: Fix a memory leak in call_encode()
SUNRPC: Fix leak of krb5p encode pages
SUNRPC: call_connect_status() must handle tasks that got transmitted
nfs: don't dirty kernel pages read by direct-io
flexfiles: enforce per-mirror stateid only for v4 DSes
struct timespec is not y2038 safe.
struct __kernel_timespec is the new y2038 safe structure for all
syscalls that are using struct timespec.
Update io_pgetevents interfaces to use struct __kernel_timespec.
sigset_t also has different representations on 32 bit and 64 bit
architectures. Hence, we need to support the following different
syscalls:
New y2038 safe syscalls:
(Controlled by CONFIG_64BIT_TIME for 32 bit ABIs)
Native 64 bit(unchanged) and native 32 bit : sys_io_pgetevents
Compat : compat_sys_io_pgetevents_time64
Older y2038 unsafe syscalls:
(Controlled by CONFIG_32BIT_COMPAT_TIME for 32 bit ABIs)
Native 32 bit : sys_io_pgetevents_time32
Compat : compat_sys_io_pgetevents
Note that io_getevents syscalls do not have a y2038 safe solution.
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
struct timespec is not y2038 safe.
struct __kernel_timespec is the new y2038 safe structure for all
syscalls that are using struct timespec.
Update pselect interfaces to use struct __kernel_timespec.
sigset_t also has different representations on 32 bit and 64 bit
architectures. Hence, we need to support the following different
syscalls:
New y2038 safe syscalls:
(Controlled by CONFIG_64BIT_TIME for 32 bit ABIs)
Native 64 bit(unchanged) and native 32 bit : sys_pselect6
Compat : compat_sys_pselect6_time64
Older y2038 unsafe syscalls:
(Controlled by CONFIG_32BIT_COMPAT_TIME for 32 bit ABIs)
Native 32 bit : pselect6_time32
Compat : compat_sys_pselect6
Note that all other versions of select syscalls will not have
y2038 safe versions.
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
struct timespec is not y2038 safe.
struct __kernel_timespec is the new y2038 safe structure for all
syscalls that are using struct timespec.
Update ppoll interfaces to use struct __kernel_timespec.
sigset_t also has different representations on 32 bit and 64 bit
architectures. Hence, we need to support the following different
syscalls:
New y2038 safe syscalls:
(Controlled by CONFIG_64BIT_TIME for 32 bit ABIs)
Native 64 bit(unchanged) and native 32 bit : sys_ppoll
Compat : compat_sys_ppoll_time64
Older y2038 unsafe syscalls:
(Controlled by CONFIG_32BIT_COMPAT_TIME for 32 bit ABIs)
Native 32 bit : ppoll_time32
Compat : compat_sys_ppoll
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Refactor the logic to restore the sigmask before the syscall
returns into an api.
This is useful for versions of syscalls that pass in the
sigmask and expect the current->sigmask to be changed during
the execution and restored after the execution of the syscall.
With the advent of new y2038 syscalls in the subsequent patches,
we add two more new versions of the syscalls (for pselect, ppoll
and io_pgetevents) in addition to the existing native and compat
versions. Adding such an api reduces the logic that would need to
be replicated otherwise.
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Refactor reading sigset from userspace and updating sigmask
into an api.
This is useful for versions of syscalls that pass in the
sigmask and expect the current->sigmask to be changed during,
and restored after, the execution of the syscall.
With the advent of new y2038 syscalls in the subsequent patches,
we add two more new versions of the syscalls (for pselect, ppoll,
and io_pgetevents) in addition to the existing native and compat
versions. Adding such an api reduces the logic that would need to
be replicated otherwise.
Note that the calls to sigprocmask() ignored the return value
from the api as the function only returns an error on an invalid
first argument that is hardcoded at these call sites.
The updated logic uses set_current_blocked() instead.
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Make sure to use the CIFS_DIR_SEP(cifs_sb) as path separator for
prefixpath too. Fixes a bug with smb1 UNIX extensions.
Fixes: a6b5058faf ("fs/cifs: make share unaccessible at root level mountable")
Signed-off-by: Paulo Alcantara <palcantara@suse.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
CC: Stable <stable@vger.kernel.org>
Missing a dependency. Shouldn't show cifs posix extensions
in Kconfig if CONFIG_CIFS_ALLOW_INSECURE_DIALECTS (ie SMB1
protocol) is disabled.
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAlwH0k8ACgkQxWXV+ddt
WDtmVg/+Kgvk7laQI9bLEr1/30eG1JfBUMHcVE1F8+g99l28m1Yihjd21j9norVd
YexBz53jgKou+zV+37CKWBYT1uDPq7CIoxctkdE2j9U0s+RmsqDrhech0dsBsfMR
jo9VnHJFuJSxGMhjfGnFV+wMtAr4q5aQptNGBl+hR1MvMneroktFv+0WiLmp0Vhj
+6Iq9WAClJYpgk//cI7nhKkscdzWwRyN3V9RUtdNeYklk1D7l1WprlaPzw22WA9u
VjQVMICjEaJeIixIwT/D8lz05QgjKlqy1z6faYG5JuJxoYQikuNv/xe2dhZVm35A
aNsBR0byf3zzuXKQZAlvXJ6/gYPvep+KI7epPyBOdycaqoZza7rQ+/MkSAgQ77Vk
yBnQuhqiw9Srjh6LDWFkNclVln2wymRKd1SqpZmFPRZre/8L+DU+I8RRaeS2/WcE
M2k+awRD0oVofbB+hxkFIoR+I1Ggkp2rxQlTT/41tGx0geWC3AGX+TlKSW6ZM5HD
lRmRXIsVocfighKEnI3Zy7ecZuwCI4/4D6+PQtyhCJb3tDigZ/a4UEYdSVucG8CG
SuQ5YMn+MyyKT0wH8xkGKDGT15YZ+u9Q/BmPHZRL6sSouFpiCQHA5miD1YA+t1d9
qMjH6Ycz46Y3j2M0BDfDcm714zoD5/bgeSy5SPC3Zh5lQCGpeIk=
=VW/F
-----END PGP SIGNATURE-----
Merge tag 'for-4.20-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fix from David Sterba:
"A patch in 4.19 introduced a sanity check that was too strict and a
filesystem cannot be mounted.
This happens for filesystems with more than 10 devices and has been
reported by a few users so we need the fix to propagate to stable"
* tag 'for-4.20-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: tree-checker: Don't check max block group size as current max chunk size limit is unreliable
As a precaution, make sure we check event_len when copying to userspace.
Based on old feedback: https://lkml.kernel.org/r/542D9FE5.3010009@gmx.de
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Internal to dax_unlock_mapping_entry(), dax_unlock_entry() is used to
store a replacement entry in the Xarray at the given xas-index with the
DAX_LOCKED bit clear. When called, dax_unlock_entry() expects the unlocked
value of the entry relative to the current Xarray state to be specified.
In most contexts dax_unlock_entry() is operating in the same scope as
the matched dax_lock_entry(). However, in the dax_unlock_mapping_entry()
case the implementation needs to recall the original entry. In the case
where the original entry is a 'pmd' entry it is possible that the pfn
performed to do the lookup is misaligned to the value retrieved in the
Xarray.
Change the api to return the unlock cookie from dax_lock_page() and pass
it to dax_unlock_page(). This fixes a bug where dax_unlock_page() was
assuming that the page was PMD-aligned if the entry was a PMD entry with
signatures like:
WARNING: CPU: 38 PID: 1396 at fs/dax.c:340 dax_insert_entry+0x2b2/0x2d0
RIP: 0010:dax_insert_entry+0x2b2/0x2d0
[..]
Call Trace:
dax_iomap_pte_fault.isra.41+0x791/0xde0
ext4_dax_huge_fault+0x16f/0x1f0
? up_read+0x1c/0xa0
__do_fault+0x1f/0x160
__handle_mm_fault+0x1033/0x1490
handle_mm_fault+0x18b/0x3d0
Link: https://lkml.kernel.org/r/20181130154902.GL10377@bombadil.infradead.org
Fixes: 9f32d22130 ("dax: Convert dax_lock_mapping_entry to XArray")
Reported-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Tested-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
As the man(2) page for utime/utimes states, EPERM is returned when the
second parameter of utime or utimes is not NULL, the caller's effective UID
does not match the owner of the file, and the caller is not privileged.
However, in a NFS directory mounted from knfsd, it will return EACCES
(from nfsd_setattr-> fh_verify->nfsd_permission). This patch fixes
that.
Signed-off-by: zhengbin <zhengbin13@huawei.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
In commit 4721a60109, we tried to fix a problem wherein directio reads
into a splice pipe will bounce EFAULT/EAGAIN all the way out to
userspace by simulating a zero-byte short read. This happens because
some directio read implementations (xfs) will call
bio_iov_iter_get_pages to grab pipe buffer pages and issue asynchronous
reads, but as soon as we run out of pipe buffers that _get_pages call
returns EFAULT, which the splice code translates to EAGAIN and bounces
out to userspace.
In that commit, the iomap code catches the EFAULT and simulates a
zero-byte read, but that causes assertion errors on regular splice reads
because xfs doesn't allow short directio reads. This causes infinite
splice() loops and assertion failures on generic/095 on overlayfs
because xfs only permit total success or total failure of a directio
operation. The underlying issue in the pipe splice code has now been
fixed by changing the pipe splice loop to avoid avoid reading more data
than there is space in the pipe.
Therefore, it's no longer necessary to simulate the short directio, so
remove the hack from iomap.
Fixes: 4721a60109 ("iomap: dio data corruption and spurious errors when pipes fill")
Reported-by: Murphy Zhou <jencce.kernel@gmail.com>
Ranted-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
In commit 4721a60109, we tried to fix a problem wherein directio reads
into a splice pipe will bounce EFAULT/EAGAIN all the way out to
userspace by simulating a zero-byte short read. This happens because
some directio read implementations (xfs) will call
bio_iov_iter_get_pages to grab pipe buffer pages and issue asynchronous
reads, but as soon as we run out of pipe buffers that _get_pages call
returns EFAULT, which the splice code translates to EAGAIN and bounces
out to userspace.
In that commit, the iomap code catches the EFAULT and simulates a
zero-byte read, but that causes assertion errors on regular splice reads
because xfs doesn't allow short directio reads.
The brokenness is compounded by splice_direct_to_actor immediately
bailing on do_splice_to returning <= 0 without ever calling ->actor
(which empties out the pipe), so if userspace calls back we'll EFAULT
again on the full pipe, and nothing ever gets copied.
Therefore, teach splice_direct_to_actor to clamp its requests to the
amount of free space in the pipe and remove the simulated short read
from the iomap directio code.
Fixes: 4721a60109 ("iomap: dio data corruption and spurious errors when pipes fill")
Reported-by: Murphy Zhou <jencce.kernel@gmail.com>
Ranted-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
In overlayfs, ovl_remap_file_range calls vfs_clone_file_range on the
lower filesystem's inode, passing through whatever remap flags it got
from its caller. Since vfs_copy_file_range first tries a filesystem's
remap function with REMAP_FILE_CAN_SHORTEN, this can get passed through
to the second vfs_copy_file_range call, and this isn't an issue.
Change the WARN_ON to look only for the DEDUP flag.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
xfs_btree_sblock_verify_crc is a bool so should not be returning
a failaddr_t; worse, if xfs_log_check_lsn fails it returns
__this_address which looks like a boolean true (i.e. success)
to the caller.
(interestingly xfs_btree_lblock_verify_crc doesn't have the issue)
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
In commit e53c4b598, I *tried* to teach xfs to force writeback when we
fzero/fpunch right up to EOF so that if EOF is in the middle of a page,
the post-EOF part of the page gets zeroed before we return to userspace.
Unfortunately, I missed the part where PAGE_MASK is ~(PAGE_SIZE - 1),
which means that we totally fail to zero if we're fpunching and EOF is
within the first page. Worse yet, the same PAGE_MASK thinko plagues the
filemap_write_and_wait_range call, so we'd initiate writeback of the
entire file, which (mostly) masked the thinko.
Drop the tricky PAGE_MASK and replace it with correct usage of PAGE_SIZE
and the proper rounding macros.
Fixes: e53c4b598 ("xfs: ensure post-EOF zeroing happens after zeroing part of a file")
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
No one is going to poll for aio (yet), so we must clear the HIPRI
flag, as we would otherwise send it down the poll queues, where no
one will be polling for completions.
Signed-off-by: Christoph Hellwig <hch@lst.de>
IOCB_HIPRI, not RWF_HIPRI.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAlwEZdIeHHRvcnZhbGRz
QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGAlQH/19oax2Za3IPqF4X
DM3lal5M6zlUVkoYstqzpbR3MqUwgEnMfvoeMDC6mI9N4/+r2LkV7cRR8HzqQCCS
jDfD69IzRGb52VSeJmbOrkxBWsR1Nn0t4Z3rEeLPxwaOoNpRc8H973MbAQ2FKMpY
S4Y3jIK1dNiRRxdh52NupVkQF+djAUwkBuVk/rrvRJmTDij4la03cuCDAO+Di9lt
GHlVvygKw2SJhDR+z3ArwZNmE0ceCcE6+W7zPHzj2KeWuKrZg22kfUD454f2YEIw
FG0hu9qecgtpYCkLSm2vr4jQzmpsDoyq3ZfwhjGrP4qtvPC3Db3vL3dbQnkzUcJu
JtwhVCE=
=O1q1
-----END PGP SIGNATURE-----
Merge tag 'v4.20-rc5' into for-4.21/block
Pull in v4.20-rc5, solving a conflict we'll otherwise get in aio.c and
also getting the merge fix that went into mainline that users are
hitting testing for-4.21/block and/or for-next.
* tag 'v4.20-rc5': (664 commits)
Linux 4.20-rc5
PCI: Fix incorrect value returned from pcie_get_speed_cap()
MAINTAINERS: Update linux-mips mailing list address
ocfs2: fix potential use after free
mm/khugepaged: fix the xas_create_range() error path
mm/khugepaged: collapse_shmem() do not crash on Compound
mm/khugepaged: collapse_shmem() without freezing new_page
mm/khugepaged: minor reorderings in collapse_shmem()
mm/khugepaged: collapse_shmem() remember to clear holes
mm/khugepaged: fix crashes due to misaccounted holes
mm/khugepaged: collapse_shmem() stop if punched or truncated
mm/huge_memory: fix lockdep complaint on 32-bit i_size_read()
mm/huge_memory: splitting set mapping+index before unfreeze
mm/huge_memory: rename freeze_page() to unmap_page()
initramfs: clean old path before creating a hardlink
kernel/kcov.c: mark funcs in __sanitizer_cov_trace_pc() as notrace
psi: make disabling/enabling easier for vendor kernels
proc: fixup map_files test on arm
debugobjects: avoid recursive calls with kmemleak
userfaultfd: shmem: UFFDIO_COPY: set the page dirty if VM_WRITE is not set
...
Revert commit c22397888f "exec: make de_thread() freezable" as
requested by Ingo Molnar:
"So there's a new regression in v4.20-rc4, my desktop produces this
lockdep splat:
[ 1772.588771] WARNING: pkexec/4633 still has locks held!
[ 1772.588773] 4.20.0-rc4-custom-00213-g93a49841322b #1 Not tainted
[ 1772.588775] ------------------------------------
[ 1772.588776] 1 lock held by pkexec/4633:
[ 1772.588778] #0: 00000000ed85fbf8 (&sig->cred_guard_mutex){+.+.}, at: prepare_bprm_creds+0x2a/0x70
[ 1772.588786] stack backtrace:
[ 1772.588789] CPU: 7 PID: 4633 Comm: pkexec Not tainted 4.20.0-rc4-custom-00213-g93a49841322b #1
[ 1772.588792] Call Trace:
[ 1772.588800] dump_stack+0x85/0xcb
[ 1772.588803] flush_old_exec+0x116/0x890
[ 1772.588807] ? load_elf_phdrs+0x72/0xb0
[ 1772.588809] load_elf_binary+0x291/0x1620
[ 1772.588815] ? sched_clock+0x5/0x10
[ 1772.588817] ? search_binary_handler+0x6d/0x240
[ 1772.588820] search_binary_handler+0x80/0x240
[ 1772.588823] load_script+0x201/0x220
[ 1772.588825] search_binary_handler+0x80/0x240
[ 1772.588828] __do_execve_file.isra.32+0x7d2/0xa60
[ 1772.588832] ? strncpy_from_user+0x40/0x180
[ 1772.588835] __x64_sys_execve+0x34/0x40
[ 1772.588838] do_syscall_64+0x60/0x1c0
The warning gets triggered by an ancient lockdep check in the freezer:
(gdb) list *0xffffffff812ece06
0xffffffff812ece06 is in flush_old_exec (./include/linux/freezer.h:57).
52 * DO NOT ADD ANY NEW CALLERS OF THIS FUNCTION
53 * If try_to_freeze causes a lockdep warning it means the caller may deadlock
54 */
55 static inline bool try_to_freeze_unsafe(void)
56 {
57 might_sleep();
58 if (likely(!freezing(current)))
59 return false;
60 return __refrigerator(false);
61 }
I reviewed the ->cred_guard_mutex code, and the mutex is held across all
of exec() - and we always did this.
But there's this recent -rc4 commit:
> Chanho Min (1):
> exec: make de_thread() freezable
c22397888f: exec: make de_thread() freezable
I believe this commit is bogus, you cannot call try_to_freeze() from
de_thread(), because it's holding the ->cred_guard_mutex."
Reported-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
[BUG]
A completely valid btrfs will refuse to mount, with error message like:
BTRFS critical (device sdb2): corrupt leaf: root=2 block=239681536 slot=172 \
bg_start=12018974720 bg_len=10888413184, invalid block group size, \
have 10888413184 expect (0, 10737418240]
This has been reported several times as the 4.19 kernel is now being
used. The filesystem refuses to mount, but is otherwise ok and booting
4.18 is a workaround.
Btrfs check returns no error, and all kernels used on this fs is later
than 2011, which should all have the 10G size limit commit.
[CAUSE]
For a 12 devices btrfs, we could allocate a chunk larger than 10G due to
stripe stripe bump up.
__btrfs_alloc_chunk()
|- max_stripe_size = 1G
|- max_chunk_size = 10G
|- data_stripe = 11
|- if (1G * 11 > 10G) {
stripe_size = 976128930;
stripe_size = round_up(976128930, SZ_16M) = 989855744
However the final stripe_size (989855744) * 11 = 10888413184, which is
still larger than 10G.
[FIX]
For the comprehensive check, we need to do the full check at chunk read
time, and rely on bg <-> chunk mapping to do the check.
We could just skip the length check for now.
Fixes: fce466eab7 ("btrfs: tree-checker: Verify block_group_item")
Cc: stable@vger.kernel.org # v4.19+
Reported-by: Wang Yugui <wangyugui@e16-tech.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This reverts commit 007ea44892.
The commit broke some selinux-testsuite cases, and it looks like there's no
straightforward fix keeping the direction of this patch, so revert for now.
The original patch was trying to fix the consistency of permission checks, and
not an observed bug. So reverting should be safe.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Pull RCU changes from Paul E. McKenney:
- Convert RCU's BUG_ON() and similar calls to WARN_ON() and similar.
- Replace calls of RCU-bh and RCU-sched update-side functions
to their vanilla RCU counterparts. This series is a step
towards complete removal of the RCU-bh and RCU-sched update-side
functions.
( Note that some of these conversions are going upstream via their
respective maintainers. )
- Documentation updates, including a number of flavor-consolidation
updates from Joel Fernandes.
- Miscellaneous fixes.
- Automate generation of the initrd filesystem used for
rcutorture testing.
- Convert spin_is_locked() assertions to instead use lockdep.
( Note that some of these conversions are going upstream via their
respective maintainers. )
- SRCU updates, especially including a fix from Dennis Krein
for a bag-on-head-class bug.
- RCU torture-test updates.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Commit e2b911c535 ("ext4: clean up feature test macros with
predicate functions") broke the EXT4_IOC_GROUP_ADD ioctl. This was
not noticed since only very old versions of resize2fs (before
e2fsprogs 1.42) use this ioctl. However, using a new kernel with an
enterprise Linux userspace will cause attempts to use online resize to
fail with "No reserved GDT blocks".
Fixes: e2b911c535 ("ext4: clean up feature test macros with predicate...")
Cc: stable@kernel.org # v4.4
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: ruippan (潘睿) <ruippan@tencent.com>
As dax inches closer to production use, an administrator should not
be surprised by silently disabling the feature they asked for.
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
ext4_xattr_destroy_cache() can handle NULL pointer correctly,
so there is no need to check NULL pointer before calling
ext4_xattr_destroy_cache().
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
There is a statement that is indented with spaces, replace it with
a tab.
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
There are several lines that are indented too far, clean these
up by removing the tabs.
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
In case of error, ext4_try_to_write_inline_data() should unlock
and release the page it holds.
Fixes: f19d5870cb ("ext4: add normal write support for inline data")
Cc: stable@kernel.org # 3.8
Signed-off-by: Maurizio Lombardi <mlombard@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
The function frees qf_inode via iput but then pass qf_inode to
lockdep_set_quota_inode on the failure path. This may result in a
use-after-free bug. The patch frees df_inode only when it is never used.
Fixes: daf647d2dd ("ext4: add lockdep annotations for i_data_sem")
Cc: stable@kernel.org # 4.6
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Pan Bian <bianpan2016@163.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
We can hold j_state_lock for writing at the beginning of
jbd2_journal_commit_transaction() for a rather long time (reportedly for
30 ms) due cleaning revoke bits of all revoked buffers under it. The
handling of revoke tables as well as cleaning of t_reserved_list, and
checkpoint lists does not need j_state_lock for anything. It is only
needed to prevent new handles from joining the transaction. Generally
T_LOCKED transaction state prevents new handles from joining the
transaction - except for reserved handles which have to allowed to join
while we wait for other handles to complete.
To prevent reserved handles from joining the transaction while cleaning
up lists, add new transaction state T_SWITCH and watch for it when
starting reserved handles. With this we can just drop the lock for
operations that don't need it.
Reported-and-tested-by: Adrian Hunter <adrian.hunter@intel.com>
Suggested-by: "Theodore Y. Ts'o" <tytso@mit.edu>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Given corruption in the ftrace records, it might be possible to allocate
tmp_prz without assigning prz to it, but still marking it as needing to
be freed, which would cause at least a NULL dereference.
smatch warnings:
fs/pstore/ram.c:340 ramoops_pstore_read() error: we previously assumed 'prz' could be null (see line 255)
https://lists.01.org/pipermail/kbuild-all/2018-December/055528.html
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Fixes: 2fbea82bbb ("pstore: Merge per-CPU ftrace records into one")
Cc: "Joel Fernandes (Google)" <joel@joelfernandes.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Instead of running with interrupts disabled, use a semaphore. This should
make it easier for backends that may need to sleep (e.g. EFI) when
performing a write:
|BUG: sleeping function called from invalid context at kernel/sched/completion.c:99
|in_atomic(): 1, irqs_disabled(): 1, pid: 2236, name: sig-xstate-bum
|Preemption disabled at:
|[<ffffffff99d60512>] pstore_dump+0x72/0x330
|CPU: 26 PID: 2236 Comm: sig-xstate-bum Tainted: G D 4.20.0-rc3 #45
|Call Trace:
| dump_stack+0x4f/0x6a
| ___might_sleep.cold.91+0xd3/0xe4
| __might_sleep+0x50/0x90
| wait_for_completion+0x32/0x130
| virt_efi_query_variable_info+0x14e/0x160
| efi_query_variable_store+0x51/0x1a0
| efivar_entry_set_safe+0xa3/0x1b0
| efi_pstore_write+0x109/0x140
| pstore_dump+0x11c/0x330
| kmsg_dump+0xa4/0xd0
| oops_exit+0x22/0x30
...
Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Fixes: 21b3ddd39f ("efi: Don't use spinlocks for efi vars")
Signed-off-by: Kees Cook <keescook@chromium.org>
Bool initializations should use true and false. Bool tests don't need
comparisons.
Signed-off-by: Thomas Meyer <thomas@m3y3r.de>
Signed-off-by: Kees Cook <keescook@chromium.org>
The ramoops backend currently calls persistent_ram_save_old() even
if a buffer is empty. While this appears to work, it is does not seem
like the right thing to do and could lead to future bugs so lets avoid
that. It also prevents misleading prints in the logs which claim the
buffer is valid.
I got something like:
found existing buffer, size 0, start 0
When I was expecting:
no valid data in buffer (sig = ...)
This bails out early (and reports with pr_debug()), since it's an
acceptable state.
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Co-developed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
(1) remove type argument from ramoops_get_next_prz()
Since we store the type of the prz when we initialize it, we no longer
need to pass it again in ramoops_get_next_prz() since we can just use
that to setup the pstore record. So lets remove it from the argument list.
(2) remove max argument from ramoops_get_next_prz()
Looking at the code flow, the 'max' checks are already being done on
the prz passed to ramoops_get_next_prz(). Lets remove it to simplify
this function and reduce its arguments.
(3) further reduce ramoops_get_next_prz() arguments by passing record
Both the id and type fields of a pstore_record are set by
ramoops_get_next_prz(). So we can just pass a pointer to the pstore_record
instead of passing individual elements. This results in cleaner more
readable code and fewer lines.
In addition lets also remove the 'update' argument since we can detect
that. Changes are squashed into a single patch to reduce fixup conflicts.
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
In later patches we will need to map types to names, so create a
constant table for that which can also be used in different parts of
old and new code. This saves the type in the PRZ which will be useful
in later patches.
Instead of having an explicit PSTORE_TYPE_UNKNOWN, just use ..._MAX.
This includes removing the now redundant filename templates which can use
a single format string. Also, there's no reason to limit the "is it still
compressed?" test to only PSTORE_TYPE_DMESG when building the pstorefs
filename. Records are zero-initialized, so a backend would need to have
explicitly set compressed=1.
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Co-developed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
This improves and updates some comments:
- dump handler comment out of sync from calling convention
- fix kern-doc typo
and improves status output:
- reminder that only kernel crash dumps are compressed
- do not be silent about ECC infrastructure failures
Signed-off-by: Kees Cook <keescook@chromium.org>
In order to more easily perform automated regression testing, this
adds pr_debug() calls to report each prz allocation which can then be
verified against persistent storage. Specifically, seeing the dividing
line between header, data, any ECC bytes. (And the general assignment
output is updated to remove the bogus ECC blocksize which isn't actually
recorded outside the prz instance.)
Signed-off-by: Kees Cook <keescook@chromium.org>
With both ram.c and ram_core.c built into ramoops.ko, it doesn't make
sense to have differing pr_fmt prefixes. This fixes ram_core.c to use
the module name (as ram.c already does). Additionally improves region
reservation error to include the region name.
Signed-off-by: Kees Cook <keescook@chromium.org>
When initialing a prz, if invalid data is found (no PERSISTENT_RAM_SIG),
the function call path looks like this:
ramoops_init_prz ->
persistent_ram_new -> persistent_ram_post_init -> persistent_ram_zap
persistent_ram_zap
As we can see, persistent_ram_zap() is called twice.
We can avoid this by adding an option to persistent_ram_new(), and
only call persistent_ram_zap() when it is needed.
Signed-off-by: Peng Wang <wangpeng15@xiaomi.com>
[kees: minor tweak to exit path and commit log]
Signed-off-by: Kees Cook <keescook@chromium.org>
Since the console writer does not use the preallocated crash dump buffer
any more, there is no reason to perform locking around it.
Fixes: 70ad35db33 ("pstore: Convert console write to use ->write_buf")
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
The pre-allocated compression buffer used for crash dumping was also
being used for decompression. This isn't technically safe, since it's
possible the kernel may attempt a crashdump while pstore is populating the
pstore filesystem (and performing decompression). Instead, just allocate
a separate buffer for decompression. Correctness is preferred over
performance here.
Signed-off-by: Kees Cook <keescook@chromium.org>
The warning added in commit 3b0e761ba8
"dlm: print log message when cluster name is not set"
did not account for the fact that lockspaces created
from userland do not supply a cluster name, so bogus
warnings are printed every time a userland lockspace
is created.
Signed-off-by: David Teigland <teigland@redhat.com>
Let the passed in array be const (and thus placed in rodata) instead of
a mutable array of const pointers.
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20181004143750.30880-1-jani.nikula@intel.com
NULL check before some freeing functions is not needed.
Signed-off-by: Thomas Meyer <thomas@m3y3r.de>
Signed-off-by: David Teigland <teigland@redhat.com>
fuse_invalidate_attr() now sets fi->inval_mask instead of fi->i_time, hence
we need to check the inval mask in fuse_permission() as well.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: 2f1e81965f ("fuse: allow fine grained attr cache invaldation")
Commit ab2257e994 ("fuse: reduce size of struct fuse_inode") moved parts
of fields related to writeback on regular file and to directory caching
into a union. However fuse_fsync_common() called from fuse_dir_fsync()
touches some writeback related fields, resulting in a crash.
Move writeback related parts from fuse_fsync_common() to fuse_fysnc().
Reported-by: Brett Girton <btgirton@gmail.com>
Tested-by: Brett Girton <btgirton@gmail.com>
Fixes: ab2257e994 ("fuse: reduce size of struct fuse_inode")
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Since commit bb21ce0ad2 we always enforce per-mirror stateid.
However, this makes sense only for v4+ servers.
Signed-off-by: Tigran Mkrtchyan <tigran.mkrtchyan@desy.de>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
jffs2_sync_fs makes the assumption that if CONFIG_JFFS2_FS_WRITEBUFFER
is defined then a write buffer is available and has been initialized.
However, this does is not the case when the mtd device has no
out-of-band buffer:
int jffs2_nand_flash_setup(struct jffs2_sb_info *c)
{
if (!c->mtd->oobsize)
return 0;
...
The resulting call to cancel_delayed_work_sync passing a uninitialized
(but zeroed) delayed_work struct forces lockdep to become disabled.
[ 90.050639] overlayfs: upper fs does not support tmpfile.
[ 90.652264] INFO: trying to register non-static key.
[ 90.662171] the code is fine but needs lockdep annotation.
[ 90.673090] turning off the locking correctness validator.
[ 90.684021] CPU: 0 PID: 1762 Comm: mount_root Not tainted 4.14.63 #0
[ 90.696672] Stack : 00000000 00000000 80d8f6a2 00000038 805f0000 80444600 8fe364f4 805dfbe7
[ 90.713349] 80563a30 000006e2 8068370c 00000001 00000000 00000001 8e2fdc48 ffffffff
[ 90.730020] 00000000 00000000 80d90000 00000000 00000106 00000000 6465746e 312e3420
[ 90.746690] 6b636f6c 03bf0000 f8000000 20676e69 00000000 80000000 00000000 8e2c2a90
[ 90.763362] 80d90000 00000001 00000000 8e2c2a90 00000003 80260dc0 08052098 80680000
[ 90.780033] ...
[ 90.784902] Call Trace:
[ 90.789793] [<8000f0d8>] show_stack+0xb8/0x148
[ 90.798659] [<8005a000>] register_lock_class+0x270/0x55c
[ 90.809247] [<8005cb64>] __lock_acquire+0x13c/0xf7c
[ 90.818964] [<8005e314>] lock_acquire+0x194/0x1dc
[ 90.828345] [<8003f27c>] flush_work+0x200/0x24c
[ 90.837374] [<80041dfc>] __cancel_work_timer+0x158/0x210
[ 90.847958] [<801a8770>] jffs2_sync_fs+0x20/0x54
[ 90.857173] [<80125cf4>] iterate_supers+0xf4/0x120
[ 90.866729] [<80158fc4>] sys_sync+0x44/0x9c
[ 90.875067] [<80014424>] syscall_common+0x34/0x58
Signed-off-by: Daniel Santos <daniel.santos@pobox.com>
Reviewed-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Boris Brezillon <boris.brezillon@bootlin.com>
bug.2018.11.12a: Get rid of BUG_ON() and friends
consolidate.2018.12.01a: Continued RCU flavor-consolidation cleanup
doc.2018.11.12a: Documentation updates
fixes.2018.11.12a: Miscellaneous fixes
initrd.2018.11.08b: Automate creation of rcutorture initrd
sil.2018.11.12a: Remove more spin_unlock_wait() calls
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAlwC1c4QHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgppxmD/4pqn8REEh/QUXWhCJbOXLLLxfQju7Uxs/v
j2Bc6W/e7Z9jvKAs06IIhaV6SxBrM0oUebf/hJY0E/kTSHiNPJqx/X3W9hFYOo+p
EJau3vavOrxVzgq5zt8S/i//HeanT+H37nE9WDqSRKXTta8JFDw+DoysepILTUvN
WGDjuplPcurwmf2W1qES+5vNy/Jpln9ErNuqPBSjc6shozQ8WAzvuupVs+uZEpeK
+gqrx0pJYrtoU+pSUK+Bt6bSzzp8Z0qHGIVMAabNULbz43qblK0ILRE+qLFbFwsB
62EMMtX9b2Lsvqpoe2cQ+deQlUalsGVmpyE+7GP/evZbVmtD/NoH6cJQ/dA/tFtw
cluL3rWBJKB5OZ1yatDE2/rUYsGo5FzqMUz/tIWSf2FdZcLfhRNLka7DueSA6NQe
wtLJU9GrME67+t+PqncjDxoyQYma4oynAcc5dfqlBQv5OP7HDf4TP28g8FdkHjcy
fEXAp58516YZiCpoWZf6dPR9fUQ0A1eF+qxHnUacy5tHN4AKPrccU3+k+0WStFNf
qaOPkj4kWtv17d2DO4UoqAtBqFO16QCYSsa5+drpDeTOq9QgGqA6O+sGngN0LsxS
F7x3msgBIkgEFYFtpuMBXnamdooiZMKrzI0Ctn7PK8b5Qx1OgRNCZcTQD4uql1Fj
L6R/6Ynibg==
=lMlT
-----END PGP SIGNATURE-----
Merge tag 'for-linus-20181201' of git://git.kernel.dk/linux-block
Pull block layer fixes from Jens Axboe:
- Single range elevator discard merge fix, that caused crashes (Ming)
- Fix for a regression in O_DIRECT, where we could potentially lose the
error value (Maximilian Heyne)
- NVMe pull request from Christoph, with little fixes all over the map
for NVMe.
* tag 'for-linus-20181201' of git://git.kernel.dk/linux-block:
block: fix single range discard merge
nvme-rdma: fix double freeing of async event data
nvme: flush namespace scanning work just before removing namespaces
nvme: warn when finding multi-port subsystems without multipathing enabled
fs: fix lost error code in dio_complete
nvme-pci: fix surprise removal
nvme-fc: initialize nvme_req(rq)->ctrl after calling __nvme_fc_init_request()
nvme: Free ctrl device name on init failure
Merge misc fixes from Andrew Morton:
"31 fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (31 commits)
ocfs2: fix potential use after free
mm/khugepaged: fix the xas_create_range() error path
mm/khugepaged: collapse_shmem() do not crash on Compound
mm/khugepaged: collapse_shmem() without freezing new_page
mm/khugepaged: minor reorderings in collapse_shmem()
mm/khugepaged: collapse_shmem() remember to clear holes
mm/khugepaged: fix crashes due to misaccounted holes
mm/khugepaged: collapse_shmem() stop if punched or truncated
mm/huge_memory: fix lockdep complaint on 32-bit i_size_read()
mm/huge_memory: splitting set mapping+index before unfreeze
mm/huge_memory: rename freeze_page() to unmap_page()
initramfs: clean old path before creating a hardlink
kernel/kcov.c: mark funcs in __sanitizer_cov_trace_pc() as notrace
psi: make disabling/enabling easier for vendor kernels
proc: fixup map_files test on arm
debugobjects: avoid recursive calls with kmemleak
userfaultfd: shmem: UFFDIO_COPY: set the page dirty if VM_WRITE is not set
userfaultfd: shmem: add i_size checks
userfaultfd: shmem/hugetlbfs: only allow to register VM_MAYWRITE vmas
userfaultfd: shmem: allocate anonymous memory for MAP_PRIVATE shmem
...
-----BEGIN PGP SIGNATURE-----
iQIVAwUAXAFfSPu3V2unywtrAQKPQRAAiHDs2d35Kc2qkTFLwGiP+wr+3+7Cyz7A
hrWAvR7Oe7nBFVPmp6pwEnpBhf3mPsWlQpw3ZKZPo4fDQyRX+mDFC+2C7hkU1Q/J
BkjTG4vYn1jiQGlL3SD1PfUxcWfwzoK4cz+V3hnFY5y0dsKiBZBR1Lw5G+UkaCnD
4VaC3VAG56Vh14o5qSF3TWLZFyZ+JN6YA/M/DnwRPl8y4jnj1tJLs1DjdpEcWv6r
15FKb2FRYaC7MRehpXd22JX6fv5ii2xazU3IfLucBrb4Vj+wAJrBY4wA3x/CFkAa
as1VmxLkgoJEWa3M71tQOJBC8+QqkRb++PRUI3aadt2H4hXHfx3AmBuKkVroeS8o
0BDhWGiTW4AqXUajkQcTc/mKV2x6h83V3DLyBRL1iC3+7qaBVhPNtxW+v6ln0Ce1
FRG2I9LZp+RtWrVVyIPsa03V2V5OD7PTIBXK6TYtuqL+3uu7TNNc+UySvqDHWLL+
Zo2ogpq//kZbjMdntNVhDEj12LW3zG05dtNuFEeJeuPM28yiXXtoWDmI49RAUQ4v
RN6SwEXnKWehwG+YITYavV6gfHWlXdZ7grgCMHyViF/s9khBp7AGxbRzR0JXgXqL
ko1Ojpbq2mdvjwGFQfde4MAqAxM3FPxdxGVLrgi+lgGTsEKv6IzrTo28teyAM81O
D6cH0ldY90w=
=6y+F
-----END PGP SIGNATURE-----
Merge tag 'fscache-fixes-20181130' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
Pull fscache and cachefiles fixes from David Howells:
"Misc fixes:
- Fix an assertion failure at fs/cachefiles/xattr.c:138 caused by a
race between a cache object lookup failing and someone attempting
to reenable that object, thereby triggering an update of the
object's attributes.
- Fix an assertion failure at fs/fscache/operation.c:449 caused by a
split atomic subtract and atomic read that allows a race to happen.
- Fix a leak of backing pages when simultaneously reading the same
page from the same object from two or more threads.
- Fix a hang due to a race between a cache object being discarded and
the corresponding cookie being reenabled.
There are also some minor cleanups:
- Cast an enum value to a different enum type to prevent clang from
generating a warning. This shouldn't cause any sort of change in
the emitted code.
- Use ktime_get_real_seconds() instead of get_seconds(). This is just
used to uniquify a filename for an object to be placed in the
graveyard. Objects placed there are deleted by cachfilesd in
userspace immediately thereafter.
- Remove an initialised, but otherwise unused variable. This should
have been entirely optimised away anyway"
* tag 'fscache-fixes-20181130' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
fscache, cachefiles: remove redundant variable 'cache'
cachefiles: avoid deprecated get_seconds()
cachefiles: Explicitly cast enumerated type in put_object
fscache: fix race between enablement and dropping of object
cachefiles: Fix page leak in cachefiles_read_backing_file while vmscan is active
fscache: Fix race in fscache_op_complete() due to split atomic_sub & read
cachefiles: Fix an assertion failure when trying to update a failed object
ocfs2_get_dentry() calls iput(inode) to drop the reference count of
inode, and if the reference count hits 0, inode is freed. However, in
this function, it then reads inode->i_generation, which may result in a
use after free bug. Move the put operation later.
Link: http://lkml.kernel.org/r/1543109237-110227-1-git-send-email-bianpan2016@163.com
Fixes: 781f200cb7a("ocfs2: Remove masklog ML_EXPORT.")
Signed-off-by: Pan Bian <bianpan2016@163.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Changwei Ge <ge.changwei@h3c.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
After the VMA to register the uffd onto is found, check that it has
VM_MAYWRITE set before allowing registration. This way we inherit all
common code checks before allowing to fill file holes in shmem and
hugetlbfs with UFFDIO_COPY.
The userfaultfd memory model is not applicable for readonly files unless
it's a MAP_PRIVATE.
Link: http://lkml.kernel.org/r/20181126173452.26955-4-aarcange@redhat.com
Fixes: ff62a34210 ("hugetlb: implement memfd sealing")
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Hugh Dickins <hughd@google.com>
Reported-by: Jann Horn <jannh@google.com>
Fixes: 4c27fe4c4c ("userfaultfd: shmem: add shmem_mcopy_atomic_pte for userfaultfd support")
Cc: <stable@vger.kernel.org>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
hfs_bmap_free() frees node via hfs_bnode_put(node). However it then
reads node->this when dumping error message on an error path, which may
result in a use-after-free bug. This patch frees node only when it is
never used.
Link: http://lkml.kernel.org/r/1543053441-66942-1-git-send-email-bianpan2016@163.com
Signed-off-by: Pan Bian <bianpan2016@163.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Ernesto A. Fernandez <ernesto.mnd.fernandez@gmail.com>
Cc: Joe Perches <joe@perches.com>
Cc: Viacheslav Dubeyko <slava@dubeyko.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
hfs_bmap_free() frees the node via hfs_bnode_put(node). However, it
then reads node->this when dumping error message on an error path, which
may result in a use-after-free bug. This patch frees the node only when
it is never again used.
Link: http://lkml.kernel.org/r/1542963889-128825-1-git-send-email-bianpan2016@163.com
Fixes: a1185ffa2fc ("HFS rewrite")
Signed-off-by: Pan Bian <bianpan2016@163.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Joe Perches <joe@perches.com>
Cc: Ernesto A. Fernandez <ernesto.mnd.fernandez@gmail.com>
Cc: Viacheslav Dubeyko <slava@dubeyko.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
ocfs2_defrag_extent may fall into deadlock.
ocfs2_ioctl_move_extents
ocfs2_ioctl_move_extents
ocfs2_move_extents
ocfs2_defrag_extent
ocfs2_lock_allocators_move_extents
ocfs2_reserve_clusters
inode_lock GLOBAL_BITMAP_SYSTEM_INODE
__ocfs2_flush_truncate_log
inode_lock GLOBAL_BITMAP_SYSTEM_INODE
As backtrace shows above, ocfs2_reserve_clusters() will call inode_lock
against the global bitmap if local allocator has not sufficient cluters.
Once global bitmap could meet the demand, ocfs2_reserve_cluster will
return success with global bitmap locked.
After ocfs2_reserve_cluster(), if truncate log is full,
__ocfs2_flush_truncate_log() will definitely fall into deadlock because
it needs to inode_lock global bitmap, which has already been locked.
To fix this bug, we could remove from
ocfs2_lock_allocators_move_extents() the code which intends to lock
global allocator, and put the removed code after
__ocfs2_flush_truncate_log().
ocfs2_lock_allocators_move_extents() is referred by 2 places, one is
here, the other does not need the data allocator context, which means
this patch does not affect the caller so far.
Link: http://lkml.kernel.org/r/20181101071422.14470-1-lchen@suse.com
Signed-off-by: Larry Chen <lchen@suse.com>
Reviewed-by: Changwei Ge <ge.changwei@h3c.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull vfs fixes from Al Viro:
"Assorted fixes all over the place.
The iov_iter one is this cycle regression (splice from UDP triggering
WARN_ON()), the rest is older"
* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
afs: Use d_instantiate() rather than d_add() and don't d_drop()
afs: Fix missing net error handling
afs: Fix validation/callback interaction
iov_iter: teach csum_and_copy_to_iter() to handle pipe-backed ones
exportfs: do not read dentry after free
exportfs: fix 'passing zero to ERR_PTR()' warning
aio: fix failure to put the file pointer
sysv: return 'err' instead of 0 in __sysv_write_inode
Currently, a lock can block pending requests, but all pending
requests are equal. If lots of pending requests are
mutually exclusive, this means they will all be woken up
and all but one will fail. This can hurt performance.
So we will allow pending requests to block other requests.
Only the first request will be woken, and it will wake the others.
This patch doesn't implement this fully, but prepares the way.
- It acknowledges that a request might be blocking other requests,
and when the request is converted to a lock, those blocked
requests are moved across.
- When a request is requeued or discarded, all blocked requests are
woken.
- When deadlock-detection looks for the lock which blocks a
given request, we follow the chain of ->fl_blocker all
the way to the top.
Tested-by: kernel test robot <rong.a.chen@intel.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Reviewed-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Both locks_remove_posix() and locks_remove_flock() use a
struct file_lock without calling locks_init_lock() on it.
This means the various list_heads are not initialized, which
will become a problem with a later patch.
So change them both to initialize properly. For flock locks,
this involves using flock_make_lock(), and changing it to
allow a file_lock to be passed in, so memory allocation isn't
always needed.
Signed-off-by: NeilBrown <neilb@suse.com>
Reviewed-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Rather than assuming all-zeros is sufficient, use the available API to
initialize the file_lock structure use for unlock. VFS-level changes
will soon make it important that the list_heads in file_lock are
always properly initialized.
Signed-off-by: NeilBrown <neilb@suse.com>
Reviewed-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Rather than assuming all-zeros is sufficient, use the available API to
initialize the file_lock structure use for unlock. VFS-level changes
will soon make it important that the list_heads in file_lock are
always properly initialized.
Signed-off-by: NeilBrown <neilb@suse.com>
Reviewed-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Using memcpy() to copy lock requests leaves the various
list_head in an inconsistent state.
As we will soon attach a list of conflicting request to
another pending request, we need these lists to be consistent.
So change NFS to use locks_init_lock/locks_copy_lock instead
of memcpy.
Signed-off-by: NeilBrown <neilb@suse.com>
Reviewed-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
This functionality will be useful in future patches, so
split it out from locks_wake_up_blocks().
Signed-off-by: NeilBrown <neilb@suse.com>
Reviewed-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
struct file lock contains an 'fl_next' pointer which
is used to point to the lock that this request is blocked
waiting for. So rename it to fl_blocker.
The fl_blocked list_head in an active lock is the head of a list of
blocked requests. In a request it is a node in that list.
These are two distinct uses, so replace with two list_heads
with different names.
fl_blocked_requests is the head of a list of blocked requests
fl_blocked_member is a node in a member of that list.
The two different list_heads are never used at the same time, but that
will change in a future patch.
Note that a tracepoint is changed to report fl_blocker instead
of fl_next.
Signed-off-by: NeilBrown <neilb@suse.com>
Reviewed-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Variable 'cache' is being assigned but is never used hence it is
redundant and can be removed.
Cleans up clang warning:
warning: variable 'cache' set but not used [-Wunused-but-set-variable]
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David Howells <dhowells@redhat.com>
get_seconds() returns an unsigned long can overflow on some architectures
and is deprecated because of that. In cachefs, we cast that number to
a a 32-bit integer, which will overflow in year 2106 on all architectures.
As confirmed by David Howells, the overflow probably isn't harmful
in the end, since the timestamps are only used to make the file names
unique, but they don't strictly have to be in monotonically increasing
order since the files only exist in order to be deleted as quickly
as possible.
Moving to ktime_get_real_seconds() avoids the deprecated interface.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David Howells <dhowells@redhat.com>
Clang warns when one enumerated type is implicitly converted to another.
fs/cachefiles/namei.c:247:50: warning: implicit conversion from
enumeration type 'enum cachefiles_obj_ref_trace' to different
enumeration type 'enum fscache_obj_ref_trace' [-Wenum-conversion]
cache->cache.ops->put_object(&xobject->fscache,
cachefiles_obj_put_wait_retry);
Silence this warning by explicitly casting to fscache_obj_ref_trace,
which is also done in put_object.
Reported-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
It was observed that a process blocked indefintely in
__fscache_read_or_alloc_page(), waiting for FSCACHE_COOKIE_LOOKING_UP
to be cleared via fscache_wait_for_deferred_lookup().
At this time, ->backing_objects was empty, which would normaly prevent
__fscache_read_or_alloc_page() from getting to the point of waiting.
This implies that ->backing_objects was cleared *after*
__fscache_read_or_alloc_page was was entered.
When an object is "killed" and then "dropped",
FSCACHE_COOKIE_LOOKING_UP is cleared in fscache_lookup_failure(), then
KILL_OBJECT and DROP_OBJECT are "called" and only in DROP_OBJECT is
->backing_objects cleared. This leaves a window where
something else can set FSCACHE_COOKIE_LOOKING_UP and
__fscache_read_or_alloc_page() can start waiting, before
->backing_objects is cleared
There is some uncertainty in this analysis, but it seems to be fit the
observations. Adding the wake in this patch will be handled correctly
by __fscache_read_or_alloc_page(), as it checks if ->backing_objects
is empty again, after waiting.
Customer which reported the hang, also report that the hang cannot be
reproduced with this fix.
The backtrace for the blocked process looked like:
PID: 29360 TASK: ffff881ff2ac0f80 CPU: 3 COMMAND: "zsh"
#0 [ffff881ff43efbf8] schedule at ffffffff815e56f1
#1 [ffff881ff43efc58] bit_wait at ffffffff815e64ed
#2 [ffff881ff43efc68] __wait_on_bit at ffffffff815e61b8
#3 [ffff881ff43efca0] out_of_line_wait_on_bit at ffffffff815e625e
#4 [ffff881ff43efd08] fscache_wait_for_deferred_lookup at ffffffffa04f2e8f [fscache]
#5 [ffff881ff43efd18] __fscache_read_or_alloc_page at ffffffffa04f2ffe [fscache]
#6 [ffff881ff43efd58] __nfs_readpage_from_fscache at ffffffffa0679668 [nfs]
#7 [ffff881ff43efd78] nfs_readpage at ffffffffa067092b [nfs]
#8 [ffff881ff43efda0] generic_file_read_iter at ffffffff81187a73
#9 [ffff881ff43efe50] nfs_file_read at ffffffffa066544b [nfs]
#10 [ffff881ff43efe70] __vfs_read at ffffffff811fc756
#11 [ffff881ff43efee8] vfs_read at ffffffff811fccfa
#12 [ffff881ff43eff18] sys_read at ffffffff811fda62
#13 [ffff881ff43eff50] entry_SYSCALL_64_fastpath at ffffffff815e986e
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: David Howells <dhowells@redhat.com>
commit e259221763 ("fs: simplify the
generic_write_sync prototype") reworked callers of generic_write_sync(),
and ended up dropping the error return for the directio path. Prior to
that commit, in dio_complete(), an error would be bubbled up the stack,
but after that commit, errors passed on to dio_complete were eaten up.
This was reported on the list earlier, and a fix was proposed in
https://lore.kernel.org/lkml/20160921141539.GA17898@infradead.org/, but
never followed up with. We recently hit this bug in our testing where
fencing io errors, which were previously erroring out with EIO, were
being returned as success operations after this commit.
The fix proposed on the list earlier was a little short -- it would have
still called generic_write_sync() in case `ret` already contained an
error. This fix ensures generic_write_sync() is only called when there's
no pending error in the write. Additionally, transferred is replaced
with ret to bring this code in line with other callers.
Fixes: e259221763 ("fs: simplify the generic_write_sync prototype")
Reported-by: Ravi Nankani <rnankani@amazon.com>
Signed-off-by: Maximilian Heyne <mheyne@amazon.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
CC: Torsten Mehlan <tomeh@amazon.de>
CC: Uwe Dannowski <uwed@amazon.de>
CC: Amit Shah <aams@amazon.de>
CC: David Woodhouse <dwmw@amazon.co.uk>
CC: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The bio referencing has a trick that doesn't do any actual atomic
inc/dec on the reference count until we have to elevator to > 1. For the
async IO O_DIRECT case, we can't use the simple DIO variants, so we use
__blkdev_direct_IO(). It always grabs an extra reference to the bio
after allocation, which means we then enter the slower path of actually
having to do atomic_inc/dec on the count.
We don't need to do that for the async case, unless we end up going
multi-bio, in which case we're already doing huge amounts of IO. For the
smaller IO case (< BIO_MAX_PAGES), we can do without the extra ref.
Based on an earlier patch (and commit log) from Jens Axboe.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use d_instantiate() rather than d_add() and don't d_drop() in
afs_vnode_new_inode(). The dentry shouldn't be removed as it's not
changing its name.
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
kAFS can be given certain network errors (EADDRNOTAVAIL, EHOSTDOWN and
ERFKILL) that it doesn't handle in its server/address rotation algorithms.
They cause the probing and rotation to abort immediately rather than
rotating.
Fix this by:
(1) Abstracting out the error prioritisation from the VL and FS rotation
algorithms into a common function and expand usage into the server
probing code.
When multiple errors are available, this code selects the one we'd
prefer to return.
(2) Add handling for EADDRNOTAVAIL, EHOSTDOWN and ERFKILL.
Fixes: 0fafdc9f88 ("afs: Fix file locking")
Fixes: 0338747d8454 ("afs: Probe multiple fileservers simultaneously")
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
When afs_validate() is called to validate a vnode (inode), there are two
unhandled cases in the fastpath at the top of the function:
(1) If the vnode is promised (AFS_VNODE_CB_PROMISED is set), the break
counters match and the data has expired, then there's an implicit case
in which the vnode needs revalidating.
This has no consequences since the default "valid = false" set at the
top of the function happens to do the right thing.
(2) If the vnode is not promised and it hasn't been deleted
(AFS_VNODE_DELETED is not set) then there's a default case we're not
handling in which the vnode is invalid. If the vnode is invalid, we
need to bring cb_s_break and cb_v_break up to date before we refetch
the status.
As a consequence, once the server loses track of the client
(ie. sufficient time has passed since we last sent it an operation),
it will send us a CB.InitCallBackState* operation when we next try to
talk to it. This calls afs_init_callback_state() which increments
afs_server::cb_s_break, but this then doesn't propagate to the
afs_vnode record.
The result being that every afs_validate() call thereafter sends a
status fetch operation to the server.
Clarify and fix this by:
(A) Setting valid in all the branches rather than initialising it at the
top so that the compiler catches where we've missed.
(B) Restructuring the logic in the 'promised' branch so that we set valid
to false if the callback is due to expire (or has expired) and so that
the final case is that the vnode is still valid.
(C) Adding an else-statement that ups cb_s_break and cb_v_break if the
promised and deleted cases don't match.
Fixes: c435ee3455 ("afs: Overhaul the callback handling")
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The synchronize_rcu() in namespace_unlock() is called every time
a filesystem is unmounted. If a great many filesystems are mounted,
this can cause a noticable slow-down in, for example, system shutdown.
The sequence:
mkdir -p /tmp/Mtest/{0..5000}
time for i in /tmp/Mtest/*; do mount -t tmpfs tmpfs $i ; done
time umount /tmp/Mtest/*
on a 4-cpu VM can report 8 seconds to mount the tmpfs filesystems, and
100 seconds to unmount them.
Boot the same VM with 1 CPU and it takes 18 seconds to mount the
tmpfs filesystems, but only 36 to unmount.
If we change the synchronize_rcu() to synchronize_rcu_expedited()
the umount time on a 4-cpu VM drop to 0.6 seconds
I think this 200-fold speed up is worth the slightly high system
impact of using synchronize_rcu_expedited().
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> (from general rcu perspective)
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The actual number of bytes stored in a PRZ is smaller than the
bytes requested by platform data, since there is a header on each
PRZ. Additionally, if ECC is enabled, there are trailing bytes used
as well. Normally this mismatch doesn't matter since PRZs are circular
buffers and the leading "overflow" bytes are just thrown away. However, in
the case of a compressed record, this rather badly corrupts the results.
This corruption was visible with "ramoops.mem_size=204800 ramoops.ecc=1".
Any stored crashes would not be uncompressable (producing a pstorefs
"dmesg-*.enc.z" file), and triggering errors at boot:
[ 2.790759] pstore: crypto_comp_decompress failed, ret = -22!
Backporting this depends on commit 70ad35db33 ("pstore: Convert console
write to use ->write_buf")
Reported-by: Joel Fernandes <joel@joelfernandes.org>
Fixes: b0aad7a99c ("pstore: Add compression support to pstore")
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAlv/mEEACgkQnJ2qBz9k
QNm0kggA8SckBccdUZqpjxohASx9crp+jJeor2TRYCwyVfqdbDjkCOfv9k7+ddD4
D1qN/AEudQXPzr/DrjI0W9fnFG957FLo60RQ0aGwsq/3wfmzfMJ0qXIw2tyoqjIH
VxFecL2f3TKMr5zU9N1QoxJVAMAo5LOGbXO+qNYgTaOJ0EBpw9kxK14ib9JOi+pQ
CItZkAv4eOWMsA/AaRsWN7AGmsziwcMdoAZYRT2wiWdTmubYn66Negnffq2jVHjP
/kYlFjNcgpsuTg1AiTM+JlBwctEeLm37PUyimip3cdYwu6HcfNmCHDjEIzN2YphJ
twzAauTEr0bjNFiNWYraUPVHEcspNw==
=kKDs
-----END PGP SIGNATURE-----
Merge tag 'fixes_for_v4.20-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull ext2 and udf fixes from Jan Kara:
"Three small ext2 and udf fixes"
* tag 'fixes_for_v4.20-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
ext2: fix potential use after free
ext2: initialize opts.s_mount_opt as zero before using it
udf: Allow mounting volumes with incorrect identification strings
Trivial fix to clean up indentation, add in missing tabs.
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
__cld_pipe_upcall() emits a "do not call blocking ops when
!TASK_RUNNING" warning due to the dput() call in rpc_queue_upcall().
Fix it by using a completion instead of hand coding the wait.
Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Anatoly Trosinenko reports that this:
1) Checkout fresh master Linux branch (tested with commit e195ca6cb)
2) Copy x84_64-config-4.14 to .config, then enable NFS server v4 and build
3) From `kvm-xfstests shell`:
results in NULL dereference in locks_end_grace.
Check that nfsd has been started before trying to end the grace period.
Reported-by: Anatoly Trosinenko <anatoly.trosinenko@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
After we drop the i_pages lock, the inode can be freed at any time.
The get_unlocked_entry() code has no choice but to reacquire the lock,
so it can't be used here. Create a new wait_entry_unlocked() which takes
care not to acquire the lock or dereference the address_space in any way.
Fixes: c2a7d2a115 ("filesystem-dax: Introduce dax_lock_mapping_entry()")
Cc: <stable@vger.kernel.org>
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
If we race with inode destroy, it's possible for page->mapping to be
NULL before we even enter this routine, as well as after having slept
waiting for the dax entry to become unlocked.
Fixes: c2a7d2a115 ("filesystem-dax: Introduce dax_lock_mapping_entry()")
Cc: <stable@vger.kernel.org>
Reported-by: Jan Kara <jack@suse.cz>
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAlv9qYgACgkQxWXV+ddt
WDupPQ/8DdeLZYQG1tlx2Q+X4/tqPVyAzUjguYzbIY7wvSs1zbEEedENsD8E97yC
So8ooGnP5B6/dqVidLFQBPwTXN59GybYbrDci8qh0DOJTl3+1r8byD9JC+iofrOF
tltJkZ+eCOQyyqHHzlzw15uNOg48Qzj1oXvTAcE0P6iN5UcvcfwRW/S39pjsn63C
63zc09XJ1hmJMJTWZo5h3GoD2UvzrwGXPKXNdv/NWkw9sqQbWdjvZFdqKbvY1VeM
Oa6FPAPErJqEEEePhpDYbyRcnzjJRMs0deLGpGGChGldQxgMO8ILzBwh/KalfzK7
h7LIuv1EclUqlyv0mXPqg2E/C3n2UMPqQYFsK9Lt+4Y/PkrWA2jx0lSg0fBl3k8c
7PyiTqPNPNF8LU48tPEnOzJuNPkquOycgdyQOUpHnS43OF5OLIb6tVyjK4eJHRWw
xtP65M72qM8T65+gsxYcdm0lvIDLidIwFS+2g4ibKU7EwlYkTC9AHFIAyFKTgxeP
MpkIH90mKhSxOpbq8RICgr2jWcJZYoFQ4soi1oE+bgyjv75PyhJ0eXOprCh/4KZp
nkXlPy2skkO9gGecyvr51x/opDEjEkObyOjQm2LhhWYvgcnHgW8Zp1jhQKxabHvz
iZdVIs/agOerpk1d9ZBHhIXOeS2UcE5klqVRAdf961Wobh+HNis=
=cCvI
-----END PGP SIGNATURE-----
Merge tag 'for-4.20-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"Some of these bugs are being hit during testing so we'd like to get
them merged, otherwise there are usual stability fixes for stable
trees"
* tag 'for-4.20-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: relocation: set trans to be NULL after ending transaction
Btrfs: fix race between enabling quotas and subvolume creation
Btrfs: send, fix infinite loop due to directory rename dependencies
Btrfs: ensure path name is null terminated at btrfs_control_ioctl
Btrfs: fix rare chances for data loss when doing a fast fsync
btrfs: Always try all copies when reading extent buffers
[Description]
In a heavily loaded system where the system pagecache is nearing memory
limits and fscache is enabled, pages can be leaked by fscache while trying
read pages from cachefiles backend. This can happen because two
applications can be reading same page from a single mount, two threads can
be trying to read the backing page at same time. This results in one of
the threads finding that a page for the backing file or netfs file is
already in the radix tree. During the error handling cachefiles does not
clean up the reference on backing page, leading to page leak.
[Fix]
The fix is straightforward, to decrement the reference when error is
encountered.
[dhowells: Note that I've removed the clearance and put of newpage as
they aren't attested in the commit message and don't appear to actually
achieve anything since a new page is only allocated is newpage!=NULL and
any residual new page is cleared before returning.]
[Testing]
I have tested the fix using following method for 12+ hrs.
1) mkdir -p /mnt/nfs ; mount -o vers=3,fsc <server_ip>:/export /mnt/nfs
2) create 10000 files of 2.8MB in a NFS mount.
3) start a thread to simulate heavy VM presssure
(while true ; do echo 3 > /proc/sys/vm/drop_caches ; sleep 1 ; done)&
4) start multiple parallel reader for data set at same time
find /mnt/nfs -type f | xargs -P 80 cat > /dev/null &
find /mnt/nfs -type f | xargs -P 80 cat > /dev/null &
find /mnt/nfs -type f | xargs -P 80 cat > /dev/null &
..
..
find /mnt/nfs -type f | xargs -P 80 cat > /dev/null &
find /mnt/nfs -type f | xargs -P 80 cat > /dev/null &
5) finally check using cat /proc/fs/fscache/stats | grep -i pages ;
free -h , cat /proc/meminfo and page-types -r -b lru
to ensure all pages are freed.
Reviewed-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: Shantanu Goel <sgoel01@yahoo.com>
Signed-off-by: Kiran Kumar Modukuri <kiran.modukuri@gmail.com>
[dja: forward ported to current upstream]
Signed-off-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: David Howells <dhowells@redhat.com>
kmem_cache_destroy(NULL) is safe, so removes NULL check before
freeing the mem. This patch also fix ifnullfree.cocci warnings.
Signed-off-by: Wen Yang <wen.yang99@zte.com.cn>
Signed-off-by: David Teigland <teigland@redhat.com>
If cachefiles gets an error other then ENOENT when trying to look up an
object in the cache (in this case, EACCES), the object state machine will
eventually transition to the DROP_OBJECT state.
This state invokes fscache_drop_object() which tries to sync the auxiliary
data with the cache (this is done lazily since commit 402cb8dda9) on an
incomplete cache object struct.
The problem comes when cachefiles_update_object_xattr() is called to
rewrite the xattr holding the data. There's an assertion there that the
cache object points to a dentry as we're going to update its xattr. The
assertion trips, however, as dentry didn't get set.
Fix the problem by skipping the update in cachefiles if the object doesn't
refer to a dentry. A better way to do it could be to skip the update from
the DROP_OBJECT state handler in fscache, but that might deny the cache the
opportunity to update intermediate state.
If this error occurs, the kernel log includes lines that look like the
following:
CacheFiles: Lookup failed error -13
CacheFiles:
CacheFiles: Assertion failed
------------[ cut here ]------------
kernel BUG at fs/cachefiles/xattr.c:138!
...
Workqueue: fscache_object fscache_object_work_func [fscache]
RIP: 0010:cachefiles_update_object_xattr.cold.4+0x18/0x1a [cachefiles]
...
Call Trace:
cachefiles_update_object+0xdd/0x1c0 [cachefiles]
fscache_update_aux_data+0x23/0x30 [fscache]
fscache_drop_object+0x18e/0x1c0 [fscache]
fscache_object_work_func+0x74/0x2b0 [fscache]
process_one_work+0x18d/0x340
worker_thread+0x2e/0x390
? pwq_unbound_release_workfn+0xd0/0xd0
kthread+0x112/0x130
? kthread_bind+0x30/0x30
ret_from_fork+0x35/0x40
Note that there are actually two issues here: (1) EACCES happened on a
cache object and (2) an oops occurred. I think that the second is a
consequence of the first (it certainly looks like it ought to be). This
patch only deals with the second.
Fixes: 402cb8dda9 ("fscache: Attach the index key and aux data to the cookie")
Reported-by: Zhibin Li <zhibli@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
If we want to re-enable nat_bits, we rely on fsck which requires full scan
of directory tree. Let's do that by regular fsck or unclean shutdown.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
We fail to advance the read pointer when reading the stat.oh field that
identifies the lock-holder in a TEST result.
This turns out not to matter if the server is knfsd, which always
returns a zero-length field. But other servers (Ganesha is an example)
may not do this. The result is bad values in fcntl F_GETLK results.
Fix this.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
The idea here was that renaming a file on a nosubtreecheck export would
make lookups of the old filehandle return STALE, making it impossible
for clients to reclaim opens.
But during the grace period I think we should also hold off on
operations that would break delegations.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Zero-length writes are legal; from 5661 section 18.32.3: "If the count
is zero, the WRITE will succeed and return a count of zero subject to
permissions checking".
This check is unnecessary and is causing zero-length reads to return
EINVAL.
Cc: stable@vger.kernel.org
Fixes: 3fd9557aec "NFSD: Refactor the generic write vector fill helper"
Cc: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Now that synchronize_rcu() waits for preempt-disable regions of code
as well as RCU read-side critical sections, synchronize_sched() can be
replaced by synchronize_rcu(). This commit therefore makes this change.
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: <linux-fsdevel@vger.kernel.org>
kernfs_notify() does two notifications: poll and fsnotify. Originally,
both notifications were done from scheduled work context and all that
kernfs_notify() did was schedule the work.
This patch simply moves the poll notification from the scheduled work
handler to kernfs_notify(). The fsnotify notification still needs to be
done from scheduled work context because it can sleep (it needs to lock
a mutex).
If the poll notification is time critical (the notified thread needs to
wake as quickly as possible), it's better to do it from kernfs_notify()
directly. One example is calling sysfs_notify_dirent() from a hardware
interrupt handler to wake up a thread and handle the interrupt in user
space.
Signed-off-by: Radu Rendec <radu.rendec@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The function ext2_xattr_set calls brelse(bh) to drop the reference count
of bh. After that, bh may be freed. However, following brelse(bh),
it reads bh->b_data via macro HDR(bh). This may result in a
use-after-free bug. This patch moves brelse(bh) after reading field.
CC: stable@vger.kernel.org
Signed-off-by: Pan Bian <bianpan2016@163.com>
Signed-off-by: Jan Kara <jack@suse.cz>
We need to initialize opts.s_mount_opt as zero before using it, else we
may get some unexpected mount options.
Fixes: 088519572c ("ext2: Parse mount options into a dedicated structure")
CC: stable@vger.kernel.org
Signed-off-by: xingaopeng <xingaopeng@huawei.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Previously, we added a parameter @map.m_may_create to trigger OPU
allocation and call f2fs_balance_fs() correctly.
But in get_more_blocks(), @create has been overwritten by below code.
So the function f2fs_map_blocks() will not allocate new block address
but directly go out. Meanwile,there are several functions calling
f2fs_map_blocks() directly and @map.m_may_create not initialized.
CODE:
create = dio->op == REQ_OP_WRITE;
if (dio->flags & DIO_SKIP_HOLES) {
if (fs_startblk <= ((i_size_read(dio->inode) - 1) >>
i_blkbits))
create = 0;
}
This patch fixes it.
Signed-off-by: Jia Zhu <zhujia13@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Previously, we allocated a new block address for OPU mode in direct_IO.
But the new address couldn't be assigned to @map->m_pblk correctly.
This patch fix it.
Cc: <stable@vger.kernel.org>
Fixes: 511f52d02f05 ("f2fs: allow out-place-update for direct IO in LFS mode")
Signed-off-by: Jia Zhu <zhujia13@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Adjust the trace print in f2fs_get_victim() to cover GC done by
F2FS_IOC_GARBAGE_COLLECT_RANGE.
Signed-off-by: Sahitya Tummala <stummala@codeaurora.org>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Allow node type segments also to be GC'd via f2fs ioctl
F2FS_IOC_GARBAGE_COLLECT_RANGE.
Signed-off-by: Sahitya Tummala <stummala@codeaurora.org>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The function truncate_node frees the page with f2fs_put_page. However,
the page index is read after that. So, the patch reads the index before
freeing the page.
Fixes: bf39c00a9a ("f2fs: drop obsolete node page when it is truncated")
Cc: <stable@vger.kernel.org>
Signed-off-by: Pan Bian <bianpan2016@163.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
When call f2fs_acl_create_masq() failed, the caller f2fs_acl_create()
should return -EIO instead of -ENOMEM, this patch makes it consistent
with posix_acl_create() which has been fixed in commit beaf226b86
("posix_acl: don't ignore return value of posix_acl_create_masq()").
Fixes: 83dfe53c18 ("f2fs: fix reference leaks in f2fs_acl_create")
Signed-off-by: Tiezhu Yang <kernelpatch@126.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
After merging the f2fs tree, today's linux-next build
(x86_64_allmodconfig) produced this warning:
In file included from fs/f2fs/dir.c:11:
fs/f2fs/f2fs.h: In function '__mark_inode_dirty_flag':
fs/f2fs/f2fs.h:2388:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
if (set)
^
fs/f2fs/f2fs.h:2390:2: note: here
case FI_DATA_EXIST:
^~~~
Exposed by my use of -Wimplicit-fallthrough
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The following race could lead to inconsistent SIT bitmap:
Task A Task B
====== ======
f2fs_write_checkpoint
block_operations
f2fs_lock_all
down_write(node_change)
down_write(node_write)
... sync ...
up_write(node_change)
f2fs_file_write_iter
set_inode_flag(FI_NO_PREALLOC)
......
f2fs_write_begin(index=0, has inline data)
prepare_write_begin
__do_map_lock(AIO) => down_read(node_change)
f2fs_convert_inline_page => update SIT
__do_map_lock(AIO) => up_read(node_change)
f2fs_flush_sit_entries <= inconsistent SIT
finish write checkpoint
sudden-power-off
If SPO occurs after checkpoint is finished, SIT bitmap will be set
incorrectly.
Signed-off-by: Sheng Yong <shengyong1@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If namelen is corrupted to have very long value, fill_dentries can copy
wrong memory area.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Previously, when f2fs finds which temp bio cache owns the target page,
it will flush all the three temp bio caches, but we only need to flush
one single bio cache indeed, which can help to keep bio merged.
Signed-off-by: Yunlong Song <yunlong.song@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In get_more_blocks(), we may override @create as below code:
create = dio->op == REQ_OP_WRITE;
if (dio->flags & DIO_SKIP_HOLES) {
if (fs_startblk <= ((i_size_read(dio->inode) - 1) >>
i_blkbits))
create = 0;
}
But in f2fs_map_blocks(), we only trigger f2fs_balance_fs() if @create
is 1, so in LFS mode, dio overwrite under LFS mode can easily run out
of free segments, result in below panic.
Call Trace:
allocate_segment_by_default+0xa8/0x270 [f2fs]
f2fs_allocate_data_block+0x1ea/0x5c0 [f2fs]
__allocate_data_block+0x306/0x480 [f2fs]
f2fs_map_blocks+0x6f6/0x920 [f2fs]
__get_data_block+0x4f/0xb0 [f2fs]
get_data_block_dio_write+0x50/0x60 [f2fs]
do_blockdev_direct_IO+0xcd5/0x21e0
__blockdev_direct_IO+0x3a/0x3c
f2fs_direct_IO+0x1ff/0x4a0 [f2fs]
generic_file_direct_write+0xd9/0x160
__generic_file_write_iter+0xbb/0x1e0
f2fs_file_write_iter+0xaf/0x220 [f2fs]
__vfs_write+0xd0/0x130
vfs_write+0xb2/0x1b0
SyS_pwrite64+0x69/0xa0
? vtime_user_exit+0x29/0x70
do_syscall_64+0x6e/0x160
entry_SYSCALL64_slow_path+0x25/0x25
RIP: new_curseg+0x36f/0x380 [f2fs] RSP: ffffac570393f7a8
So this patch introduces a parameter map.m_may_create to indicate that
f2fs_map_blocks() is called from write or read path, which can give the
right hint to let f2fs_map_blocks() trigger OPU allocation and call
f2fs_balanc_fs() correctly.
BTW, it disables physical address preallocation for direct IO in
f2fs_preallocate_blocks, which is redundant to OPU allocation of
f2fs_map_blocks.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch adds f2fs_dio_submit_bio() to hook submit_io/end_io functions
in direct IO path, in order to account DIO.
Later, we will add this count into is_idle() to let background GC/Discard
thread be aware of DIO.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch move dir data flush to write checkpoint process, by
doing this, it may reduce some time for dir fsync.
pre:
-f2fs_do_sync_file enter
-file_write_and_wait_range <- flush & wait
-write_checkpoint
-do_checkpoint <- wait all
-f2fs_do_sync_file exit
now:
-f2fs_do_sync_file enter
-write_checkpoint
-block_operations <- flush dir & no wait
-do_checkpoint <- wait all
-f2fs_do_sync_file exit
Signed-off-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Use DEFINE_SHOW_ATTRIBUTE macro to simplify the code.
Signed-off-by: Yangtao Li <tiny.windzz@gmail.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
f2fs_ioc_gc_range skips blocks_per_seg each time, however, f2fs_gc moves
blocks of section each time, so fix it from segment to section.
Signed-off-by: Yunlong Song <yunlong.song@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Add one sysfs entry to control migration granularity of GC in large
section f2fs, it can be tuned to mitigate heavy overhead of migrating
huge number of blocks in large section.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In F2FS_HAS_FEATURE(), we will use F2FS_SB(sb) to get sbi pointer to
access .raw_super field, to avoid unneeded pointer conversion, this
patch changes to F2FS_HAS_FEATURE() accept sbi parameter directly.
Just do cleanup, no logic change.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Section is minimal garbage collection unit of f2fs, in zoned block
device, or ancient block mapping flash device, in order to improve
GC efficiency, we can align GC unit to lower device erase unit,
normally, it consists of multiple of segments.
Once background or foreground GC triggers, it brings a large number
of IOs, which will impact user IO, and also occupy cpu/memory resource
intensively.
So, to reduce impact of GC on large size section, this patch supports
subsectional GC, in one cycle of GC, it only migrate partial segment{s}
in victim section. Currently, by default, we use sbi->segs_per_sec as
migration granularity.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Introduce a wrapper __is_large_section() to clean up codes.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
When sbi->segs_per_sec > 1, and if some segno has 0 valid blocks before
gc starts, do_garbage_collect will skip counting seg_freed++, and this
will cause seg_freed < sbi->segs_per_sec and finally skip sec_freed++.
Signed-off-by: Yunlong Song <yunlong.song@huawei.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Previously, we only account preflush command for flush_merge mode,
so for noflush_merge mode, we can not know in-flight preflush
command count, fix it.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The encrypted file may be corrupted by GC in following case:
Time 1: | segment 1 blkaddr = A | GC -> | segment 2 blkaddr = B |
Encrypted block 1 is moved from blkaddr A of segment 1 to blkaddr B of
segment 2,
Time 2: | segment 1 blkaddr = B | GC -> | segment 3 blkaddr = C |
Before page 1 is written back and if segment 2 become a victim, then
page 1 is moved from blkaddr B of segment 2 to blkaddr Cof segment 3,
during the GC process of Time 2, f2fs should wait for page 1 written back
before reading it, or move_data_block will read a garbage block from
blkaddr B since page is not written back to blkaddr B yet.
Commit 6aa58d8a ("f2fs: readahead encrypted block during GC") introduce
ra_data_block to read encrypted block, but it forgets to add
f2fs_wait_on_page_writeback to avoid racing between GC and flush.
Signed-off-by: Yunlong Song <yunlong.song@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
When project is set, we should use inode limit minus the used count
Signed-off-by: Ye Yin <dbyin@tencent.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
blk_poll() has always kept spinning until it found an IO. This is
fine for SYNC polling, since we need to find one request we have
pending, but in preparation for ASYNC polling it can be beneficial
to just check if we have any entries available or not.
Existing callers are converted to pass in 'spin == true', to retain
the old behavior.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Today, when sb_bread() returns NULL, this can either be because of an
I/O error or because the system failed to allocate the buffer. Since
it's an old interface, changing would require changing many call
sites.
So instead we create our own ext4_sb_bread(), which also allows us to
set the REQ_META flag.
Also fixed a problem in the xattr code where a NULL return in a
function could also mean that the xattr was not found, which could
lead to the wrong error getting returned to userspace.
Fixes: ac27a0ec11 ("ext4: initial copy of files from ext3")
Cc: stable@kernel.org # 2.6.19
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Highlights include:
Bugfixes:
- Fix a NFSv4 state manager deadlock when returning a delegation
- NFSv4.2 copy do not allocate memory under the lock
- flexfiles: Use the correct stateid for IO in the tightly coupled case
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJb+hCNAAoJEA4mA3inWBJc8ZQP/jR+uemJycwgyWINvnn6PEtE
hyiSwL+c3jhBHwX2IroF1KvaHIa8GXMbIWj+DfW1iB2htYnIJYz4IFJOGpfN1S7n
bKCgonV0V06+dFF4DqcL3HcM1L6bo26n16voi3otgY0R5U5HGwB1tocZPCbR6DpK
meiRbrmB6O962zluUlTuu9zFSvsALyZR0h4tYMGYA0MlgWQJVLH6+dufyG2Zgp2Z
OH9tUzRFknD/Q4KrJv7zrMY198mHa+RQovsO2Jc/iE4bbrSMyVNtrPuVJphsP1BD
lZ5SvvWLXjNepUMsDCK+Es7i6dUmtHsGPS6gNDwUwY9/UlwOPYlp44VJzmEYmQcz
/VrrHn3LSoKDSAVNrksghto9O4T1NPnuVja1Q+SHf5hVX5OlsxyDkvX23ZUdgdkW
BeXeNWZuAJdDTI1KU+ahm2ilfUnuFpRGRHUrH2sYczV2okC38cO5YCIRI3Tckz6e
jrhmJcw+zCWv3Yl3h2Rbf8fVRcWJHA+qLWT3Str5nCyZiqPCag7Z7br9r5316zDv
Yma7nITZO7HH1cZUv+byA0PVHU96kDsMhhpxYISrSr4lf2BcZNnjQC/0IHb7qdWz
FgpYzv/BsIi+KxyZKshiR5E60kOmVxv2wIhre8uLOuuabcGsh/wit6URVnQJ+GDp
7klRY1t1P24XaIbgBR9U
=hqbe
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-4.20-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client bugfixes from Trond Myklebust:
- Fix a NFSv4 state manager deadlock when returning a delegation
- NFSv4.2 copy do not allocate memory under the lock
- flexfiles: Use the correct stateid for IO in the tightly coupled case
* tag 'nfs-for-4.20-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
flexfiles: use per-mirror specified stateid for IO
NFSv4.2 copy do not allocate memory under the lock
NFSv4: Fix a NFSv4 state manager deadlock
We found some bugs in the DAX conversion to XArray (and one bug which
predated the XArray conversion). There were a couple of bugs in some of
the higher-level functions, which aren't actually being called in today's
kernel, but surfaced as a result of converting existing radix tree &
IDR users over to the XArray. Some of the other changes to how the
higher-level APIs work were also motivated by converting various users;
again, they're not in use in today's kernel, so changing them has a low
probability of introducing a bug.
Dan can still trigger a bug in the DAX code with hot-offline/online,
and we're working on tracking that down.
-----BEGIN PGP SIGNATURE-----
iQFIBAABCgAyFiEEejHryeLBw/spnjHrDpNsjXcpgj4FAlv542AUHHdpbGx5QGlu
ZnJhZGVhZC5vcmcACgkQDpNsjXcpgj5BoAf/QZzbBcYuYMLMDYofvHKGlmk2yx/a
ObUlxlQtXGHvPp3oC3rdwAvcN/KAMDpU0u+PXab2MnrNw5okhpS6ZwGODlkarNA4
XbVQNGbtEbACr1V3CWc0NzLbYm6JtGpMum0Wx9MVR/VdTnGArBLBYQMYa/c1YhKA
vEBPf+w0j0QoCTAgPiIvq0aksuBQERUvjhlUvoaMY7F4sAhnaW558lvaEcc1xGxq
70+3cRPT6Uh12tEvi0LKP1NNEXebvQSftMvFEUPF2xo5z2v//KEobzv/anbojxQ8
BtxouIGSr4tME9g3xSpd9rTbUcW3bwDAhuWZvpP/ViRwW2UkEQonpApdaw==
=0Ert
-----END PGP SIGNATURE-----
Merge tag 'xarray-4.20-rc4' of git://git.infradead.org/users/willy/linux-dax
Pull XArray updates from Matthew Wilcox:
"We found some bugs in the DAX conversion to XArray (and one bug which
predated the XArray conversion). There were a couple of bugs in some
of the higher-level functions, which aren't actually being called in
today's kernel, but surfaced as a result of converting existing radix
tree & IDR users over to the XArray.
Some of the other changes to how the higher-level APIs work were also
motivated by converting various users; again, they're not in use in
today's kernel, so changing them has a low probability of introducing
a bug.
Dan can still trigger a bug in the DAX code with hot-offline/online,
and we're working on tracking that down"
* tag 'xarray-4.20-rc4' of git://git.infradead.org/users/willy/linux-dax:
XArray tests: Add missing locking
dax: Avoid losing wakeup in dax_lock_mapping_entry
dax: Fix huge page faults
dax: Fix dax_unlock_mapping_entry for PMD pages
dax: Reinstate RCU protection of inode
dax: Make sure the unlocking entry isn't locked
dax: Remove optimisation from dax_lock_mapping_entry
XArray tests: Correct some 64-bit assumptions
XArray: Correct xa_store_range
XArray: Fix Documentation
XArray: Handle NULL pointers differently for allocation
XArray: Unify xa_store and __xa_store
XArray: Add xa_store_bh() and xa_store_irq()
XArray: Turn xa_erase into an exported function
XArray: Unify xa_cmpxchg and __xa_cmpxchg
XArray: Regularise xa_reserve
nilfs2: Use xa_erase_irq
XArray: Export __xa_foo to non-GPL modules
XArray: Fix xa_for_each with a single element at 0
- Numerous corruption fixes for copy on write
- Numerous corruption fixes for blocksize < pagesize writes
- Don't miscalculate AG reservations for small final AGs
- Fix page cache truncation to work properly for reflink and extent
shifting
- Fix use-after-free when retrying failed inode/dquot buffer logging
- Fix corruptions seen when using copy_file_range in directio mode
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAlv1oroACgkQ+H93GTRK
tOu+1RAAnteFwaq3WLDYmSrMia/4m52sxvatxlCCSjCGdvfvuZozTwbTdB6FFfuc
Ql6Z6F2Lx1sHDNJvwBCsO8qPB0qOhjSnBI/wPe2kz/NETGYNp88vHX7OZvkPVONl
jDaCWTcu0BWNiOGi17uTY8sBa8u1izbo5F+pEQIyUjoCgUc9JB2di9dVnUJ0byrh
wZjrmPD95ojqOozqppXfFQ0QIbozpTXR3kyU9S0EhHmbnWJZ9t08Iuhd2LjOoDB4
cUFG/1qDXuFvALyM8m1mA7xSBZpA/glFgNeAtmz53aIOZ9Tl8w8cLJJBRx5AqUDU
bpBU1y08Bm3OIw+uiTMkiPkCQRMDgtiHKlPxuiKqlsNY0KqYgwWlWcbU/OTvHly8
In+CnbEBqLejKyEIz3nEQ4YimfvHbAlC/3V+nx2qO45hvTXA5lEIGAbBmiLW0ni8
6eBXGeIjKAw0zYOoXC4OuKIiHlQh7AHJB25i9xJTzknRI9jqwZFGkxgdl33Vrq8W
gTnfgOhMX2dGmcPrgMgtu+aiBwKf+GJv94/2EJwligExnWXQSsQmGCwKl7ysoE1g
iU/MhJT5IYYP/TDqldkahUPSwD2FN4UFtzNfpeDX3H6kxM1R41l+aerdu64UPNji
G98U+cWyyUmbu9ziLyREM/XyWz4UhNAz7lRId3ryeu8GPUm2AoY=
=TiLQ
-----END PGP SIGNATURE-----
Merge tag 'xfs-4.20-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs fixes from Darrick Wong:
"Dave and I have continued our work fixing corruption problems that can
be found when running long-term burn-in exercisers on xfs. Here are
some patches fixing most of the problems, but there will likely be
more. :/
- Numerous corruption fixes for copy on write
- Numerous corruption fixes for blocksize < pagesize writes
- Don't miscalculate AG reservations for small final AGs
- Fix page cache truncation to work properly for reflink and extent
shifting
- Fix use-after-free when retrying failed inode/dquot buffer logging
- Fix corruptions seen when using copy_file_range in directio mode"
* tag 'xfs-4.20-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
iomap: readpages doesn't zero page tail beyond EOF
vfs: vfs_dedupe_file_range() doesn't return EOPNOTSUPP
iomap: dio data corruption and spurious errors when pipes fill
iomap: sub-block dio needs to zeroout beyond EOF
iomap: FUA is wrong for DIO O_DSYNC writes into unwritten extents
xfs: delalloc -> unwritten COW fork allocation can go wrong
xfs: flush removing page cache in xfs_reflink_remap_prep
xfs: extent shifting doesn't fully invalidate page cache
xfs: finobt AG reserves don't consider last AG can be a runt
xfs: fix transient reference count error in xfs_buf_resubmit_failed_buffers
xfs: uncached buffer tracing needs to print bno
xfs: make xfs_file_remap_range() static
xfs: fix shared extent data corruption due to missing cow reservation
- Fix tasks freezer deadlock in de_thread() that occurs if one
of its sub-threads has been frozen already (Chanho Min).
- Avoid registering a platform device by the ti-cpufreq driver
on platforms that cannot use it (Dave Gerlach).
- Fix a mistake in the ti-opp-supply operating performance points
(OPP) driver that caused an incorrect reference voltage to be
used and make it adjust the minimum voltage dynamically to avoid
hangs or crashes in some cases (Keerthy).
- Fix issues related to compiler flags in the cpupower utility
and correct a linking problem in it by renaming a file with
a duplicate name (Jiri Olsa, Konstantin Khlebnikov).
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJb99ANAAoJEILEb/54YlRxUk4QAIkpXmLdEm0zvQXJZKJxuxu/
8eiglPcoCAAVgR3uRjezTyQeFCDjqGIqmpiipFOkxgAaeb+aiFta2B88B4oZ3OAf
G5yJG4CMs5+Tp/oH2+McX4PMo2WfoEIDoOGVOdtlk4iGAIh9tT5KvNfCUyxHLP7c
cFjdeUQuc8PRPzutESxYHB23FmECCEprVmEFVckQ7vSCnWX6dm1mFteCiNZADH8/
YNgRWRYtyWXlPRNCPJuCvYmVHLgm0Tw6CVhG4ttXT9wIkYyiLK9Flx9X6gsBWdoS
ej1jbJM7R/EAUzHvCjUCSfMKDNWvQySR3gC2m0vG5Goext2UXWieSHaI6BfqjPP2
wTBX9lAKu3iIGISoorjJZAMe4v26Tpbs0v5ApsOGr50WNZKZtRkBwaJ0EFT2hEa8
MsDtUjr+d7CIQ+ZDHmAbOzghpDlSuhGypfkD7IPtA/AY1dy1wJ3BUYya8ucqSdr3
iKkQcndN3ASR1+Fyb3c1cflh9fWe84aeSmYPihcX2m+/D43keYXbj2ANdCH1dF3E
cwAXw9+VteUQ1tihqgJapFogK3VphMbRErPSGXryuyiMxr38g7tgHAd2rId0JG4g
QftqpZa19aEed8tJNaQWWjBaBFgYeAT9BQqcP6CjFOO9klPH//NSiqzQUFMdzsEc
Qql1V/aoOIsflb7gJH3/
=JQxr
-----END PGP SIGNATURE-----
Merge tag 'pm-4.20-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fixes from Rafael Wysocki:
"These fix two issues in the Operating Performance Points (OPP)
framework, one cpufreq driver issue, one problem related to the tasks
freezer and a few build-related issues in the cpupower utility.
Specifics:
- Fix tasks freezer deadlock in de_thread() that occurs if one of its
sub-threads has been frozen already (Chanho Min).
- Avoid registering a platform device by the ti-cpufreq driver on
platforms that cannot use it (Dave Gerlach).
- Fix a mistake in the ti-opp-supply operating performance points
(OPP) driver that caused an incorrect reference voltage to be used
and make it adjust the minimum voltage dynamically to avoid hangs
or crashes in some cases (Keerthy).
- Fix issues related to compiler flags in the cpupower utility and
correct a linking problem in it by renaming a file with a duplicate
name (Jiri Olsa, Konstantin Khlebnikov)"
* tag 'pm-4.20-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
exec: make de_thread() freezable
cpufreq: ti-cpufreq: Only register platform_device when supported
opp: ti-opp-supply: Correct the supply in _get_optimal_vdd_voltage call
opp: ti-opp-supply: Dynamically update u_volt_min
tools cpupower: Override CFLAGS assignments
tools cpupower debug: Allow to use outside build flags
tools/power/cpupower: fix compilation with STATIC=true
The function dentry_connected calls dput(dentry) to drop the previously
acquired reference to dentry. In this case, dentry can be released.
After that, IS_ROOT(dentry) checks the condition
(dentry == dentry->d_parent), which may result in a use-after-free bug.
This patch directly compares dentry with its parent obtained before
dropping the reference.
Fixes: a056cc8934c("exportfs: stop retrying once we race with
rename/remove")
Signed-off-by: Pan Bian <bianpan2016@163.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The function relocate_block_group calls btrfs_end_transaction to release
trans when update_backref_cache returns 1, and then continues the loop
body. If btrfs_block_rsv_refill fails this time, it will jump out the
loop and the freed trans will be accessed. This may result in a
use-after-free bug. The patch assigns NULL to trans after trans is
released so that it will not be accessed.
Fixes: 0647bf564f ("Btrfs: improve forever loop when doing balance relocation")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Pan Bian <bianpan2016@163.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
rfc8435 says:
For tight coupling, ffds_stateid provides the stateid to be used by
the client to access the file.
However current implementation replaces per-mirror provided stateid with
by open or lock stateid.
Ensure that per-mirror stateid is used by ff_layout_write_prepare_v4 and
nfs4_ff_layout_prepare_ds.
Signed-off-by: Tigran Mkrtchyan <tigran.mkrtchyan@desy.de>
Signed-off-by: Rick Macklem <rmacklem@uoguelph.ca>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Bruce pointed out that we shouldn't allocate memory while holding
a lock in the nfs4_callback_offload() and handle_async_copy()
that deal with a racing CB_OFFLOAD and reply to COPY case.
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
We have a race between enabling quotas end subvolume creation that cause
subvolume creation to fail with -EINVAL, and the following diagram shows
how it happens:
CPU 0 CPU 1
btrfs_ioctl()
btrfs_ioctl_quota_ctl()
btrfs_quota_enable()
mutex_lock(fs_info->qgroup_ioctl_lock)
btrfs_ioctl()
create_subvol()
btrfs_qgroup_inherit()
-> save fs_info->quota_root
into quota_root
-> stores a NULL value
-> tries to lock the mutex
qgroup_ioctl_lock
-> blocks waiting for
the task at CPU0
-> sets BTRFS_FS_QUOTA_ENABLED in fs_info
-> sets quota_root in fs_info->quota_root
(non-NULL value)
mutex_unlock(fs_info->qgroup_ioctl_lock)
-> checks quota enabled
flag is set
-> returns -EINVAL because
fs_info->quota_root was
NULL before it acquired
the mutex
qgroup_ioctl_lock
-> ioctl returns -EINVAL
Returning -EINVAL to user space will be confusing if all the arguments
passed to the subvolume creation ioctl were valid.
Fix it by grabbing the value from fs_info->quota_root after acquiring
the mutex.
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
make_bad_inode() sets inode->i_mode to S_IFREG if I/O error is detected
in fuse_do_getattr()/fuse_do_setattr(). If the inode is not a regular
file, write_files and queued_writes in fuse_inode are not initialized
and have NULL or invalid pointers written by other members in a union.
So, list_empty() returns false in fuse_destroy_inode(). Add
is_bad_inode() to check if make_bad_inode() was called.
Reported-by: syzbot+b9c89b84423073226299@syzkaller.appspotmail.com
Fixes: ab2257e994 ("fuse: reduce size of struct fuse_inode")
Signed-off-by: Myungho Jung <mhjungk@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
When we read the EOF page of the file via readpages, we need
to zero the region beyond EOF that we either do not read or
should not contain data so that mmap does not expose stale data to
user applications.
However, iomap_adjust_read_range() fails to detect EOF correctly,
and so fsx on 1k block size filesystems fails very quickly with
mapreads exposing data beyond EOF. There are two problems here.
Firstly, when calculating the end block of the EOF byte, we have
to round the size by one to avoid a block aligned EOF from reporting
a block too large. i.e. a size of 1024 bytes is 1 block, which in
index terms is block 0. Therefore we have to calculate the end block
from (isize - 1), not isize.
The second bug is determining if the current page spans EOF, and so
whether we need split it into two half, one for the IO, and the
other for zeroing. Unfortunately, the code that checks whether
we should split the block doesn't actually check if we span EOF, it
just checks if the read spans the /offset in the page/ that EOF
sits on. So it splits every read into two if EOF is not page
aligned, regardless of whether we are reading the EOF block or not.
Hence we need to restrict the "does the read span EOF" check to
just the page that spans EOF, not every page we read.
This patch results in correct EOF detection through readpages:
xfs_vm_readpages: dev 259:0 ino 0x43 nr_pages 24
xfs_iomap_found: dev 259:0 ino 0x43 size 0x66c00 offset 0x4f000 count 98304 type hole startoff 0x13c startblock 1368 blockcount 0x4
iomap_readpage_actor: orig pos 323584 pos 323584, length 4096, poff 0 plen 4096, isize 420864
xfs_iomap_found: dev 259:0 ino 0x43 size 0x66c00 offset 0x50000 count 94208 type hole startoff 0x140 startblock 1497 blockcount 0x5c
iomap_readpage_actor: orig pos 327680 pos 327680, length 94208, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 331776 pos 331776, length 90112, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 335872 pos 335872, length 86016, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 339968 pos 339968, length 81920, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 344064 pos 344064, length 77824, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 348160 pos 348160, length 73728, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 352256 pos 352256, length 69632, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 356352 pos 356352, length 65536, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 360448 pos 360448, length 61440, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 364544 pos 364544, length 57344, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 368640 pos 368640, length 53248, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 372736 pos 372736, length 49152, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 376832 pos 376832, length 45056, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 380928 pos 380928, length 40960, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 385024 pos 385024, length 36864, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 389120 pos 389120, length 32768, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 393216 pos 393216, length 28672, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 397312 pos 397312, length 24576, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 401408 pos 401408, length 20480, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 405504 pos 405504, length 16384, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 409600 pos 409600, length 12288, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 413696 pos 413696, length 8192, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 417792 pos 417792, length 4096, poff 0 plen 3072, isize 420864
iomap_readpage_actor: orig pos 420864 pos 420864, length 1024, poff 3072 plen 1024, isize 420864
As you can see, it now does full page reads until the last one which
is split correctly at the block aligned EOF, reading 3072 bytes and
zeroing the last 1024 bytes. The original version of the patch got
this right, but it got another case wrong.
The EOF detection crossing really needs to the the original length
as plen, while it starts at the end of the block, will be shortened
as up-to-date blocks are found on the page. This means "orig_pos +
plen" no longer points to the end of the page, and so will not
correctly detect EOF crossing. Hence we have to use the length
passed in to detect this partial page case:
xfs_filemap_fault: dev 259:1 ino 0x43 write_fault 0
xfs_vm_readpage: dev 259:1 ino 0x43 nr_pages 1
xfs_iomap_found: dev 259:1 ino 0x43 size 0x2cc00 offset 0x2c000 count 4096 type hole startoff 0xb0 startblock 282 blockcount 0x4
iomap_readpage_actor: orig pos 180224 pos 181248, length 4096, poff 1024 plen 2048, isize 183296
xfs_iomap_found: dev 259:1 ino 0x43 size 0x2cc00 offset 0x2cc00 count 1024 type hole startoff 0xb3 startblock 285 blockcount 0x1
iomap_readpage_actor: orig pos 183296 pos 183296, length 1024, poff 3072 plen 1024, isize 183296
Heere we see a trace where the first block on the EOF page is up to
date, hence poff = 1024 bytes. The offset into the page of EOF is
3072, so the range we want to read is 1024 - 3071, and the range we
want to zero is 3072 - 4095. You can see this is split correctly
now.
This fixes the stale data beyond EOF problem that fsx quickly
uncovers on 1k block size filesystems.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
It returns EINVAL when the operation is not supported by the
filesystem. Fix it to return EOPNOTSUPP to be consistent with
the man page and clone_file_range().
Clean up the inconsistent error return handling while I'm there.
(I know, lipstick on a pig, but every little bit helps...)
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
When doing direct IO to a pipe for do_splice_direct(), then pipe is
trivial to fill up and overflow as it can only hold 16 pages. At
this point bio_iov_iter_get_pages() then returns -EFAULT, and we
abort the IO submission process. Unfortunately, iomap_dio_rw()
propagates the error back up the stack.
The error is converted from the EFAULT to EAGAIN in
generic_file_splice_read() to tell the splice layers that the pipe
is full. do_splice_direct() completely fails to handle EAGAIN errors
(it aborts on error) and returns EAGAIN to the caller.
copy_file_write() then completely fails to handle EAGAIN as well,
and so returns EAGAIN to userspace, having failed to copy the data
it was asked to.
Avoid this whole steaming pile of fail by having iomap_dio_rw()
silently swallow EFAULT errors and so do short reads.
To make matters worse, iomap_dio_actor() has a stale data exposure
bug bio_iov_iter_get_pages() fails - it does not zero the tail block
that it may have been left uncovered by partial IO. Fix the error
handling case to drop to the sub-block zeroing rather than
immmediately returning the -EFAULT error.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
If we are doing sub-block dio that extends EOF, we need to zero
the unused tail of the block to initialise the data in it it. If we
do not zero the tail of the block, then an immediate mmap read of
the EOF block will expose stale data beyond EOF to userspace. Found
with fsx running sub-block DIO sizes vs MAPREAD/MAPWRITE operations.
Fix this by detecting if the end of the DIO write is beyond EOF
and zeroing the tail if necessary.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
When we write into an unwritten extent via direct IO, we dirty
metadata on IO completion to convert the unwritten extent to
written. However, when we do the FUA optimisation checks, the inode
may be clean and so we issue a FUA write into the unwritten extent.
This means we then bypass the generic_write_sync() call after
unwritten extent conversion has ben done and we don't force the
modified metadata to stable storage.
This violates O_DSYNC semantics. The window of exposure is a single
IO, as the next DIO write will see the inode has dirty metadata and
hence will not use the FUA optimisation. Calling
generic_write_sync() after completion of the second IO will also
sync the first write and it's metadata.
Fix this by avoiding the FUA optimisation when writing to unwritten
extents.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Long saga. There have been days spent following this through dead end
after dead end in multi-GB event traces. This morning, after writing
a trace-cmd wrapper that enabled me to be more selective about XFS
trace points, I discovered that I could get just enough essential
tracepoints enabled that there was a 50:50 chance the fsx config
would fail at ~115k ops. If it didn't fail at op 115547, I stopped
fsx at op 115548 anyway.
That gave me two traces - one where the problem manifested, and one
where it didn't. After refining the traces to have the necessary
information, I found that in the failing case there was a real
extent in the COW fork compared to an unwritten extent in the
working case.
Walking back through the two traces to the point where the CWO fork
extents actually diverged, I found that the bad case had an extra
unwritten extent in it. This is likely because the bug it led me to
had triggered multiple times in those 115k ops, leaving stray
COW extents around. What I saw was a COW delalloc conversion to an
unwritten extent (as they should always be through
xfs_iomap_write_allocate()) resulted in a /written extent/:
xfs_writepage: dev 259:0 ino 0x83 pgoff 0x17000 size 0x79a00 offset 0 length 0
xfs_iext_remove: dev 259:0 ino 0x83 state RC|LF|RF|COW cur 0xffff888247b899c0/2 offset 32 block 152 count 20 flag 1 caller xfs_bmap_add_extent_delay_real
xfs_bmap_pre_update: dev 259:0 ino 0x83 state RC|LF|RF|COW cur 0xffff888247b899c0/1 offset 1 block 4503599627239429 count 31 flag 0 caller xfs_bmap_add_extent_delay_real
xfs_bmap_post_update: dev 259:0 ino 0x83 state RC|LF|RF|COW cur 0xffff888247b899c0/1 offset 1 block 121 count 51 flag 0 caller xfs_bmap_add_ex
Basically, Cow fork before:
0 1 32 52
+H+DDDDDDDDDDDD+UUUUUUUUUUU+
PREV RIGHT
COW delalloc conversion allocates:
1 32
+uuuuuuuuuuuu+
NEW
And the result according to the xfs_bmap_post_update trace was:
0 1 32 52
+H+wwwwwwwwwwwwwwwwwwwwwwww+
PREV
Which is clearly wrong - it should be a merged unwritten extent,
not an unwritten extent.
That lead me to look at the LEFT_FILLING|RIGHT_FILLING|RIGHT_CONTIG
case in xfs_bmap_add_extent_delay_real(), and sure enough, there's
the bug.
It takes the old delalloc extent (PREV) and adds the length of the
RIGHT extent to it, takes the start block from NEW, removes the
RIGHT extent and then updates PREV with the new extent.
What it fails to do is update PREV.br_state. For delalloc, this is
always XFS_EXT_NORM, while in this case we are converting the
delayed allocation to unwritten, so it needs to be updated to
XFS_EXT_UNWRITTEN. This LF|RF|RC case does not do this, and so
the resultant extent is always written.
And that's the bug I've been chasing for a week - a bmap btree bug,
not a reflink/dedupe/copy_file_range bug, but a BMBT bug introduced
with the recent in core extent tree scalability enhancements.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
On a sub-page block size filesystem, fsx is failing with a data
corruption after a series of operations involving copying a file
with the destination offset beyond EOF of the destination of the file:
8093(157 mod 256): TRUNCATE DOWN from 0x7a120 to 0x50000 ******WWWW
8094(158 mod 256): INSERT 0x25000 thru 0x25fff (0x1000 bytes)
8095(159 mod 256): COPY 0x18000 thru 0x1afff (0x3000 bytes) to 0x2f400
8096(160 mod 256): WRITE 0x5da00 thru 0x651ff (0x7800 bytes) HOLE
8097(161 mod 256): COPY 0x2000 thru 0x5fff (0x4000 bytes) to 0x6fc00
The second copy here is beyond EOF, and it is to sub-page (4k) but
block aligned (1k) offset. The clone runs the EOF zeroing, landing
in a pre-existing post-eof delalloc extent. This zeroes the post-eof
extents in the page cache just fine, dirtying the pages correctly.
The problem is that xfs_reflink_remap_prep() now truncates the page
cache over the range that it is copying it to, and rounds that down
to cover the entire start page. This removes the dirty page over the
delalloc extent from the page cache without having written it back.
Hence later, when the page cache is flushed, the page at offset
0x6f000 has not been written back and hence exposes stale data,
which fsx trips over less than 10 operations later.
Fix this by changing xfs_reflink_remap_prep() to use
xfs_flush_unmap_range().
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
When doing an incremental send, due to the need of delaying directory move
(rename) operations we can end up in infinite loop at
apply_children_dir_moves().
An example scenario that triggers this problem is described below, where
directory names correspond to the numbers of their respective inodes.
Parent snapshot:
.
|--- 261/
|--- 271/
|--- 266/
|--- 259/
|--- 260/
| |--- 267
|
|--- 264/
| |--- 258/
| |--- 257/
|
|--- 265/
|--- 268/
|--- 269/
| |--- 262/
|
|--- 270/
|--- 272/
| |--- 263/
| |--- 275/
|
|--- 274/
|--- 273/
Send snapshot:
.
|-- 275/
|-- 274/
|-- 273/
|-- 262/
|-- 269/
|-- 258/
|-- 271/
|-- 268/
|-- 267/
|-- 270/
|-- 259/
| |-- 265/
|
|-- 272/
|-- 257/
|-- 260/
|-- 264/
|-- 263/
|-- 261/
|-- 266/
When processing inode 257 we delay its move (rename) operation because its
new parent in the send snapshot, inode 272, was not yet processed. Then
when processing inode 272, we delay the move operation for that inode
because inode 274 is its ancestor in the send snapshot. Finally we delay
the move operation for inode 274 when processing it because inode 275 is
its new parent in the send snapshot and was not yet moved.
When finishing processing inode 275, we start to do the move operations
that were previously delayed (at apply_children_dir_moves()), resulting in
the following iterations:
1) We issue the move operation for inode 274;
2) Because inode 262 depended on the move operation of inode 274 (it was
delayed because 274 is its ancestor in the send snapshot), we issue the
move operation for inode 262;
3) We issue the move operation for inode 272, because it was delayed by
inode 274 too (ancestor of 272 in the send snapshot);
4) We issue the move operation for inode 269 (it was delayed by 262);
5) We issue the move operation for inode 257 (it was delayed by 272);
6) We issue the move operation for inode 260 (it was delayed by 272);
7) We issue the move operation for inode 258 (it was delayed by 269);
8) We issue the move operation for inode 264 (it was delayed by 257);
9) We issue the move operation for inode 271 (it was delayed by 258);
10) We issue the move operation for inode 263 (it was delayed by 264);
11) We issue the move operation for inode 268 (it was delayed by 271);
12) We verify if we can issue the move operation for inode 270 (it was
delayed by 271). We detect a path loop in the current state, because
inode 267 needs to be moved first before we can issue the move
operation for inode 270. So we delay again the move operation for
inode 270, this time we will attempt to do it after inode 267 is
moved;
13) We issue the move operation for inode 261 (it was delayed by 263);
14) We verify if we can issue the move operation for inode 266 (it was
delayed by 263). We detect a path loop in the current state, because
inode 270 needs to be moved first before we can issue the move
operation for inode 266. So we delay again the move operation for
inode 266, this time we will attempt to do it after inode 270 is
moved (its move operation was delayed in step 12);
15) We issue the move operation for inode 267 (it was delayed by 268);
16) We verify if we can issue the move operation for inode 266 (it was
delayed by 270). We detect a path loop in the current state, because
inode 270 needs to be moved first before we can issue the move
operation for inode 266. So we delay again the move operation for
inode 266, this time we will attempt to do it after inode 270 is
moved (its move operation was delayed in step 12). So here we added
again the same delayed move operation that we added in step 14;
17) We attempt again to see if we can issue the move operation for inode
266, and as in step 16, we realize we can not due to a path loop in
the current state due to a dependency on inode 270. Again we delay
inode's 266 rename to happen after inode's 270 move operation, adding
the same dependency to the empty stack that we did in steps 14 and 16.
The next iteration will pick the same move dependency on the stack
(the only entry) and realize again there is still a path loop and then
again the same dependency to the stack, over and over, resulting in
an infinite loop.
So fix this by preventing adding the same move dependency entries to the
stack by removing each pending move record from the red black tree of
pending moves. This way the next call to get_pending_dir_moves() will
not return anything for the current parent inode.
A test case for fstests, with this reproducer, follows soon.
Signed-off-by: Robbie Ko <robbieko@synology.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
[Wrote changelog with example and more clear explanation]
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When decoding a lower file handle, we first call ovl_check_origin_fh()
with connected=false to get any real lower dentry for overlay inode
cache lookup.
If the real dentry is a disconnected dir dentry, ovl_check_origin_fh()
is called again with connected=true to get a connected real dentry
and find the lower layer the real dentry belongs to.
If the first call returned a connected real dentry, we use it to
lookup an overlay connected dentry, but the first ovl_check_origin_fh()
call with connected=false did not check that the found dentry is under
the root of the layer (see ovl_acceptable()), it only checked that
the found dentry super block matches the uuid of the lower file handle.
In case there are multiple lower layers on the same fs and the found
dentry is not from the top most lower layer, using the layer index
returned from the first ovl_check_origin_fh() is wrong and we end
up failing to decode the file handle.
Fix this by always calling ovl_check_origin_fh() with connected=true
if we got a directory dentry in the first call.
Fixes: 8b58924ad5 ("ovl: lookup in inode cache first when decoding...")
Cc: <stable@vger.kernel.org> # v4.17
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
The extent shifting code uses a flush and invalidate mechainsm prior
to shifting extents around. This is similar to what
xfs_free_file_space() does, but it doesn't take into account things
like page cache vs block size differences, and it will fail if there
is a page that it currently busy.
xfs_flush_unmap_range() handles all of these cases, so just convert
xfs_prepare_shift() to us that mechanism rather than having it's own
special sauce.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The last AG may be very small comapred to all other AGs, and hence
AG reservations based on the superblock AG size may actually consume
more space than the AG actually has. This results on assert failures
like:
XFS: Assertion failed: xfs_perag_resv(pag, XFS_AG_RESV_METADATA)->ar_reserved + xfs_perag_resv(pag, XFS_AG_RESV_RMAPBT)->ar_reserved <= pag->pagf_freeblks + pag->pagf_flcount, file: fs/xfs/libxfs/xfs_ag_resv.c, line: 319
[ 48.932891] xfs_ag_resv_init+0x1bd/0x1d0
[ 48.933853] xfs_fs_reserve_ag_blocks+0x37/0xb0
[ 48.934939] xfs_mountfs+0x5b3/0x920
[ 48.935804] xfs_fs_fill_super+0x462/0x640
[ 48.936784] ? xfs_test_remount_options+0x60/0x60
[ 48.937908] mount_bdev+0x178/0x1b0
[ 48.938751] mount_fs+0x36/0x170
[ 48.939533] vfs_kern_mount.part.43+0x54/0x130
[ 48.940596] do_mount+0x20e/0xcb0
[ 48.941396] ? memdup_user+0x3e/0x70
[ 48.942249] ksys_mount+0xba/0xd0
[ 48.943046] __x64_sys_mount+0x21/0x30
[ 48.943953] do_syscall_64+0x54/0x170
[ 48.944835] entry_SYSCALL_64_after_hwframe+0x49/0xbe
Hence we need to ensure the finobt per-ag space reservations take
into account the size of the last AG rather than treat it like all
the other full size AGs.
Note that both refcountbt and rmapbt already take the size of the AG
into account via reading the AGF length directly.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
When retrying a failed inode or dquot buffer,
xfs_buf_resubmit_failed_buffers() clears all the failed flags from
the inde/dquot log items. In doing so, it also drops all the
reference counts on the buffer that the failed log items hold. This
means it can drop all the active references on the buffer and hence
free the buffer before it queues it for write again.
Putting the buffer on the delwri queue takes a reference to the
buffer (so that it hangs around until it has been written and
completed), but this goes bang if the buffer has already been freed.
Hence we need to add the buffer to the delwri queue before we remove
the failed flags from the log items attached to the buffer to ensure
it always remains referenced during the resubmit process.
Reported-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
'shash' algorithms are always synchronous, so passing CRYPTO_ALG_ASYNC
in the mask to crypto_alloc_shash() has no effect. Many users therefore
already don't pass it, but some still do. This inconsistency can cause
confusion, especially since the way the 'mask' argument works is
somewhat counterintuitive.
Thus, just remove the unneeded CRYPTO_ALG_ASYNC flags.
This patch shouldn't change any actual behavior.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
For cases when the application does not specify aio_reqprio for an aio,
fallback to use get_current_ioprio() to obtain the task I/O priority
last set using ioprio_set() rather than the hardcoded IOPRIO_CLASS_NONE
value.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Adam Manzanares <adam.manzanares@wdc.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Fix a deadlock whereby the NFSv4 state manager can get stuck in the
delegation return code, waiting for a layout return to complete in
another thread. If the server reboots before that other thread
completes, then we need to be able to start a second state
manager thread in order to perform recovery.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
xfs_file_remap_range() is only used in fs/xfs/xfs_file.c, so make it
static.
This addresses a gcc warning when -Wmissing-prototypes is enabled.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Page writeback indirectly handles shared extents via the existence
of overlapping COW fork blocks. If COW fork blocks exist, writeback
always performs the associated copy-on-write regardless if the
underlying blocks are actually shared. If the blocks are shared,
then overlapping COW fork blocks must always exist.
fstests shared/010 reproduces a case where a buffered write occurs
over a shared block without performing the requisite COW fork
reservation. This ultimately causes writeback to the shared extent
and data corruption that is detected across md5 checks of the
filesystem across a mount cycle.
The problem occurs when a buffered write lands over a shared extent
that crosses an extent size hint boundary and that also happens to
have a partial COW reservation that doesn't cover the start and end
blocks of the data fork extent.
For example, a buffered write occurs across the file offset (in FSB
units) range of [29, 57]. A shared extent exists at blocks [29, 35]
and COW reservation already exists at blocks [32, 34]. After
accommodating a COW extent size hint of 32 blocks and the existing
reservation at offset 32, xfs_reflink_reserve_cow() allocates 32
blocks of reservation at offset 0 and returns with COW reservation
across the range of [0, 34]. The associated data fork extent is
still [29, 35], however, which isn't fully covered by the COW
reservation.
This leads to a buffered write at file offset 35 over a shared
extent without associated COW reservation. Writeback eventually
kicks in, performs an overwrite of the underlying shared block and
causes the associated data corruption.
Update xfs_reflink_reserve_cow() to accommodate the fact that a
delalloc allocation request may not fully cover the extent in the
data fork. Trim the data fork extent appropriately, just as is done
for shared extent boundaries and/or existing COW reservations that
happen to overlap the start of the data fork extent. This prevents
shared/010 failures due to data corruption on reflink enabled
filesystems.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Pull networking fixes from David Miller:
1) Fix some potentially uninitialized variables and use-after-free in
kvaser_usb can drier, from Jimmy Assarsson.
2) Fix leaks in qed driver, from Denis Bolotin.
3) Socket leak in l2tp, from Xin Long.
4) RSS context allocation fix in bnxt_en from Michael Chan.
5) Fix cxgb4 build errors, from Ganesh Goudar.
6) Route leaks in ipv6 when removing exceptions, from Xin Long.
7) Memory leak in IDR allocation handling of act_pedit, from Davide
Caratti.
8) Use-after-free of bridge vlan stats, from Nikolay Aleksandrov.
9) When MTU is locked, do not force DF bit on ipv4 tunnels. From
Sabrina Dubroca.
10) When NAPI cached skb is reused, we must set it to the proper initial
state which includes skb->pkt_type. From Eric Dumazet.
11) Lockdep and non-linear SKB handling fix in tipc from Jon Maloy.
12) Set RX queue properly in various tuntap receive paths, from Matthew
Cover.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (61 commits)
tuntap: fix multiqueue rx
ipv6: Fix PMTU updates for UDP/raw sockets in presence of VRF
tipc: don't assume linear buffer when reading ancillary data
tipc: fix lockdep warning when reinitilaizing sockets
net-gro: reset skb->pkt_type in napi_reuse_skb()
tc-testing: tdc.py: Guard against lack of returncode in executed command
tc-testing: tdc.py: ignore errors when decoding stdout/stderr
ip_tunnel: don't force DF when MTU is locked
MAINTAINERS: Add entry for CAKE qdisc
net: bridge: fix vlan stats use-after-free on destruction
socket: do a generic_file_splice_read when proto_ops has no splice_read
net: phy: mdio-gpio: Fix working over slow can_sleep GPIOs
Revert "net: phy: mdio-gpio: Fix working over slow can_sleep GPIOs"
net: phy: mdio-gpio: Fix working over slow can_sleep GPIOs
net/sched: act_pedit: fix memory leak when IDR allocation fails
net: lantiq: Fix returned value in case of error in 'xrx200_probe()'
ipv6: fix a dst leak when removing its exception
net: mvneta: Don't advertise 2.5G modes
drivers/net/ethernet/qlogic/qed/qed_rdma.h: fix typo
net/mlx4: Fix UBSAN warning of signed integer overflow
...
For the core poll helper, the task state setting don't need to imply any
atomics, as it's the current task itself that is being modified and
we're not going to sleep.
For IRQ driven, the wakeup path have the necessary barriers to not need
us using the heavy handed version of the task state setting.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Theodore Ts'o reported a v4.19 regression with docker-dropbox:
https://marc.info/?l=linux-fsdevel&m=154070089431116&w=2
"I was rebuilding my dropbox Docker container, and it failed in 4.19
with the following error:
...
dpkg: error: error creating new backup file \
'/var/lib/dpkg/status-old': Invalid cross-device link"
The problem did not reproduce with metacopy feature disabled.
The error was caused by insufficient credentials to set
"trusted.overlay.redirect" xattr on link of a metacopy file.
Reproducer:
echo Y > /sys/module/overlay/parameters/redirect_dir
echo Y > /sys/module/overlay/parameters/metacopy
cd /tmp
mkdir l u w m
chmod 777 l u
touch l/foo
ln l/foo l/link
chmod 666 l/foo
mount -t overlay none -olowerdir=l,upperdir=u,workdir=w m
su fsgqa
ln m/foo m/bar
[ 21.455823] overlayfs: failed to set redirect (-1)
ln: failed to create hard link 'm/bar' => 'm/foo':\
Invalid cross-device link
Reported-by: Theodore Y. Ts'o <tytso@mit.edu>
Reported-by: Maciej Zięba <maciekz82@gmail.com>
Fixes: 4120fe64dc ("ovl: Set redirect on upper inode when it is linked")
Cc: <stable@vger.kernel.org> # v4.19
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
After calling get_unlocked_entry(), you have to call
put_unlocked_entry() to avoid subsequent waiters losing wakeups.
Fixes: c2a7d2a115 ("filesystem-dax: Introduce dax_lock_mapping_entry()")
Cc: stable@vger.kernel.org
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Suspend fails due to the exec family of functions blocking the freezer.
The casue is that de_thread() sleeps in TASK_UNINTERRUPTIBLE waiting for
all sub-threads to die, and we have the deadlock if one of them is frozen.
This also can occur with the schedule() waiting for the group thread leader
to exit if it is frozen.
In our machine, it causes freeze timeout as bellows.
Freezing of tasks failed after 20.010 seconds (1 tasks refusing to freeze, wq_busy=0):
setcpushares-ls D ffffffc00008ed70 0 5817 1483 0x0040000d
Call trace:
[<ffffffc00008ed70>] __switch_to+0x88/0xa0
[<ffffffc000d1c30c>] __schedule+0x1bc/0x720
[<ffffffc000d1ca90>] schedule+0x40/0xa8
[<ffffffc0001cd784>] flush_old_exec+0xdc/0x640
[<ffffffc000220360>] load_elf_binary+0x2a8/0x1090
[<ffffffc0001ccff4>] search_binary_handler+0x9c/0x240
[<ffffffc00021c584>] load_script+0x20c/0x228
[<ffffffc0001ccff4>] search_binary_handler+0x9c/0x240
[<ffffffc0001ce8e0>] do_execveat_common.isra.14+0x4f8/0x6e8
[<ffffffc0001cedd0>] compat_SyS_execve+0x38/0x48
[<ffffffc00008de30>] el0_svc_naked+0x24/0x28
To fix this, make de_thread() freezable. It looks safe and works fine.
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Chanho Min <chanho.min@lge.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Pavel Machek <pavel@ucw.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Commit c26f6c6157 ("udf: Fix conversion of 'dstring' fields to UTF8")
started to be more strict when checking whether converted strings are
properly formatted. Sudip reports that there are DVDs where the volume
identification string is actually too long - UDF reports:
[ 632.309320] UDF-fs: incorrect dstring lengths (32/32)
during mount and fails the mount. This is mostly harmless failure as we
don't need volume identification (and even less volume set
identification) for anything. So just truncate the volume identification
string if it is too long and replace it with 'Invalid' if we just cannot
convert it for other reasons. This keeps slightly incorrect media still
mountable.
CC: stable@vger.kernel.org
Fixes: c26f6c6157 ("udf: Fix conversion of 'dstring' fields to UTF8")
Reported-and-tested-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Fix a static code checker warning:
fs/exportfs/expfs.c:171 reconnect_one() warn: passing zero to 'ERR_PTR'
The error path for lookup_one_len_unlocked failure
should set err to PTR_ERR.
Fixes: bbf7a8a356 ("exportfs: move most of reconnect_path to helper function")
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-----BEGIN PGP SIGNATURE-----
iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAlvx2sAeHHRvcnZhbGRz
QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGycgIAIuxobwt0RRKa0zO
ROS+34JGoC2yU2P9VdEGWdtxS6ANMVQgKPBhWL6s+xR89Kd+V4xSdJLD1pNTxxqP
0DCva0np1/Q4juH+JbU50v/lykoLgteZ0P0LBRGf1y8p3WiLPv45IbnNsMDNYhB2
7a8rOmZYakRY9CPznRDw3X8cJt3sddKgFJHIOGz1OQJVWtCD0KPGcJmQNsbDSagY
Zx6Z5BKSIdjRqaAdN5gDa1Pft3WQo7TpaQGl80lSsgr5LcjmscXA3sClOCy+25Mo
FZLx0PcwP+Efq8RTGzNK51WSOMa6d37hvjDqUAdQBOR0KbyjRyXQwyQVw/MGbPJs
7J3Pzm0=
=56Mt
-----END PGP SIGNATURE-----
Merge tag 'v4.20-rc3' into for-4.21/block
Merge in -rc3 to resolve a few conflicts, but also to get a few
important fixes that have gone into mainline since the block
4.21 branch was forked off (most notably the SCSI queue issue,
which is both a conflict AND needed fix).
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Convert string compares of DT node names to use of_node_name_eq helper
instead. This removes direct access to the node name pointer.
Signed-off-by: Rob Herring <robh@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
In preparation to remove struct device_node.path_component_name, use
full_name instead. kbasename is used so full_name can be used whether it
is the full path or just the node's name and unit-address.
Cc: "David S. Miller" <davem@davemloft.net>
Cc: sparclinux@vger.kernel.org
Signed-off-by: Rob Herring <robh@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The write context should also be freed even when direct IO failed.
Otherwise a memory leak is introduced and entries remain in
oi->ip_unwritten_list causing the following BUG later in unlink path:
ERROR: bug expression: !list_empty(&oi->ip_unwritten_list)
ERROR: Clear inode of 215043, inode has unwritten extents
...
Call Trace:
? __set_current_blocked+0x42/0x68
ocfs2_evict_inode+0x91/0x6a0 [ocfs2]
? bit_waitqueue+0x40/0x33
evict+0xdb/0x1af
iput+0x1a2/0x1f7
do_unlinkat+0x194/0x28f
SyS_unlinkat+0x1b/0x2f
do_syscall_64+0x79/0x1ae
entry_SYSCALL_64_after_hwframe+0x151/0x0
This patch also logs, with frequency limit, direct IO failures.
Link: http://lkml.kernel.org/r/20181102170632.25921-1-wen.gang.wang@oracle.com
Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Changwei Ge <ge.changwei@h3c.com>
Reviewed-by: Joseph Qi <jiangqi903@gmail.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Spock reported that commit 172b06c32b ("mm: slowly shrink slabs with a
relatively small number of objects") leads to a regression on his setup:
periodically the majority of the pagecache is evicted without an obvious
reason, while before the change the amount of free memory was balancing
around the watermark.
The reason behind is that the mentioned above change created some
minimal background pressure on the inode cache. The problem is that if
an inode is considered to be reclaimed, all belonging pagecache page are
stripped, no matter how many of them are there. So, if a huge
multi-gigabyte file is cached in the memory, and the goal is to reclaim
only few slab objects (unused inodes), we still can eventually evict all
gigabytes of the pagecache at once.
The workload described by Spock has few large non-mapped files in the
pagecache, so it's especially noticeable.
To solve the problem let's postpone the reclaim of inodes, which have
more than 1 attached page. Let's wait until the pagecache pages will be
evicted naturally by scanning the corresponding LRU lists, and only then
reclaim the inode structure.
Link: http://lkml.kernel.org/r/20181023164302.20436-1-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Reported-by: Spock <dairinin@gmail.com>
Tested-by: Spock <dairinin@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: <stable@vger.kernel.org> [4.19.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Using xas_load() with a PMD-sized xa_state would work if either a
PMD-sized entry was present or a PTE sized entry was present in the
first 64 entries (of the 512 PTEs in a PMD on x86). If there was no
PTE in the first 64 entries, grab_mapping_entry() would believe there
were no entries present, allocate a PMD-sized entry and overwrite the
PTE in the page cache.
Use xas_find_conflict() instead which turns out to simplify
both get_unlocked_entry() and grab_mapping_entry(). Also remove a
WARN_ON_ONCE from grab_mapping_entry() as it will have already triggered
in get_unlocked_entry().
Fixes: cfc93c6c6c ("dax: Convert dax_insert_pfn_mkwrite to XArray")
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Device DAX PMD pages do not set the PageHead bit for compound pages.
Fix for now by retrieving the PMD bit from the entry, but eventually we
will be passed the page size by the caller.
Reported-by: Dan Williams <dan.j.williams@intel.com>
Fixes: 9f32d22130 ("dax: Convert dax_lock_mapping_entry to XArray")
Signed-off-by: Matthew Wilcox <willy@infradead.org>
If the ioprio capability check fails, we return without putting
the file pointer.
Fixes: d9a08a9e61 ("fs: Add aio iopriority support")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
For the device-dax case, it is possible that the inode can go away
underneath us. The rcu_read_lock() was there to prevent it from
being freed, and not (as I thought) to protect the tree. Bring back
the rcu_read_lock() protection. Also add a little kernel-doc; while
this function is not exported to modules, it is used from outside dax.c
Reported-by: Dan Williams <dan.j.williams@intel.com>
Fixes: 9f32d22130 ("dax: Convert dax_lock_mapping_entry to XArray")
Signed-off-by: Matthew Wilcox <willy@infradead.org>
I wrote the semantics in the commit message, but didn't document it in
the source code. Use a BUG_ON instead (if any code does do this, it's
really buggy; we can't recover and it's worth taking the machine down).
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Skipping some of the revalidation after we sleep can lead to returning
a mapping which has already been freed. Just drop this optimisation.
Reported-by: Dan Williams <dan.j.williams@intel.com>
Fixes: 9f32d22130 ("dax: Convert dax_lock_mapping_entry to XArray")
Signed-off-by: Matthew Wilcox <willy@infradead.org>
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAlvukbUACgkQnJ2qBz9k
QNkJnggAriAyO72fH7nfh2AzvwnkqLi9uE4JQk+ZxQFW9W6a32Y+qIRBsaiau++w
/lOVSGLCCVewzAa4b/lYPdyOS/yG4w6UuEsg5HQPux4JsOccalYudB5ONdeZGd/P
5FPxcsixDQQHP9mhz1gIn2wV7bbytH2DduIHxuKbJfX//xnO3jS/gnJq82rnxb4R
IzqewVjbFiqYQKtyuzIqKqI3yOVqESJhyQ96j6chj5YKNUkMuwU9akhQGR4Zcqen
/cxh+mK+ZzCybgu2Sfgx3noZb5RFYePpcSSpDG5OKJc04IY4m76fyHIi2flBE9AX
HCN60y+VyKFOiNxzsSLXgsE9aj40OA==
=UcJO
-----END PGP SIGNATURE-----
Merge tag 'fsnotify_for_v4.20-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull fsnotify fix from Jan Kara:
"One small fsnotify fix for duplicate events"
* tag 'fsnotify_for_v4.20-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
fanotify: fix handling of events on child sub-directory
Pull bfs2 fixes from Andreas Gruenbacher:
"Fix two bugs leading to leaked buffer head references:
- gfs2: Put bitmap buffers in put_super
- gfs2: Fix iomap buffer head reference counting bug
And one bug leading to significant slow-downs when deleting large
files:
- gfs2: Fix metadata read-ahead during truncate (2)"
* tag 'gfs2-4.20.fixes3' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
gfs2: Fix iomap buffer head reference counting bug
gfs2: Fix metadata read-ahead during truncate (2)
gfs2: Put bitmap buffers in put_super
GFS2 passes the inode buffer head (dibh) from gfs2_iomap_begin to
gfs2_iomap_end in iomap->private. It sets that private pointer in
gfs2_iomap_get. Users of gfs2_iomap_get other than gfs2_iomap_begin
would have to release iomap->private, but this isn't done correctly,
leading to a leak of buffer head references.
To fix this, move the code for setting iomap->private from
gfs2_iomap_get to gfs2_iomap_begin.
Fixes: 64bc06bb32 ("gfs2: iomap buffered write support")
Cc: stable@vger.kernel.org # v4.19+
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Those will go straight to issue inside blk-mq, so don't bother
setting up a block plug for them.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Inherit the iocb IOCB_HIPRI flag, and pass on REQ_HIPRI for
those kinds of requests.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If we're polling for IO on a device that doesn't use interrupts, then
IO completion loop (and wake of task) is done by submitting task itself.
If that is the case, then we don't need to enter the wake_up_process()
function, we can simply mark ourselves as TASK_RUNNING.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iHQEABYIAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCW+3F5wAKCRDh3BK/laaZ
PIFHAPi8jN+wXE24tSEx/oJzCVAXP5tw8IxbiyiEdkvBZeyuAP9nkx0GoUtZodWN
c+JEjDJnyYQhM28VoKauDoWn+dxPDw==
=QvAu
-----END PGP SIGNATURE-----
Merge tag 'fuse-fixes-4.20-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
Pull fuse fixes from Miklos Szeredi:
"A couple of fixes, all bound for -stable (i.e. not regressions in this
cycle)"
* tag 'fuse-fixes-4.20-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
fuse: fix use-after-free in fuse_direct_IO()
fuse: fix possibly missed wake-up after abort
fuse: fix leaked notify reply
The life-checking function, which is used by kAFS to make sure that a call
is still live in the event of a pending signal, only samples the received
packet serial number counter; it doesn't actually provoke a change in the
counter, rather relying on the server to happen to give us a packet in the
time window.
Fix this by adding a function to force a ping to be transmitted.
kAFS then keeps track of whether there's been a stall, and if so, uses the
new function to ping the server, resetting the timeout to allow the reply
to come back.
If there's a stall, a ping and the call is *still* stalled in the same
place after another period, then the call will be aborted.
Fixes: bc5e3a546d ("rxrpc: Use MSG_WAITALL to tell sendmsg() to temporarily ignore signals")
Fixes: f4d15fb6f9 ("rxrpc: Provide functions for allowing cleaner handling of signals")
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Highlights include:
Stable fixes:
- Don't exit the NFSv4 state manager without clearing NFS4CLNT_MANAGER_RUNNING
Bugfixes:
- Fix an Oops when destroying the RPCSEC_GSS credential cache
- Fix an Oops during delegation callbacks
- Ensure that the NFSv4 state manager exits the loop on SIGKILL
- Fix a bogus get/put in generic_key_to_expire()
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJb7Lh0AAoJEA4mA3inWBJc8uAQAIkrGChs3AFuEQ3G3H9RlxDX
WFsPghRGmDwXf2sD+nWjl0r60v0v5fQaUhW/7EPe2kbVTF/rnjieXNeFOw33ZMFk
MDq03nL1/I25DoNK/qg5GZ2NIltZ9oKKbwaN+0LxXKz69X5qIXYnDzYPHDR/PNTg
Go7PvG8rU31Wd67E2pquwC6zZ6rCPf2BtQjZdzouLAEUWXAMHyJmszpFUxhLMJoz
k6dZouphj8fkMse3cfKLnGDqbQ2bE6+Yb0B6Hi0p5nShYgZTaQNZ9KxrEJF7J05i
cxH6IvLEawEMWXYzGEwr1LUDDrpwveuNTt/OroTgOcSsVpZx1DE0sOZkQ4pt/uTe
c5NzZYKjEOb2DWxoGR2GEDkRasKVBkWvR5MegvyDgyAcXkAjN/6CgYXiniNYDxl6
qk7sIqkJfug7fv+VW5YHwORKnvRIEDlFcwy5yZ0ij/Qa0dqUR3aczINGLwS6kcfn
u7M42UR17FUo2zaI9pZhuijwntbtkXMIETWHGRctK7Mum6u37QSVySNCO2A4knBE
jEy+oYPFCIUqH+ESpNp73otrVt1CTexScIJNsEi1naLmOhjQRW7YjUPEH1Xjg0Ss
OGyqIjOf6ToF6ma39/XZI9miJe08k6x8b0aGUdG29Cko9UvjLH86ODEausSRAyFA
OyZFFuHHAau5FGpNvZfj
=AstN
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-4.20-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client bugfixes from Trond Myklebust:
"Highlights include:
Stable fixes:
- Don't exit the NFSv4 state manager without clearing
NFS4CLNT_MANAGER_RUNNING
Bugfixes:
- Fix an Oops when destroying the RPCSEC_GSS credential cache
- Fix an Oops during delegation callbacks
- Ensure that the NFSv4 state manager exits the loop on SIGKILL
- Fix a bogus get/put in generic_key_to_expire()"
* tag 'nfs-for-4.20-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
NFSv4: Fix an Oops during delegation callbacks
SUNRPC: Fix a bogus get/put in generic_key_to_expire()
SUNRPC: Fix a Oops when destroying the RPCSEC_GSS credential cache
NFSv4: Ensure that the state manager exits the loop on SIGKILL
NFSv4: Don't exit the state manager without clearing NFS4CLNT_MANAGER_RUNNING
inotify_show_fdinfo() is defined in fs/notify/fdinfo.c and declared in
fs/notify/fdinfo.h, but the declaration isn't included at the point of
the definition. Include the header to enforce that the definition
matches the declaration.
This addresses a gcc warning when -Wmissing-prototypes is enabled.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Reusable parameter of mb_cache_entry_create() is bool type,
so it's better to set true instead of 1.
Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Signed-off-by: Jan Kara <jack@suse.cz>
According to comment in dlm_user_request() ua should be freed
in dlm_free_lkb() after successful attach to lkb.
However ua is attached to lkb not in set_lock_args() but later,
inside request_lock().
Fixes 597d0cae0f ("[DLM] dlm: user locks")
Cc: stable@kernel.org # 2.6.19
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: David Teigland <teigland@redhat.com>
If allocation fails on last elements of array need to free already
allocated elements.
v2: just move existing out_rsbtbl label to right place
Fixes 789924ba635f ("dlm: fix race between remove and lookup")
Cc: stable@kernel.org # 3.6
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: David Teigland <teigland@redhat.com>
effort to hit, which might explain why they weren't found sooner.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJb6zwdAAoJECebzXlCjuG+SLMP/AlpI+vPV7DdCLRWGCY1ZMjk
5pxIS+74mD2EopBYgZY58L1fxWgv2bLOiAs/baAlNpkjTNX3wlxXGTu9IzVdPOn7
3n+W2Rb+mXEFaag7mP8RFpOvt7Yb3p4DObGpg7TKWJZ6r/8xcxQWQO+e0iiS5+XK
EOiaFcGmYlOC1JtrRIL2fr16trXUhT1gz7qAZgKBzebbEdn4FfdsdwHm7nUyRB3I
LhCMV35RfzOBC2C/kQzlHaHYlo0dx5lKMtVzvtgMdpgXr4QXE/7Ke/ANQ7oGfhhO
9uX0Uf18HmeGRejK9QoMha7VWuwh5pyHBq0ppMpGL2jb11BD/l9iXgS+vTxpA2B0
YIiSOnaiDFsEk6hMsFqueVIdaTrarcjg/S2mh2QDjtkXKS3L0W6/7v97JJHu9J4l
6zxiT6Crq2p8pMZ5gY3RI1AYllW/K+TRoccLhO+q19g3q1HWxP6DyeFBgNF66/ha
NtmQP+94IkaCS70zirpEu/OeUMviQgX2x77OReyibHLA4+R+hNHwtR67BLl+xG0G
jmKHfqqX7offFaHmsoD8kK3gpKtit0/py9Hp7gXQg4vU5iL512gI83ICEOEkZMXn
Ppsrl1HyoO/ohY/USpMvRqYHjM1ZGew19ZzD7SId6vUVaYjQIEsjQVnycK3h+gSb
otk5pc3bWPCwa8csOWPs
=+1Ub
-----END PGP SIGNATURE-----
Merge tag 'nfsd-4.20-1' of git://linux-nfs.org/~bfields/linux
Pull nfsd fixes from Bruce Fields:
"Three nfsd bugfixes.
None are new bugs, but they all take a little effort to hit, which
might explain why they weren't found sooner"
* tag 'nfsd-4.20-1' of git://linux-nfs.org/~bfields/linux:
SUNRPC: drop pointless static qualifier in xdr_get_next_encode_buffer()
nfsd: COPY and CLONE operations require the saved filehandle to be set
sunrpc: correct the computation for page_ptr when truncating
Pull namespace fix from Eric Biederman:
"Benjamin Coddington noticed an unkillable busy loop in the kernel that
anyone who is sufficiently motivated can trigger. This bug did not
exist in earlier kernels making this bug a regression.
I have tested the change personally and confirmed that the bug exists
and that the fix works. This fix has been picked up by linux-next and
hopefully the automated testing bots and no problems have been
reported from those sources.
Ordinarily I would let something like this sit a little longer but I
am going to be away at Linux Plumbers the rest of this week and I am
afraid if I don't send the pull request now this fix will get lost"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
mnt: fix __detach_mounts infinite loop
We were using the path name received from user space without checking that
it is null terminated. While btrfs-progs is well behaved and does proper
validation and null termination, someone could call the ioctl and pass
a non-null terminated patch, leading to buffer overrun problems in the
kernel. The ioctl is protected by CAP_SYS_ADMIN.
So just set the last byte of the path to a null character, similar to what
we do in other ioctls (add/remove/resize device, snapshot creation, etc).
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
ext2_xattr_destroy_cache() can handle NULL pointer correctly,
so there is no need to check NULL pointer before calling
ext2_xattr_destroy_cache().
Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Signed-off-by: Jan Kara <jack@suse.cz>
If the server sends a CB_GETATTR or a CB_RECALL while the filesystem is
being unmounted, then we can Oops when releasing the inode in
nfs4_callback_getattr() and nfs4_callback_recall().
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Technically dlm_config_nodes() could return error and keep nodes
uninitialized. After that on the fail path of we'll call kfree()
for that uninitialized value.
The patch is simple - we should just initialize nodes with NULL.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David Teigland <teigland@redhat.com>
A new event mask FAN_OPEN_EXEC_PERM has been defined. This allows users
to receive events and grant access to files that are intending to be
opened for execution. Events of FAN_OPEN_EXEC_PERM type will be
generated when a file has been opened by using either execve(),
execveat() or uselib() system calls.
This acts in the same manner as previous permission event mask, meaning
that an access response is required from the user application in order
to permit any further operations on the file.
Signed-off-by: Matthew Bobrowski <mbobrowski@mbobrowski.org>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
A new event mask FAN_OPEN_EXEC has been defined so that users have the
ability to receive events specifically when a file has been opened with
the intent to be executed. Events of FAN_OPEN_EXEC type will be
generated when a file has been opened using either execve(), execveat()
or uselib() system calls.
The feature is implemented within fsnotify_open() by generating the
FAN_OPEN_EXEC event type if __FMODE_EXEC is set within file->f_flags.
Signed-off-by: Matthew Bobrowski <mbobrowski@mbobrowski.org>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Modify fanotify_should_send_event() so that it now returns a mask for
an event that contains ONLY flags for the event types that have been
specifically requested by the user. Flags that may have been included
within the event mask, but have not been explicitly requested by the
user will not be present in the returned value.
As an example, given the situation where a user requests events of type
FAN_OPEN. Traditionally, the event mask returned within an event that
occurred on a filesystem object that has been marked for monitoring and is
opened, will only ever have the FAN_OPEN bit set. With the introduction of
the new flags like FAN_OPEN_EXEC, and perhaps any other future event
flags, there is a possibility of the returned event mask containing more
than a single bit set, despite having only requested the single event type.
Prior to these modifications performed to fanotify_should_send_event(), a
user would have received a bundled event mask containing flags FAN_OPEN
and FAN_OPEN_EXEC in the instance that a file was opened for execution via
execve(), for example. This means that a user would receive event types
in the returned event mask that have not been requested. This runs the
possibility of breaking existing systems and causing other unforeseen
issues.
To mitigate this possibility, fanotify_should_send_event() has been
modified to return the event mask containing ONLY event types explicitly
requested by the user. This means that we will NOT report events that the
user did no set a mask for, and we will NOT report events that the user
has set an ignore mask for.
The function name fanotify_should_send_event() has also been updated so
that it's more relevant to what it has been designed to do.
Signed-off-by: Matthew Bobrowski <mbobrowski@mbobrowski.org>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
After the simplification of the fast fsync patch done recently by commit
b5e6c3e170 ("btrfs: always wait on ordered extents at fsync time") and
commit e7175a6927 ("btrfs: remove the wait ordered logic in the
log_one_extent path"), we got a very short time window where we can get
extents logged without writeback completing first or extents logged
without logging the respective data checksums. Both issues can only happen
when doing a non-full (fast) fsync.
As soon as we enter btrfs_sync_file() we trigger writeback, then lock the
inode and then wait for the writeback to complete before starting to log
the inode. However before we acquire the inode's lock and after we started
writeback, it's possible that more writes happened and dirtied more pages.
If that happened and those pages get writeback triggered while we are
logging the inode (for example, the VM subsystem triggering it due to
memory pressure, or another concurrent fsync), we end up seeing the
respective extent maps in the inode's list of modified extents and will
log matching file extent items without waiting for the respective
ordered extents to complete, meaning that either of the following will
happen:
1) We log an extent after its writeback finishes but before its checksums
are added to the csum tree, leading to -EIO errors when attempting to
read the extent after a log replay.
2) We log an extent before its writeback finishes.
Therefore after the log replay we will have a file extent item pointing
to an unwritten extent (and without the respective data checksums as
well).
This could not happen before the fast fsync patch simplification, because
for any extent we found in the list of modified extents, we would wait for
its respective ordered extent to finish writeback or collect its checksums
for logging if it did not complete yet.
Fix this by triggering writeback again after acquiring the inode's lock
and before waiting for ordered extents to complete.
Fixes: e7175a6927 ("btrfs: remove the wait ordered logic in the log_one_extent path")
Fixes: b5e6c3e170 ("btrfs: always wait on ordered extents at fsync time")
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When a metadata read is served the endio routine btree_readpage_end_io_hook
is called which eventually runs the tree-checker. If tree-checker fails
to validate the read eb then it sets EXTENT_BUFFER_CORRUPT flag. This
leads to btree_read_extent_buffer_pages wrongly assuming that all
available copies of this extent buffer are wrong and failing prematurely.
Fix this modify btree_read_extent_buffer_pages to read all copies of
the data.
This failure was exhibitted in xfstests btrfs/124 which would
spuriously fail its balance operations. The reason was that when balance
was run following re-introduction of the missing raid1 disk
__btrfs_map_block would map the read request to stripe 0, which
corresponded to devid 2 (the disk which is being removed in the test):
item 2 key (FIRST_CHUNK_TREE CHUNK_ITEM 3553624064) itemoff 15975 itemsize 112
length 1073741824 owner 2 stripe_len 65536 type DATA|RAID1
io_align 65536 io_width 65536 sector_size 4096
num_stripes 2 sub_stripes 1
stripe 0 devid 2 offset 2156920832
dev_uuid 8466c350-ed0c-4c3b-b17d-6379b445d5c8
stripe 1 devid 1 offset 3553624064
dev_uuid 1265d8db-5596-477e-af03-df08eb38d2ca
This caused read requests for a checksum item that to be routed to the
stale disk which triggered the aforementioned logic involving
EXTENT_BUFFER_CORRUPT flag. This then triggered cascading failures of
the balance operation.
Fixes: a826d6dcb3 ("Btrfs: check items for correctness as we search")
CC: stable@vger.kernel.org # 4.4+
Suggested-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we exit the NFSv4 state manager due to a umount, then we can end up
leaving the NFS4CLNT_MANAGER_RUNNING flag set. If another mount causes
the nfs4_client to be rereferenced before it is destroyed, then we end
up never being able to recover state.
Fixes: 47c2199b6e ("NFSv4.1: Ensure state manager thread dies on last ...")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: stable@vger.kernel.org # v4.15+
lockdep_assert_held() is better suited to checking locking requirements,
since it only checks if the current thread holds the lock regardless of
whether someone else does. This is also a step towards possibly removing
spin_is_locked().
Signed-off-by: Lance Roy <ldr709@gmail.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: <linux-fsdevel@vger.kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
Since commit ff17fa561a ("d_invalidate(): unhash immediately")
immediately unhashes the dentry, we'll never return the mountpoint in
lookup_mountpoint(), which can lead to an unbreakable loop in
d_invalidate().
I have reports of NFS clients getting into this condition after the server
removes an export of an existing mount created through follow_automount(),
but I suspect there are various other ways to produce this problem if we
hunt down users of d_invalidate(). For example, it is possible to get into
this state by using XFS' d_invalidate() call in xfs_vn_unlink():
truncate -s 100m img{1,2}
mkfs.xfs -q -n version=ci img1
mkfs.xfs -q -n version=ci img2
mkdir -p /mnt/xfs
mount img1 /mnt/xfs
mkdir /mnt/xfs/sub1
mount img2 /mnt/xfs/sub1
cat > /mnt/xfs/sub1/foo &
umount -l /mnt/xfs/sub1
mount img2 /mnt/xfs/sub1
mount --make-private /mnt/xfs
mkdir /mnt/xfs/sub2
mount --move /mnt/xfs/sub1 /mnt/xfs/sub2
rmdir /mnt/xfs/sub1
Fix this by moving the check for an unlinked dentry out of the
detach_mounts() path.
Fixes: ff17fa561a ("d_invalidate(): unhash immediately")
Cc: stable@vger.kernel.org
Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAlvoGIUACgkQxWXV+ddt
WDta6g//UJSLnVskCUwh8VyMdd47QArQnaLJowOH7wQn4Nqj+2hf04mCq/kv05ed
OneTezzONZc/qW9fiJGS+Dp77ln4JIDA1hWHtb/A4t9pYlksSQllJ3oiDUVsCp3q
2EbzrjuNz3iQO6TjKlaHX473CLCMQMXS2OXOUnCkF2maMJSdr86oi+j1UiSnud1/
C7uMYM3hG8nkfEfjjb1COpkS2MmzYcPruF5RDcbT/WOUfylTsjjX1E7rK/ZEqS9P
SUcp4uoZe9BNoyWMASLaM7oHE82day4X9MwQoCQFRcm0kq4CnRAZ8X4lBl+M70iW
7Olii/wNZ2SRiJf3jac/rpxoBHvEskXTHyiHTEmdHp4n1L1pL9GzGYIePQcX7uV1
Tb6ImdUUKCC//fPqyeB7cEk5yxqahmlFD3qZVs6GnQkzKrPE+ChLx+7PgcJC/XVh
C5ogNmJm+NvFOuTrYk9zSXg85B8gWHescDJrvNKVizIjw3nKmqiC+dXZljhzw+p8
HscK9EXsiS8jW9ClfJljXzIa4SeA/i7fQGe4tCKfIrCQ+OqUxWpFCEoxygchinfF
Rw90fJ0jX083oXsnfFcVdQpQ+SLSKka/aIRMvi58WRgLU3trci5NNN4TFg8TYRKP
xBDF/iF3sqXajc+xsjoqLhLioZL3Pa5VDNuhsFdois9M5JSRekU=
=K14u
-----END PGP SIGNATURE-----
Merge tag 'for-4.20-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"Several fixes to recent release (4.19, fixes tagged for stable) and
other fixes"
* tag 'for-4.20-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
Btrfs: fix missing delayed iputs on unmount
Btrfs: fix data corruption due to cloning of eof block
Btrfs: fix infinite loop on inode eviction after deduplication of eof block
Btrfs: fix deadlock on tree root leaf when finding free extent
btrfs: avoid link error with CONFIG_NO_AUTO_INLINE
btrfs: tree-checker: Fix misleading group system information
Btrfs: fix missing data checksums after a ranged fsync (msync)
btrfs: fix pinned underflow after transaction aborted
Btrfs: fix cur_offset in the error case for nocow
error return cleanup paths.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAlvoFrEACgkQ8vlZVpUN
gaMTSQf+Ogrvm7pfWtXf+RkmhhuyR26T+Hwxgl51m5bKetJBjEsh0qOaIfo7etwG
aLc1x/pWng2VTCHk4z0Ij9KS8YwLK3sQCBYZoJFyT/R09yGgAhLm+xP5j38WLqrX
h4GxVgekHSATkG95N/So7F7pQiz7gDowgbaYFW3PooXPoHJnCnTzcr7TGFAQBZAw
iR+8+KtH5E8IcC7Jj40nemk7Wib45DgaeGpP5P9Ct/Jw7hW+Mwhf56NYOWkLdHyy
4Kt7rm1Sbxam8k3nksNmIwx28bw+S0Ew1zZgkwgAcKcHaWdrv3TtGPkOA26AH+S3
UVeORM7xH+zXslIOyFK+7sXUZr5LiQ==
=BaBl
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 fixes from Ted Ts'o:
"A large number of ext4 bug fixes, mostly buffer and memory leaks on
error return cleanup paths"
* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: missing !bh check in ext4_xattr_inode_write()
ext4: fix buffer leak in __ext4_read_dirblock() on error path
ext4: fix buffer leak in ext4_expand_extra_isize_ea() on error path
ext4: fix buffer leak in ext4_xattr_move_to_block() on error path
ext4: release bs.bh before re-using in ext4_xattr_block_find()
ext4: fix buffer leak in ext4_xattr_get_block() on error path
ext4: fix possible leak of s_journal_flag_rwsem in error path
ext4: fix possible leak of sbi->s_group_desc_leak in error path
ext4: remove unneeded brelse call in ext4_xattr_inode_update_ref()
ext4: avoid possible double brelse() in add_new_gdb() on error path
ext4: avoid buffer leak in ext4_orphan_add() after prior errors
ext4: avoid buffer leak on shutdown in ext4_mark_iloc_dirty()
ext4: fix possible inode leak in the retry loop of ext4_resize_fs()
ext4: fix missing cleanup if ext4_alloc_flex_bg_array() fails while resizing
ext4: add missing brelse() update_backups()'s error path
ext4: add missing brelse() add_new_gdb_meta_bg()'s error path
ext4: add missing brelse() in set_flexbg_block_bitmap()'s error path
ext4: avoid potential extra brelse in setup_new_flex_group_blocks()
Pull namespace fixes from Eric Biederman:
"I believe all of these are simple obviously correct bug fixes. These
fall into two groups:
- Fixing the implementation of MNT_LOCKED which prevents lesser
privileged users from seeing unders mounts created by more
privileged users.
- Fixing the extended uid and group mapping in user namespaces.
As well as ensuring the code looks correct I have spot tested these
changes as well and in my testing the fixes are working"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
mount: Prevent MNT_DETACH from disconnecting locked mounts
mount: Don't allow copying MNT_UNBINDABLE|MNT_LOCKED mounts
mount: Retest MNT_LOCKED in do_umount
userns: also map extents in the reverse map to kernel IDs
Fixes gcc '-Wunused-but-set-variable' warning:
fs/sysv/inode.c: In function '__sysv_write_inode':
fs/sysv/inode.c:239:6: warning:
variable 'err' set but not used [-Wunused-but-set-variable]
__sysv_write_inode should return 'err' instead of 0
Fixes: 05459ca81a ("repair sysv_write_inode(), switch sysv to simple_fsync()")
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
cleanup.
-----BEGIN PGP SIGNATURE-----
iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAlvluIATHGlkcnlvbW92
QGdtYWlsLmNvbQAKCRBKf944AhHzi/KDB/9ftmDVzZr8U9ubFIHfOKZQsxqElOAc
U/naOKU9PLZNsJkBRZNQMklS5OPAiWBPf/9bWTt+TV9jy8ljjt+Vnxmgqj8StqZY
da449b8uwDRWOY/3hzBNqDshmx3lWxI1+JIDcJPM2SkSASnBg6E1Usl0/xBp/a+r
dLLTUBJrxHMWtjXqclXk2iE1+Ehh5AMdqcwNKuqEJ3rg9OIt8PN/vDQN9dJk4kAX
4xwFoBY0WjUACf5r3+VhP/6yNxuLIIPKjygfkdYzc2LTDVXOr1SY5X0V26v6nN0K
UTqmn4g1uIrzaYCaPmzDYHKT7JYHQUPMMu9TaXkWt+MZxEsZ+z6r8QSU
=T9mn
-----END PGP SIGNATURE-----
Merge tag 'ceph-for-4.20-rc2' of https://github.com/ceph/ceph-client
Pull Ceph fixes from Ilya Dryomov:
"Two CephFS fixes (copy_file_range and quota) and a small feature bit
cleanup"
* tag 'ceph-for-4.20-rc2' of https://github.com/ceph/ceph-client:
libceph: assume argonaut on the server side
ceph: quota: fix null pointer dereference in quota check
ceph: add destination file data sync before doing any remote copy
According to Ted Ts'o ext4_getblk() called in ext4_xattr_inode_write()
should not return bh = NULL
The only time that bh could be NULL, then, would be in the case of
something really going wrong; a programming error elsewhere (perhaps a
wild pointer dereference) or I/O error causing on-disk file system
corruption (although that would be highly unlikely given that we had
*just* allocated the blocks and so the metadata blocks in question
probably would still be in the cache).
Fixes: e50e5129f3 ("ext4: xattr-in-inode support")
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org # 4.13
In async IO blocking case the additional reference to the io is taken for
it to survive fuse_aio_complete(). In non blocking case this additional
reference is not needed, however we still reference io to figure out
whether to wait for completion or not. This is wrong and will lead to
use-after-free. Fix it by storing blocking information in separate
variable.
This was spotted by KASAN when running generic/208 fstest.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Reported-by: Zorro Lang <zlang@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: 744742d692 ("fuse: Add reference counting for fuse_io_priv")
Cc: <stable@vger.kernel.org> # v4.6
In current fuse_drop_waiting() implementation it's possible that
fuse_wait_aborted() will not be woken up in the unlikely case that
fuse_abort_conn() + fuse_wait_aborted() runs in between checking
fc->connected and calling atomic_dec(&fc->num_waiting).
Do the atomic_dec_and_test() unconditionally, which also provides the
necessary barrier against reordering with the fc->connected check.
The explicit smp_mb() in fuse_wait_aborted() is not actually needed, since
the spin_unlock() in fuse_abort_conn() provides the necessary RELEASE
barrier after resetting fc->connected. However, this is not a performance
sensitive path, and adding the explicit barrier makes it easier to
document.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: b8f95e5d13 ("fuse: umount should wait for all requests")
Cc: <stable@vger.kernel.org> #v4.19
fuse_request_send_notify_reply() may fail if the connection was reset for
some reason (e.g. fs was unmounted). Don't leak request reference in this
case. Besides leaking memory, this resulted in fc->num_waiting not being
decremented and hence fuse_wait_aborted() left in a hanging and unkillable
state.
Fixes: 2d45ba381a ("fuse: add retrieve request")
Fixes: b8f95e5d13 ("fuse: umount should wait for all requests")
Reported-and-tested-by: syzbot+6339eda9cb4ebbc4c37b@syzkaller.appspotmail.com
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: <stable@vger.kernel.org> #v2.6.36
The previous attempt to fix for metadata read-ahead during truncate was
incorrect: for files with a height > 2 (1006989312 bytes with a block
size of 4096 bytes), read-ahead requests were not being issued for some
of the indirect blocks discovered while walking the metadata tree,
leading to significant slow-downs when deleting large files. Fix that.
In addition, only issue read-ahead requests in the first pass through
the meta-data tree, while deallocating data blocks.
Fixes: c3ce5aa9b0 ("gfs2: Fix metadata read-ahead during truncate")
Cc: stable@vger.kernel.org # v4.16+
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
gfs2_put_super calls gfs2_clear_rgrpd to destroy the gfs2_rgrpd objects
attached to the resource group glocks. That function should release the
buffers attached to the gfs2_bitmap objects (bi_bh), but the call to
gfs2_rgrp_brelse for doing that is missing.
When gfs2_releasepage later runs across these buffers which are still
referenced, it refuses to free them. This causes the pages the buffers
are attached to to remain referenced as well. With enough mount/unmount
cycles, the system will eventually run out of memory.
Fix this by adding the missing call to gfs2_rgrp_brelse in
gfs2_clear_rgrpd.
(Also fix a gfs2_rgrp_relse -> gfs2_rgrp_brelse typo in a comment.)
Fixes: 39b0f1e929 ("GFS2: Don't brelse rgrp buffer_heads every allocation")
Cc: stable@vger.kernel.org # v4.2+
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Before this patch, recovery would cause all callbacks to be delayed,
put on a queue, and afterward they were all queued to the callback
work queue. This patch does the same thing, but occasionally takes
a break after 25 of them so it won't swamp the CPU at the expense
of other RT processes like corosync.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
Make sure we have a saved filehandle, otherwise we'll oops with a null
pointer dereference in nfs4_preprocess_stateid_op().
Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
No one is running pre-argonaut. In addition one of the argonaut
features (NOSRCADDR) has been required since day one (and a half,
2.6.34 vs 2.6.35) of the kernel client.
Allow for the possibility of reusing these feature bits later.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
This patch fixes a possible null pointer dereference in
check_quota_exceeded, detected by the static checker smatch, with the
following warning:
fs/ceph/quota.c:240 check_quota_exceeded()
error: we previously assumed 'realm' could be null (see line 188)
Fixes: b7a2921765 ("ceph: quota: support for ceph.quota.max_files")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
If we try to copy into a file that was just written, any data that is
remote copied will be overwritten by our buffered writes once they are
flushed. When this happens, the call to invalidate_inode_pages2_range
will also return a -EBUSY error.
This patch fixes this by also sync'ing the destination file before
starting any copy.
Fixes: 503f82a993 ("ceph: support copy_file_range file operation")
Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
When an event is reported on a sub-directory and the parent inode has
a mark mask with FS_EVENT_ON_CHILD|FS_ISDIR, the event will be sent to
fsnotify() even if the event type is not in the parent mark mask
(e.g. FS_OPEN).
Further more, if that event happened on a mount or a filesystem with
a mount/sb mark that does have that event type in their mask, the "on
child" event will be reported on the mount/sb mark. That is not
desired, because user will get a duplicate event for the same action.
Note that the event reported on the victim inode is never merged with
the event reported on the parent inode, because of the check in
should_merge(): old_fsn->inode == new_fsn->inode.
Fix this by looking for a match of an actual event type (i.e. not just
FS_ISDIR) in parent's inode mark mask and by not reporting an "on child"
event to group if event type is only found on mount/sb marks.
[backport hint: The bug seems to have always been in fanotify, but this
patch will only apply cleanly to v4.19.y]
Cc: <stable@vger.kernel.org> # v4.19
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
If filesystem has already mounted as read-only, then we don't have
to do it again.
Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Timothy Baldwin <timbaldwin@fastmail.co.uk> wrote:
> As per mount_namespaces(7) unprivileged users should not be able to look under mount points:
>
> Mounts that come as a single unit from more privileged mount are locked
> together and may not be separated in a less privileged mount namespace.
>
> However they can:
>
> 1. Create a mount namespace.
> 2. In the mount namespace open a file descriptor to the parent of a mount point.
> 3. Destroy the mount namespace.
> 4. Use the file descriptor to look under the mount point.
>
> I have reproduced this with Linux 4.16.18 and Linux 4.18-rc8.
>
> The setup:
>
> $ sudo sysctl kernel.unprivileged_userns_clone=1
> kernel.unprivileged_userns_clone = 1
> $ mkdir -p A/B/Secret
> $ sudo mount -t tmpfs hide A/B
>
>
> "Secret" is indeed hidden as expected:
>
> $ ls -lR A
> A:
> total 0
> drwxrwxrwt 2 root root 40 Feb 12 21:08 B
>
> A/B:
> total 0
>
>
> The attack revealing "Secret":
>
> $ unshare -Umr sh -c "exec unshare -m ls -lR /proc/self/fd/4/ 4<A"
> /proc/self/fd/4/:
> total 0
> drwxr-xr-x 3 root root 60 Feb 12 21:08 B
>
> /proc/self/fd/4/B:
> total 0
> drwxr-xr-x 2 root root 40 Feb 12 21:08 Secret
>
> /proc/self/fd/4/B/Secret:
> total 0
I tracked this down to put_mnt_ns running passing UMOUNT_SYNC and
disconnecting all of the mounts in a mount namespace. Fix this by
factoring drop_mounts out of drop_collected_mounts and passing
0 instead of UMOUNT_SYNC.
There are two possible behavior differences that result from this.
- No longer setting UMOUNT_SYNC will no longer set MNT_SYNC_UMOUNT on
the vfsmounts being unmounted. This effects the lazy rcu walk by
kicking the walk out of rcu mode and forcing it to be a non-lazy
walk.
- No longer disconnecting locked mounts will keep some mounts around
longer as they stay because the are locked to other mounts.
There are only two users of drop_collected mounts: audit_tree.c and
put_mnt_ns.
In audit_tree.c the mounts are private and there are no rcu lazy walks
only calls to iterate_mounts. So the changes should have no effect
except for a small timing effect as the connected mounts are disconnected.
In put_mnt_ns there may be references from process outside the mount
namespace to the mounts. So the mounts remaining connected will
be the bug fix that is needed. That rcu walks are allowed to continue
appears not to be a problem especially as the rcu walk change was about
an implementation detail not about semantics.
Cc: stable@vger.kernel.org
Fixes: 5ff9d8a65c ("vfs: Lock in place mounts from more privileged users")
Reported-by: Timothy Baldwin <timbaldwin@fastmail.co.uk>
Tested-by: Timothy Baldwin <timbaldwin@fastmail.co.uk>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Jonathan Calmels from NVIDIA reported that he's able to bypass the
mount visibility security check in place in the Linux kernel by using
a combination of the unbindable property along with the private mount
propagation option to allow a unprivileged user to see a path which
was purposefully hidden by the root user.
Reproducer:
# Hide a path to all users using a tmpfs
root@castiana:~# mount -t tmpfs tmpfs /sys/devices/
root@castiana:~#
# As an unprivileged user, unshare user namespace and mount namespace
stgraber@castiana:~$ unshare -U -m -r
# Confirm the path is still not accessible
root@castiana:~# ls /sys/devices/
# Make /sys recursively unbindable and private
root@castiana:~# mount --make-runbindable /sys
root@castiana:~# mount --make-private /sys
# Recursively bind-mount the rest of /sys over to /mnnt
root@castiana:~# mount --rbind /sys/ /mnt
# Access our hidden /sys/device as an unprivileged user
root@castiana:~# ls /mnt/devices/
breakpoint cpu cstate_core cstate_pkg i915 intel_pt isa kprobe
LNXSYSTM:00 msr pci0000:00 platform pnp0 power software system
tracepoint uncore_arb uncore_cbox_0 uncore_cbox_1 uprobe virtual
Solve this by teaching copy_tree to fail if a mount turns out to be
both unbindable and locked.
Cc: stable@vger.kernel.org
Fixes: 5ff9d8a65c ("vfs: Lock in place mounts from more privileged users")
Reported-by: Jonathan Calmels <jcalmels@nvidia.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
It was recently pointed out that the one instance of testing MNT_LOCKED
outside of the namespace_sem is in ksys_umount.
Fix that by adding a test inside of do_umount with namespace_sem and
the mount_lock held. As it helps to fail fails the existing test is
maintained with an additional comment pointing out that it may be racy
because the locks are not held.
Cc: stable@vger.kernel.org
Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Fixes: 5ff9d8a65c ("vfs: Lock in place mounts from more privileged users")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
In copy_result_to_user(), we first create a struct dlm_lock_result, which
contains a struct dlm_lksb, the last member of which is a pointer to the
lvb. Unfortunately, we copy the entire struct dlm_lksb to the result
struct, which is then copied to userspace at the end of the function,
leaking the contents of sb_lvbptr, which is a valid kernel pointer in some
cases (indeed, later in the same function the data it points to is copied
to userspace).
It is an error to leak kernel pointers to userspace, as it undermines KASLR
protections (see e.g. 65eea8edc3 ("floppy: Do not copy a kernel pointer to
user memory in FDGETPRM ioctl") for another example of this).
Signed-off-by: Tycho Andersen <tycho@tycho.ws>
Signed-off-by: David Teigland <teigland@redhat.com>
kobject doesn't like zero length object names, so let's test for that.
Signed-off-by: Tycho Andersen <tycho@tycho.ws>
Signed-off-by: David Teigland <teigland@redhat.com>
dlm_config_nodes() does not allocate nodes on failure, so we should not
free() nodes when it fails.
Signed-off-by: Tycho Andersen <tycho@tycho.ws>
Signed-off-by: David Teigland <teigland@redhat.com>
We use IOCB_HIPRI to poll for IO in the caller instead of scheduling.
This information is not available for (or after) IO submission. The
driver may make different queue choices based on the type of IO, so
make the fact that we will poll for this IO known to the lower layers
as well.
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There's a race between close_ctree() and cleaner_kthread().
close_ctree() sets btrfs_fs_closing(), and the cleaner stops when it
sees it set, but this is racy; the cleaner might have already checked
the bit and could be cleaning stuff. In particular, if it deletes unused
block groups, it will create delayed iputs for the free space cache
inodes. As of "btrfs: don't run delayed_iputs in commit", we're no
longer running delayed iputs after a commit. Therefore, if the cleaner
creates more delayed iputs after delayed iputs are run in
btrfs_commit_super(), we will leak inodes on unmount and get a busy
inode crash from the VFS.
Fix it by parking the cleaner before we actually close anything. Then,
any remaining delayed iputs will always be handled in
btrfs_commit_super(). This also ensures that the commit in close_ctree()
is really the last commit, so we can get rid of the commit in
cleaner_kthread().
The fstest/generic/475 followed by 476 can trigger a crash that
manifests as a slab corruption caused by accessing the freed kthread
structure by a wake up function. Sample trace:
[ 5657.077612] BUG: unable to handle kernel NULL pointer dereference at 00000000000000cc
[ 5657.079432] PGD 1c57a067 P4D 1c57a067 PUD da10067 PMD 0
[ 5657.080661] Oops: 0000 [#1] PREEMPT SMP
[ 5657.081592] CPU: 1 PID: 5157 Comm: fsstress Tainted: G W 4.19.0-rc8-default+ #323
[ 5657.083703] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626cc-prebuilt.qemu-project.org 04/01/2014
[ 5657.086577] RIP: 0010:shrink_page_list+0x2f9/0xe90
[ 5657.091937] RSP: 0018:ffffb5c745c8f728 EFLAGS: 00010287
[ 5657.092953] RAX: 0000000000000074 RBX: ffffb5c745c8f830 RCX: 0000000000000000
[ 5657.094590] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff9a8747fdf3d0
[ 5657.095987] RBP: ffffb5c745c8f9e0 R08: 0000000000000000 R09: 0000000000000000
[ 5657.097159] R10: ffff9a8747fdf5e8 R11: 0000000000000000 R12: ffffb5c745c8f788
[ 5657.098513] R13: ffff9a877f6ff2c0 R14: ffff9a877f6ff2c8 R15: dead000000000200
[ 5657.099689] FS: 00007f948d853b80(0000) GS:ffff9a877d600000(0000) knlGS:0000000000000000
[ 5657.101032] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5657.101953] CR2: 00000000000000cc CR3: 00000000684bd000 CR4: 00000000000006e0
[ 5657.103159] Call Trace:
[ 5657.103776] shrink_inactive_list+0x194/0x410
[ 5657.104671] shrink_node_memcg.constprop.84+0x39a/0x6a0
[ 5657.105750] shrink_node+0x62/0x1c0
[ 5657.106529] try_to_free_pages+0x1a4/0x500
[ 5657.107408] __alloc_pages_slowpath+0x2c9/0xb20
[ 5657.108418] __alloc_pages_nodemask+0x268/0x2b0
[ 5657.109348] kmalloc_large_node+0x37/0x90
[ 5657.110205] __kmalloc_node+0x236/0x310
[ 5657.111014] kvmalloc_node+0x3e/0x70
Fixes: 30928e9baa ("btrfs: don't run delayed_iputs in commit")
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add trace ]
Signed-off-by: David Sterba <dsterba@suse.com>
bs.bh was taken in previous ext4_xattr_block_find() call,
it should be released before re-using
Fixes: 7e01c8e542 ("ext3/4: fix uninitialized bs in ...")
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org # 2.6.26
Fixes: bfe0a5f47a ("ext4: add more mount time checks of the superblock")
Reported-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org # 4.18
ext4_mark_iloc_dirty() callers expect that it releases iloc->bh
even if it returns an error.
Fixes: 0db1ff222d ("ext4: add shutdown bit and check for it")
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org # 4.11