- Replace memcpy with structure assignment.
- Remove unneeded codes and use helper function i_blocksize().
- Fix typos found by codespell.
-----BEGIN PGP SIGNATURE-----
iQJMBAABCgA2FiEE6NzKS6Uv/XAAGHgyZwv7A1FEIQgFAl+QyhMYHG5hbWphZS5q
ZW9uQHNhbXN1bmcuY29tAAoJEGcL+wNRRCEIhHQP/Rfk21jZ0iRzoVG5WXVWpqrG
wYCUWZu2xXmuqO+NroZ7vmSloTkj7txOSt2nyEHdbJhABOzWFxyU+pfYRpLcXVC0
uNt0Qk5wrey6JM+93ulwdrAKHqQzd5B4rv/wli4pNQKuh9FQSWUIJSEUw7m3LRvi
R9TbXCwBRiMIC0NPZMNrjOtrpjHcqhAWRQHsLIoA+7XCgQ4z8T1t0IlASuCy4HTb
nUNkfc7lMzFp0aW19T2EJYr6X7Yle56Ad91kFhHh9qKZEp9ET81Q3kaA5tVSGCMc
hCTDdBopOvfXOuwkmDKIYObfMiAauWiGSMlP+WKQ26jfh8fLEc8MT+dInfuqSs76
dL6vtOtId/lzhTk7XjUGP3WDxmyNL9Ri9HqYfHydEUphWdWBusr1pxIMvllc6bAs
PLJ9CgJHjAmRMBOpVQELRgqTqqsQgfpX/x5EmA0h4CW1JfwCTH7Y62bSI7G7RgCi
sMuW57IzGBGNDw4KyFLRg0G/V8z05aehsXWqDBNfJCYbBGIHvxg0pofGHFcc8sla
+HQt5rfjn7XU5RngyN5zf/ZOtuM6/tklGs2mbGNVSgZ1Px2W03k9iXU2tx4PTPIY
RRNDodgVbHldB7PPllCLRJwb2m9ungz1GF/H84dNujK2oyZNksTvDCS4Ad7uW8Bf
lujSPhSDZB730gYly/Vr
=qmM6
-----END PGP SIGNATURE-----
Merge tag 'exfat-for-5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat
Pull exfat updates from Namjae Jeon:
- Replace memcpy with structure assignment
- Remove unneeded codes and use helper function i_blocksize()
- Fix typos found by codespell
* tag 'exfat-for-5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat:
exfat: remove useless check in exfat_move_file()
exfat: remove 'rwoffset' in exfat_inode_info
exfat: replace memcpy with structure assignment
exfat: remove useless directory scan in exfat_add_entry()
exfat: eliminate dead code in exfat_find()
exfat: use i_blocksize() to get blocksize
exfat: fix misspellings using codespell tool
A couple of small fixes (loff_t overflow on 32bit, syzbot uninitialized
variable warning) and code cleanup (xen)
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE/IPbcYBuWt0zoYhOq06b7GqY5nAFAl+Rc/EACgkQq06b7GqY
5nACqxAAnrPj6uA4IQGCQCbRKetkidZrIGre/8RyEAyfxj+8gV0POYp/1SFokAu+
RzNkeDgKXkV8L/6BEc/XTi/Iy2HGlPOQLjpBcDLwOIe+ZSlXYK544zUqs20W8/ZH
NHtUY9rvgEWaTyYCi0wjS51xqBs2/Km4OIUKj7OZHXx9rWmpPX/6xwgcR4tNAWEJ
7AYkq9G0OiDbY9TaHRK0dOmVq1Xs+H3s5Ci4FMh/uFcQBI3UgfpYAegqTiMh8/AF
aWgtMdidHaBIWdzb1W4UbBXVd14dFfcIqMTzstFO+mElO4VtQCAQtC16cJX9Dqgh
2YQxZgF5uVAi3DogHLDT0iT6RiqJrU8WeDLfOq8AuOy29l5jlGB6y1GppmS4z0DK
sw+fy0lMH9q2ahbJs3RMk05m9vK8L9HCjqnMbTD0TohtkEEWyhHJ2RadtkDnRAL+
Vy+tlfENloVW4zJIFT8RI5pLXqLG4NGV5Tlp5D//5abV1V8uEeVFHe97JTBMsafi
FAGWqghPPxRB7UhtbnNHiIGDClOctT9y+SY0mxIkq0Bw+26K5G/eXWwF+Wgelfrs
n0nCrbi5rVdJJhcrZid4r/tUDpXydxgQmrXV8jsgpHH22B02WXKtAzcgWJsB9vec
syicGh9Shr4cF1S0Rq/Brgl47WGwHjufsvWSwnukvZ1+X4Rs9Qw=
=ONeq
-----END PGP SIGNATURE-----
Merge tag '9p-for-5.10-rc1' of git://github.com/martinetd/linux
Pull 9p updates from Dominique Martinet:
"A couple of small fixes (loff_t overflow on 32bit, syzbot
uninitialized variable warning) and code cleanup (xen)"
* tag '9p-for-5.10-rc1' of git://github.com/martinetd/linux:
net: 9p: initialize sun_server.sun_path to have addr's value only when addr is valid
9p/xen: Fix format argument warning
9P: Cast to loff_t before multiplying
Every close(io_uring) causes cancellation of all inflight requests
carrying ->files. That's not nice but was neccessary up until recently.
Now task->files removal is handled in the core code, so that part of
flush can be removed.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We correctly set io-wq NUMA node affinities when the io-wq context is
setup, but if an entire node CPU set is offlined and then brought back
online, the per node affinities are broken. Ensure that we set them
again whenever a CPU comes online. This ensures that we always track
the right node affinity. The usual cpuhp notifiers are used to drive it.
Reported-by: Zhang Qiang <qiang.zhang@windriver.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
During the stability test, there are some errors:
ext4_lookup:1590: inode #6967: comm fsstress: iget: checksum invalid.
If the inode->i_iblocks too big and doesn't set huge file flag, checksum
will not be recalculated when update the inode information to it's buffer.
If other inode marks the buffer dirty, then the inconsistent inode will
be flushed to disk.
Fix this problem by checking i_blocks in advance.
Cc: stable@kernel.org
Signed-off-by: Luo Meng <luomeng12@huawei.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Link: https://lore.kernel.org/r/20201020013631.3796673-1-luomeng12@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
This patch adds fast commit recovery path support for Ext4 file
system. We add several helper functions that are similar in spirit to
e2fsprogs journal recovery path handlers. Example of such functions
include - a simple block allocator, idempotent block bitmap update
function etc. Using these routines and the fast commit log in the fast
commit area, the recovery path (ext4_fc_replay()) performs fast commit
log recovery.
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20201015203802.3597742-8-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
This patch adds main fast commit commit path handlers. The overall
patch can be divided into two inter-related parts:
(A) Metadata updates tracking
This part consists of helper functions to track changes that need
to be committed during a commit operation. These updates are
maintained by Ext4 in different in-memory queues. Following are
the APIs and their short description that are implemented in this
patch:
- ext4_fc_track_link/unlink/creat() - Track unlink. link and creat
operations
- ext4_fc_track_range() - Track changed logical block offsets
inodes
- ext4_fc_track_inode() - Track inodes
- ext4_fc_mark_ineligible() - Mark file system fast commit
ineligible()
- ext4_fc_start_update() / ext4_fc_stop_update() /
ext4_fc_start_ineligible() / ext4_fc_stop_ineligible() These
functions are useful for co-ordinating inode updates with
commits.
(B) Main commit Path
This part consists of functions to convert updates tracked in
in-memory data structures into on-disk commits. Function
ext4_fc_commit() is the main entry point to commit path.
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20201015203802.3597742-6-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
This patch adds fast commit area trackers in the journal_t
structure. These are initialized via the jbd2_fc_init() routine that
this patch adds. This patch also adds ext4/fast_commit.c and
ext4/fast_commit.h files for fast commit code that will be added in
subsequent patches in this series.
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20201015203802.3597742-4-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
We are running out of mount option bits. Add handling for using
s_mount_opt2. Add ext4 and jbd2 fast commit feature flag and also add
ability to turn off the fast commit feature in Ext4.
Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20201015203802.3597742-3-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
In exfat_move_file(), the identity of source and target directory has been
checked by the caller.
Also, it gets stream.start_clu from file dir-entry, which is an invalid
determination.
Signed-off-by: Tetsuhiro Kohada <kohada.t2@gmail.com>
Acked-by: Sungjong Seo <sj1557.seo@samsung.com>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Remove 'rwoffset' in exfat_inode_info and replace it with the parameter of
exfat_readdir().
Since rwoffset is referenced only by exfat_readdir(), it is not necessary
a exfat_inode_info's member.
Also, change cpos to point to the next of entry-set, and return the index
of dir-entry via dir_entry->entry.
Signed-off-by: Tetsuhiro Kohada <kohada.t2@gmail.com>
Acked-by: Sungjong Seo <sj1557.seo@samsung.com>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
There is nothing in directory just created, so there is no need to scan.
Signed-off-by: Tetsuhiro Kohada <kohada.t2@gmail.com>
Acked-by: Sungjong Seo <sj1557.seo@samsung.com>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
The exfat_find_dir_entry() called by exfat_find() doesn't return -EEXIST.
Therefore, the root-dir information setting is never executed.
Signed-off-by: Tetsuhiro Kohada <kohada.t2@gmail.com>
Acked-by: Sungjong Seo <sj1557.seo@samsung.com>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
We alreday has the interface i_blocksize() to get blocksize,
so use it.
Signed-off-by: Xianting Tian <tian.xianting@h3c.com>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
If processing recovered log intent items fails, we need to cancel all
the unprocessed recovered items immediately so that a subsequent AIL
push in the bail out path won't get wedged on the pinned intent items
that didn't get processed.
This can happen if the log contains (1) an intent that gets and releases
an inode, (2) an intent that cannot be recovered successfully, and (3)
some third intent item. When recovery of (2) fails, we leave (3) pinned
in memory. Inode reclamation is called in the error-out path of
xfs_mountfs before xfs_log_cancel_mount. Reclamation calls
xfs_ail_push_all_sync, which gets stuck waiting for (3).
Therefore, call xlog_recover_cancel_intents if _process_intents fails.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
To servers which do not support directory leases (e.g. Samba)
it is wasteful to try to open_shroot (ie attempt to cache the
root directory handle). Skip attempt to open_shroot when
server does not indicate support for directory leases.
Cuts the number of requests on mount from 17 to 15, and
cuts the number of requests on stat of the root directory
from 4 to 3.
Signed-off-by: Steve French <stfrench@microsoft.com>
CC: Stable <stable@vger.kernel.org> # v5.1+
When mounting with modefromsid mount option, it was possible to
get the error on stat of a fifo or char or block device:
"cannot stat <filename>: Operation not supported"
Special devices can be stored as reparse points by some servers
(e.g. Windows NFS server and when using the SMB3.1.1 POSIX
Extensions) but when the modefromsid mount option is used
the client attempts to get the ACL for the file which requires
opening with OPEN_REPARSE_POINT create option.
Signed-off-by: Steve French <stfrench@microsoft.com>
CC: Stable <stable@vger.kernel.org>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Reviewed-by: Shyam Prasad N <sprasad@microsoft.com>
Can be helpful in debugging mount and reconnect issues
Signed-off-by: Samuel Cabrero <scabrero@suse.de>
Reviewed-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
TCP server info field server->total_read is modified in parallel by
demultiplex thread and decrypt offload worker thread. server->total_read
is used in calculation to discard the remaining data of PDU which is
not read into memory.
Because of parallel modification, server->total_read can get corrupted
and can result in discarding the valid data of next PDU.
Signed-off-by: Rohith Surabattula <rohiths@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
CC: Stable <stable@vger.kernel.org> #5.4+
Signed-off-by: Steve French <stfrench@microsoft.com>
Clear linked_timeout for next requests in __io_queue_sqe() so we won't
queue it up unnecessary when it's going to be punted.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Cc: stable@vger.kernel.org # v5.9
Signed-off-by: Jens Axboe <axboe@kernel.dk>
- a patch that removes crush_workspace_mutex (myself). CRUSH
computations are no longer serialized and can run in parallel.
- a couple new filesystem client metrics for "ceph fs top" command
(Xiubo Li)
- a fix for a very old messenger bug that affected the filesystem,
marked for stable (myself)
- assorted fixups and cleanups throughout the codebase from Jeff
and others.
-----BEGIN PGP SIGNATURE-----
iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAl+QN8cTHGlkcnlvbW92
QGdtYWlsLmNvbQAKCRBKf944AhHzi4iwCACLwUvO0ONTzUUb2D8ftfyXjo/jIod5
eO7RHfCHOP2A83GQpdYdktVDSVeOUIOiPhxAxo1dL4GieI2/saXrnoevam7ogZkA
OmR4drdtRVUqF/aATrtjiDMg2ge0dnx5gfjMxSP/FiPPpXOdrtxng/7tv8yo+03q
AlMqg/YcxO06t1M1qh9SEyfzjcHyPnJU2i0ienngxnGxQ7QiMOR6anF1LNhtN803
4fTBX2tLqDNAa+x5yF1kKSn9OmFNhc0oUsqef+Ck0Vw1LC0/SuxzE2J904iehAwy
/HzJCer0zcp+eZhn3HhoHmp0UZpOC1dzMxa3CHyRJFrxCSTEhY8iqPiJ
=wGr7
-----END PGP SIGNATURE-----
Merge tag 'ceph-for-5.10-rc1' of git://github.com/ceph/ceph-client
Pull ceph updates from Ilya Dryomov:
- a patch that removes crush_workspace_mutex (myself). CRUSH
computations are no longer serialized and can run in parallel.
- a couple new filesystem client metrics for "ceph fs top" command
(Xiubo Li)
- a fix for a very old messenger bug that affected the filesystem,
marked for stable (myself)
- assorted fixups and cleanups throughout the codebase from Jeff and
others.
* tag 'ceph-for-5.10-rc1' of git://github.com/ceph/ceph-client: (27 commits)
libceph: clear con->out_msg on Policy::stateful_server faults
libceph: format ceph_entity_addr nonces as unsigned
libceph: fix ENTITY_NAME format suggestion
libceph: move a dout in queue_con_delay()
ceph: comment cleanups and clarifications
ceph: break up send_cap_msg
ceph: drop separate mdsc argument from __send_cap
ceph: promote to unsigned long long before shifting
ceph: don't SetPageError on readpage errors
ceph: mark ceph_fmt_xattr() as printf-like for better type checking
ceph: fold ceph_update_writeable_page into ceph_write_begin
ceph: fold ceph_sync_writepages into writepage_nounlock
ceph: fold ceph_sync_readpages into ceph_readpage
ceph: don't call ceph_update_writeable_page from page_mkwrite
ceph: break out writeback of incompatible snap context to separate function
ceph: add a note explaining session reject error string
libceph: switch to the new "osd blocklist add" command
libceph, rbd, ceph: "blacklist" -> "blocklist"
ceph: have ceph_writepages_start call pagevec_lookup_range_tag
ceph: use kill_anon_super helper
...
In commit fe341eb151, I forgot that xfs_free_file_space isn't strictly
a "remove mapped blocks" function. It is actually a function to zero
file space by punching out the middle and writing zeroes to the
unaligned ends of the specified range. Therefore, putting a rtextsize
alignment check in that function is wrong because that breaks unaligned
ZERO_RANGE on the realtime volume.
Furthermore, xfs_file_fallocate already has alignment checks for the
functions require the file range to be aligned to the size of a
fundamental allocation unit (which is 1 FSB on the data volume and 1 rt
extent on the realtime volume). Create a new helper to check fallocate
arguments against the realtiem allocation unit size, fix the fallocate
frontend to use it, fix free_file_space to delete the correct range, and
remove a now redundant check from insert_file_space.
NOTE: The realtime extent size is not required to be a power of two!
Fixes: fe341eb151 ("xfs: ensure that fpunch, fcollapse, and finsert operations are aligned to rt extent size")
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
NFS_FS=y as dependency of CONFIG_NFSD_V4_2_INTER_SSC still have
build errors and some configs with NFSD=m to get NFS4ERR_STALE
error when doing inter server copy.
Added ops table in nfs_common for knfsd to access NFS client modules.
Fixes: 3ac3711adb ("NFSD: Fix NFS server build errors")
Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
This one was missed in the earlier conversion, should be included like
any of the other IO identity flags. Make sure we restore to RLIM_INIFITY
when dropping the personality again.
Fixes: 98447d65b4 ("io_uring: move io identity items into separate struct")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
And read these in __get_log_header() from the log header.
Also make gfs2_statfs_change_out() non-static so it can be used
outside of super.c
Signed-off-by: Abhi Das <adas@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
The gfs2_glock structure has a gl_vm member, introduced in commit 7005c3e4ae
("GFS2: Use range based functions for rgrp sync/invalidation"), which stores
the location of resource groups within their address space. This structure is
in a union with iopen glock specific fields. It was introduced because at
unmount time, the resource group objects were destroyed before flushing out any
pending resource group glock work, and flushing out such work could require
flushing / truncating the address space.
Since commit b3422cacdd ("gfs2: Rework how rgrp buffer_heads are managed"),
any pending resource group glock work is flushed out before destroying the
resource group objects. So the resource group objects will now always exist in
rgrp_go_sync and rgrp_go_inval, and we now simply compute the gl_vm values
where needed instead of caching them. This also eliminates the union.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Only initialize gl_delete for iopen glocks, but more importantly, only access
it for iopen glocks in flush_delete_work: flush_delete_work is called for
different types of glocks including rgrp glocks, and those use gl_vm which is
in a union with gl_delete. Without this fix, we'll end up clobbering gl_vm,
which results in general memory corruption.
Fixes: a0e3cc65fa ("gfs2: Turn gl_delete into a delayed work")
Cc: stable@vger.kernel.org # v5.8+
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
The comments before function glock_hash_walk had the wrong name and
an extra parameter. This simply fixes the comments.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
- Stable Fixes:
- Wait for stateid updates after CLOSE/OPEN_DOWNGRADE # v5.4+
- Fix nfs_path in case of a rename retry
- Support EXCHID4_FLAG_SUPP_FENCE_OPS v4.2 EXCHANGE_ID flag
- New features and improvements:
- Replace dprintk() calls with tracepoints
- Make cache consistency bitmap dynamic
- Added support for the NFS v4.2 READ_PLUS operation
- Improvements to net namespace uniquifier
- Other bugfixes and cleanups
- Remove redundant clnt pointer
- Don't update timeout values on connection resets
- Remove redundant tracepoints
- Various cleanups to comments
- Fix oops when trying to use copy_file_range with v4.0 source server
- Improvements to flexfiles mirrors
- Add missing "local_lock=posix" mount option
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAl+PJv0ACgkQ18tUv7Cl
QOtB+hAAza4BwQSAZs0QbszPcoNV0f0qND99hWcJ2ad5KCr8S5eNYaMqB8G8lbS+
Ahuv6xRy69l2vjHvINXoSaSiXClYNAlAr5myiW0DqMLDpVV8VEMAhqgFJK97VpqQ
AsbsdFwC4xAcQ+6nJ+IfK9f8AOYhvARkOP3171qkq0jX3Eq4j1IiQARn+JrGFZFL
waC4UtVhCnx5z5pg7jyu7Ar4N0Xky72+Fm1EvRog4gndX2JbNNSkwavD33mWZ8iN
3q6DRq/AtEk9F4ttPwVeALvpNCBqjoGcNw5FCBil7x4V+xIbDI2FBhusKas3GzXm
W2tyUuJMTGpA6ZjkyYDcg0jC8vKUiUpMWgT3oZoQ/6g7P4vPzB/XB/RQcAInMRjS
/MhWZ3YNQmApnVCIZSwOa9+oNUc69dM0+vqmesLvvb/aNjzRnGwiJNiuWGyrwoGX
3SzpPUcJixa3+eKLi9uEHojKB+SuPIrB/HW4UYh/ZpOMyilVqUnX8RYbaK3qQ/gK
Un++LjTmao9cXyjIDCiHJB630W01YSWXq1nwuSYkLuXCuALmmtgA4z8pm+4K+w+G
SctETKVcD0qi3Yx5mFdU5A1dHXMB7nzYNaiRWYLvLO5qAR33nUPI4lV8rog2TUPe
2iW4n0j2YxnTlwR3QMPYJRndP7KeR5XvOoNuh3XpKRVbmbc7/A4=
=6RM5
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-5.10-1' of git://git.linux-nfs.org/projects/anna/linux-nfs
Pull NFS client updates from Anna Schumaker:
"Stable Fixes:
- Wait for stateid updates after CLOSE/OPEN_DOWNGRADE # v5.4+
- Fix nfs_path in case of a rename retry
- Support EXCHID4_FLAG_SUPP_FENCE_OPS v4.2 EXCHANGE_ID flag
New features and improvements:
- Replace dprintk() calls with tracepoints
- Make cache consistency bitmap dynamic
- Added support for the NFS v4.2 READ_PLUS operation
- Improvements to net namespace uniquifier
Other bugfixes and cleanups:
- Remove redundant clnt pointer
- Don't update timeout values on connection resets
- Remove redundant tracepoints
- Various cleanups to comments
- Fix oops when trying to use copy_file_range with v4.0 source server
- Improvements to flexfiles mirrors
- Add missing 'local_lock=posix' mount option"
* tag 'nfs-for-5.10-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (55 commits)
NFSv4.2: support EXCHGID4_FLAG_SUPP_FENCE_OPS 4.2 EXCHANGE_ID flag
NFSv4: Fix up RCU annotations for struct nfs_netns_client
NFS: Only reference user namespace from nfs4idmap struct instead of cred
nfs: add missing "posix" local_lock constant table definition
NFSv4: Use the net namespace uniquifier if it is set
NFSv4: Clean up initialisation of uniquified client id strings
NFS: Decode a full READ_PLUS reply
SUNRPC: Add an xdr_align_data() function
NFS: Add READ_PLUS hole segment decoding
SUNRPC: Add the ability to expand holes in data pages
SUNRPC: Split out _shift_data_right_tail()
SUNRPC: Split out xdr_realign_pages() from xdr_align_pages()
NFS: Add READ_PLUS data segment support
NFS: Use xdr_page_pos() in NFSv4 decode_getacl()
SUNRPC: Implement a xdr_page_pos() function
SUNRPC: Split out a function for setting current page
NFS: fix nfs_path in case of a rename retry
fs: nfs: return per memcg count for xattr shrinkers
NFSv4: Wait for stateid updates after CLOSE/OPEN_DOWNGRADE
nfs: remove incorrect fallthrough label
...
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl+O9WEQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpqajEADSD5PiO94YWTtVNWFUjn5RW+GlCE70/0VV
XImdasmvM8nb48E+z2EW0Ky4vKXeVy5r+WZAeIYqPUHy/ogQDVpEn00NL7tFQmOz
8UYrlZ3LLE/bSeWM5iavgG7TldVs/ZfspJ0hj3/Ac7jJpzuRGEI5TClsxJ0mWV39
b2qT4OYBDhwvwVPZ/qhWgEwJXJFzFywckouIbMw8gPkveebUYeyu/yuScNGwYuiQ
46YPEk/XIuOy8iUvQjqhLY+NNlAKJwt3z9WZgt5F+TZIhkpp7z6h20+jezFQcuFP
GXzIDN+EADpsbw7MWJYIZVffxDEMlDpkJlAVMT1hsYLDfTXoEzmFwRddoFh2Fjf6
ghWqhOKffUuAOX2xs1MrS2xLaxd0ot7QqZJVTYk7zEljkaRANlstSZZ+PpI+Sad/
rNieQvs6jnsmTODDEaV3qyFX5aBQ2NdvyndZNU9wz0GZAWAdz+wxE0A1FVD0A37i
p6m8sIvhNg3/cW89G04IDYUkAygi8knVDnEDHRwaJtswZQ4pRSGMp+N4qZ0GpnK7
BviaAhofGaYlqruavO6Ug2YyomYpWGlUxTaB9ZKh0HkEDlDM945+0sgQRdxfsE8d
OboycqJn3puOl/wh5Fc4oGYrWLsDbaA/5kksC4lm85Z+HUf+UXMS4QFdoPJYjhuM
H6oMz1w2bQ==
=v56S
-----END PGP SIGNATURE-----
Merge tag 'io_uring-5.10-2020-10-20' of git://git.kernel.dk/linux-block
Pull io_uring updates from Jens Axboe:
"A mix of fixes and a few stragglers. In detail:
- Revert the bogus __read_mostly that we discussed for the initial
pull request.
- Fix a merge window regression with fixed file registration error
path handling.
- Fix io-wq numa node affinities.
- Series abstracting out an io_identity struct, making it both easier
to see what the personality items are, and also easier to to adopt
more. Use this to cover audit logging.
- Fix for read-ahead disabled block condition in async buffered
reads, and using single page read-ahead to unify what
generic_file_buffer_read() path is used.
- Series for REQ_F_COMP_LOCKED fix and removal of it (Pavel)
- Poll fix (Pavel)"
* tag 'io_uring-5.10-2020-10-20' of git://git.kernel.dk/linux-block: (21 commits)
io_uring: use blk_queue_nowait() to check if NOWAIT supported
mm: use limited read-ahead to satisfy read
mm: mark async iocb read as NOWAIT once some data has been copied
io_uring: fix double poll mask init
io-wq: inherit audit loginuid and sessionid
io_uring: use percpu counters to track inflight requests
io_uring: assign new io_identity for task if members have changed
io_uring: store io_identity in io_uring_task
io_uring: COW io_identity on mismatch
io_uring: move io identity items into separate struct
io_uring: rely solely on work flags to determine personality.
io_uring: pass required context in as flags
io-wq: assign NUMA node locality if appropriate
io_uring: fix error path cleanup in io_sqe_files_register()
Revert "io_uring: mark io_uring_fops/io_op_defs as __read_mostly"
io_uring: fix REQ_F_COMP_LOCKED by killing it
io_uring: dig out COMP_LOCK from deep call chain
io_uring: don't put a poll req under spinlock
io_uring: don't unnecessarily clear F_LINK_TIMEOUT
io_uring: don't set COMP_LOCKED if won't put
...
Don't populate const array smb3_create_tag_posix on the stack but
instead make it static. Makes the object code smaller by 50 bytes.
Before:
text data bss dec hex filename
150184 47167 0 197351 302e7 fs/cifs/smb2pdu.o
After:
text data bss dec hex filename
150070 47231 0 197301 302b5 fs/cifs/smb2pdu.o
(gcc version 10.2.0)
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
We were setting the uid/gid to the default in each dir entry
in the parsing of the POSIX query dir response, rather
than attempting to map the user and group SIDs returned by
the server to well known SIDs (or upcall if not found).
CC: Stable <stable@vger.kernel.org>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
SMB3 crediting is used for flow control, and it can be useful to
trace for problem determination how many credits were acquired
and for which operation.
Here is an example ("trace-cmd record -e *add_credits"):
cifsd-9522 [010] .... 5995.202712: smb3_add_credits:
server=localhost current_mid=0x12 credits=373 credits_to_add=10
cifsd-9522 [010] .... 5995.204040: smb3_add_credits:
server=localhost current_mid=0x15 credits=400 credits_to_add=30
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
There are cases where the server can return a cipher type of 0 and
it not be an error. For example server supported no encryption types
(e.g. server completely disabled encryption), or the server and
client didn't support any encryption types in common (e.g. if a
server only supported AES256_CCM). In those cases encryption would
not be supported, but that can be ok if the client did not require
encryption on mount and it should not return an error.
In the case in which mount requested encryption ("seal" on mount)
then checks later on during tree connection will return the proper
rc, but if seal was not requested by client, since server is allowed
to return 0 to indicate no supported cipher, we should not fail mount.
Reported-by: Pavel Shilovsky <pshilov@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
- New feature: Widen inode timestamps and quota grace expiration
timestamps to support dates through the year 2486.
- New feature: storing inode btree counts in the AGI to speed up certain
mount time per-AG block reservation operatoins and add a little more
metadata redundancy.
For the second round of new code for 5.10:
- Deprecate the V4 filesystem format, some disused mount options, and some
legacy sysctl knobs now that we can support dates into the 25th century.
Note that removal of V4 support will not happen until the early 2030s.
- Fix some probles with inode realtime flag propagation.
- Fix some buffer handling issues when growing a rt filesystem.
- Fix a problem where a BMAP_REMAP unmap call would free rt extents even
though the purpose of BMAP_REMAP is to avoid freeing the blocks.
- Strengthen the dabtree online scrubber to check hash values on child
dabtree blocks.
- Actually log new intent items created as part of recovering log intent
items.
- Fix a bug where quotas weren't attached to an inode undergoing bmap
intent item recovery.
- Fix a buffer overrun problem with specially crafted log buffer
headers.
- Various cleanups to type usage and slightly inaccurate comments.
- More cleanups to the xattr, log, and quota code.
- Don't run the (slower) shared-rmap operations on attr fork mappings.
- Fix a bug where we failed to check the LSN of finobt blocks during
replay and could therefore overwrite newer data with older data.
- Clean up the ugly nested transaction mess that log recovery uses to
stage intent item recovery in the correct order by creating a proper
data structure to capture recovered chains.
- Use the capture structure to resume intent item chains with the
same log space and block reservations as when they were captured.
- Fix a UAF bug in bmap intent item recovery where we failed to maintain
our reference to the incore inode if the bmap operation needed to
relog itself to continue.
- Rearrange the defer ops mechanism to finish newly created subtasks
of a parent task before moving on to the next parent task.
- Automatically relog intent items in deferred ops chains if doing so
would help us avoid pinning the log tail. This will help fix some
log scaling problems now and will facilitate atomic file updates later.
- Fix a deadlock in the GETFSMAP implementation by using an internal
memory buffer to reduce indirect calls and copies to userspace,
thereby improving its performance by ~20%.
- Fix various problems when calling growfs on a realtime volume would
not fully update the filesystem metadata.
- Fix broken Kconfig asking about deprecated XFS when XFS is disabled.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAl+KRuUACgkQ+H93GTRK
tOtqsBAAm+AZ92DRjOD7/TbU3vJALRKBBUCc6weEYUJKaZUkdYpx7Fn8Si3K8Nu0
Pxo8qLO8WtP3ECyd+CZgkQgZAhHrjRG+FnCOuNyj1yMguX9CDu4cK0dOh/M64+pM
BvWPqLfd99mzr7HkQ0SuLIyDMeio3leU4lySAIVpADO3V7WF5ZgHCfEETpOh5Di1
oIzhYlxHyfK+32u4sXSWsPnogQZwjyn4CyQ+6humK0d089pVB1wbjHaTym7exjSa
cFhMqS1XDbpMuoF4BXMcx31UTOb+8/S6TKCVsRl61j3XKGzbYKSrLmrSb/r6gFWn
wyXJGmLok0I2UDnX1ZArIWstJHcgPlTelWrssG8wAnopLSJoU10f8o88m43d0krF
fCUCac1rKPcisg7CS5njgUkOBknSLeBCeztl59N/8acnkaETPQr0tReDpB4wGGaW
aGEWBrCbz1QZyfDBttNPQLcreROGukZ8R8MMRl4GiAQwZz5UrTUFeoK6thplHVvp
ANhpYGdJy4jJ79wt4MNVYUF8U8IRWdn0ddsRx08pLWchC1PH8HH944qrUXAVPYZ+
MohSQqKtjvaKwZLP86SvCJFs20wEEUxzCSQbz4LTO5aBz3uDfo0LQSOYzPBU+OKp
E33SNds13nsjeH8HBLtXH3lr3absywLcV2ZMaIGsQSLpE2p8AHA=
=LuWg
-----END PGP SIGNATURE-----
Merge tag 'xfs-5.10-merge-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull more xfs updates from Darrick Wong:
"The second large pile of new stuff for 5.10, with changes even more
monumental than last week!
We are formally announcing the deprecation of the V4 filesystem format
in 2030. All users must upgrade to the V5 format, which contains
design improvements that greatly strengthen metadata validation,
supports reflink and online fsck, and is the intended vehicle for
handling timestamps past 2038. We're also deprecating the old Irix
behavioral tweaks in September 2025.
Coming along for the ride are two design changes to the deferred
metadata ops subsystem. One of the improvements is to retain correct
logical ordering of tasks and subtasks, which is a more logical design
for upper layers of XFS and will become necessary when we add atomic
file range swaps and commits. The second improvement to deferred ops
improves the scalability of the log by helping the log tail to move
forward during long-running operations. This reduces log contention
when there are a large number of threads trying to run transactions.
In addition to that, this fixes numerous small bugs in log recovery;
refactors logical intent log item recovery to remove the last
remaining place in XFS where we could have nested transactions; fixes
a couple of ways that intent log item recovery could fail in ways that
wouldn't have happened in the regular commit paths; fixes a deadlock
vector in the GETFSMAP implementation (which improves its performance
by 20%); and fixes serious bugs in the realtime growfs, fallocate, and
bitmap handling code.
Summary:
- Deprecate the V4 filesystem format, some disused mount options, and
some legacy sysctl knobs now that we can support dates into the
25th century. Note that removal of V4 support will not happen until
the early 2030s.
- Fix some probles with inode realtime flag propagation.
- Fix some buffer handling issues when growing a rt filesystem.
- Fix a problem where a BMAP_REMAP unmap call would free rt extents
even though the purpose of BMAP_REMAP is to avoid freeing the
blocks.
- Strengthen the dabtree online scrubber to check hash values on
child dabtree blocks.
- Actually log new intent items created as part of recovering log
intent items.
- Fix a bug where quotas weren't attached to an inode undergoing bmap
intent item recovery.
- Fix a buffer overrun problem with specially crafted log buffer
headers.
- Various cleanups to type usage and slightly inaccurate comments.
- More cleanups to the xattr, log, and quota code.
- Don't run the (slower) shared-rmap operations on attr fork
mappings.
- Fix a bug where we failed to check the LSN of finobt blocks during
replay and could therefore overwrite newer data with older data.
- Clean up the ugly nested transaction mess that log recovery uses to
stage intent item recovery in the correct order by creating a
proper data structure to capture recovered chains.
- Use the capture structure to resume intent item chains with the
same log space and block reservations as when they were captured.
- Fix a UAF bug in bmap intent item recovery where we failed to
maintain our reference to the incore inode if the bmap operation
needed to relog itself to continue.
- Rearrange the defer ops mechanism to finish newly created subtasks
of a parent task before moving on to the next parent task.
- Automatically relog intent items in deferred ops chains if doing so
would help us avoid pinning the log tail. This will help fix some
log scaling problems now and will facilitate atomic file updates
later.
- Fix a deadlock in the GETFSMAP implementation by using an internal
memory buffer to reduce indirect calls and copies to userspace,
thereby improving its performance by ~20%.
- Fix various problems when calling growfs on a realtime volume would
not fully update the filesystem metadata.
- Fix broken Kconfig asking about deprecated XFS when XFS is
disabled"
* tag 'xfs-5.10-merge-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (48 commits)
xfs: fix Kconfig asking about XFS_SUPPORT_V4 when XFS_FS=n
xfs: fix high key handling in the rt allocator's query_range function
xfs: annotate grabbing the realtime bitmap/summary locks in growfs
xfs: make xfs_growfs_rt update secondary superblocks
xfs: fix realtime bitmap/summary file truncation when growing rt volume
xfs: fix the indent in xfs_trans_mod_dquot
xfs: do the ASSERT for the arguments O_{u,g,p}dqpp
xfs: fix deadlock and streamline xfs_getfsmap performance
xfs: limit entries returned when counting fsmap records
xfs: only relog deferred intent items if free space in the log gets low
xfs: expose the log push threshold
xfs: periodically relog deferred intent items
xfs: change the order in which child and parent defer ops are finished
xfs: fix an incore inode UAF in xfs_bui_recover
xfs: clean up xfs_bui_item_recover iget/trans_alloc/ilock ordering
xfs: clean up bmap intent item recovery checking
xfs: xfs_defer_capture should absorb remaining transaction reservation
xfs: xfs_defer_capture should absorb remaining block reservations
xfs: proper replay of deferred ops queued during log recovery
xfs: remove XFS_LI_RECOVERED
...
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCX4n0/gAKCRDh3BK/laaZ
PM3jAP4xhaix0j/y3VyaxsUqWg6ZSrjq6X0o9clGMJv27IAtjgD/fJ7ZwzTldojD
qb7N3utjLiPVRjwFmvsZ8JZ7O7PbwQ0=
=oUbZ
-----END PGP SIGNATURE-----
Merge tag 'fuse-update-5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
Pull fuse updates from Miklos Szeredi:
- Support directly accessing host page cache from virtiofs. This can
improve I/O performance for various workloads, as well as reducing
the memory requirement by eliminating double caching. Thanks to Vivek
Goyal for doing most of the work on this.
- Allow automatic submounting inside virtiofs. This allows unique
st_dev/ st_ino values to be assigned inside the guest to files
residing on different filesystems on the host. Thanks to Max Reitz
for the patches.
- Fix an old use after free bug found by Pradeep P V K.
* tag 'fuse-update-5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: (25 commits)
virtiofs: calculate number of scatter-gather elements accurately
fuse: connection remove fix
fuse: implement crossmounts
fuse: Allow fuse_fill_super_common() for submounts
fuse: split fuse_mount off of fuse_conn
fuse: drop fuse_conn parameter where possible
fuse: store fuse_conn in fuse_req
fuse: add submount support to <uapi/linux/fuse.h>
fuse: fix page dereference after free
virtiofs: add logic to free up a memory range
virtiofs: maintain a list of busy elements
virtiofs: serialize truncate/punch_hole and dax fault path
virtiofs: define dax address space operations
virtiofs: add DAX mmap support
virtiofs: implement dax read/write operations
virtiofs: introduce setupmapping/removemapping commands
virtiofs: implement FUSE_INIT map_alignment field
virtiofs: keep a list of free dax memory ranges
virtiofs: add a mount option to enable dax
virtiofs: set up virtio_fs dax_device
...
This pull request introduces the following changes to zonefs:
* Add the "explicit-open" mount option to automatically issue a
REQ_OP_ZONE_OPEN command to the device whenever a sequential zone file
is open for writing for the first time. This avoids "insufficient zone
resources" errors for write operations on some drives with limited
zone resources or on ZNS drives with a limited number of active zones.
From Johannes.
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQSRPv8tYSvhwAzJdzjdoc3SxdoYdgUCX4zOWAAKCRDdoc3SxdoY
dh7PAP9IdcYnR9x6ttd2Aqsm17IBfY6b/TroE70Lm2YTlY0nTgD+IJTwYQG8KQAE
QHAe6TD6VQfSftOeAOAnjEG64Iv2hQE=
=vwu+
-----END PGP SIGNATURE-----
Merge tag 'zonefs-5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs
Pull zonefs updates from Damien Le Moal:
"Add an 'explicit-open' mount option to automatically issue a
REQ_OP_ZONE_OPEN command to the device whenever a sequential zone file
is open for writing for the first time.
This avoids 'insufficient zone resources' errors for write operations
on some drives with limited zone resources or on ZNS drives with a
limited number of active zones. From Johannes"
* tag 'zonefs-5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs:
zonefs: document the explicit-open mount option
zonefs: open/close zone on file open/close
zonefs: provide no-lock zonefs_io_error variant
zonefs: introduce helper for zone management
In crypt_message, when smb2_get_enc_key returns error, we need to
return the error back to the caller. If not, we end up processing
the message further, causing a kernel oops due to unwarranted access
of memory.
Call Trace:
smb3_receive_transform+0x120/0x870 [cifs]
cifs_demultiplex_thread+0xb53/0xc20 [cifs]
? cifs_handle_standard+0x190/0x190 [cifs]
kthread+0x116/0x130
? kthread_park+0x80/0x80
ret_from_fork+0x1f/0x30
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
CC: Stable <stable@vger.kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Now that 256 bit encryption can be negotiated, update
names of the nonces to match the updated official protocol
documentation (e.g. AES_GCM_NONCE instead of AES_128GCM_NONCE)
since they apply to both 128 bit and 256 bit encryption.
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
If server does not support AES-256-GCM and it was required on mount, print
warning message. Also log and return a different error message (EOPNOTSUPP)
when encryption mechanism is not supported vs the case when an unknown
unrequested encryption mechanism could be returned (EINVAL).
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
io_link_timeout_fn() removes REQ_F_LINK_TIMEOUT from the link head's
flags, it's not atomic and may race with what the head is doing.
If io_link_timeout_fn() doesn't clear the flag, as forced by this patch,
then it may happen that for "req -> link_timeout1 -> link_timeout2",
__io_kill_linked_timeout() would find link_timeout2 and try to cancel
it, so miscounting references. Teach it to ignore such double timeouts
by marking the active one with a new flag in io_prep_linked_timeout().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Move INIT_HLIST_NODE(&req->hash_node) into __io_arm_poll_handler(), so
that it doesn't duplicated and common poll code would be responsible for
it.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_poll_task_handler() doesn't add clarity, inline it in its only user.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_poll_add_prep() doesn't need to verify ->file because it's already
done in io_init_req().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
ctx->cached_cq_overflow is changed only under completion_lock. Convert
it from atomic_t to just int, and mark all places when it's read without
lock with READ_ONCE, which guarantees atomicity (relaxed ordering).
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Inline io_fail_links() and kill extra io_cqring_ev_posted().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Don't take an identity on personality/creds init only to drop it a few
lines after. Extract a function which prepares req->work but leaves it
without identity.
Note: it's safe to not check REQ_F_WORK_INITIALIZED there because it's
nobody had a chance to init it before io_init_req().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use IO_WQ_WORK_CREDS to figure out if req has creds to be used.
Since recently it should rely only on flags, but not value of
work.creds.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
commit 021a24460d ("block: add QUEUE_FLAG_NOWAIT") adds a new helper
function blk_queue_nowait() to check if the bdev supports handling of
REQ_NOWAIT or not. Since then bio-based dm device can also support
REQ_NOWAIT, and currently only dm-linear supports that since
commit 6abc49468e ("dm: add support for REQ_NOWAIT and enable it for
linear target").
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Merge yet more updates from Andrew Morton:
"Subsystems affected by this patch series: mm (memcg, migration,
pagemap, gup, madvise, vmalloc), ia64, and misc"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (31 commits)
mm: remove duplicate include statement in mmu.c
mm: remove the filename in the top of file comment in vmalloc.c
mm: cleanup the gfp_mask handling in __vmalloc_area_node
mm: remove alloc_vm_area
x86/xen: open code alloc_vm_area in arch_gnttab_valloc
xen/xenbus: use apply_to_page_range directly in xenbus_map_ring_pv
drm/i915: use vmap in i915_gem_object_map
drm/i915: stop using kmap in i915_gem_object_map
drm/i915: use vmap in shmem_pin_map
zsmalloc: switch from alloc_vm_area to get_vm_area
mm: allow a NULL fn callback in apply_to_page_range
mm: add a vmap_pfn function
mm: add a VM_MAP_PUT_PAGES flag for vmap
mm: update the documentation for vfree
mm/madvise: introduce process_madvise() syscall: an external memory hinting API
pid: move pidfd_get_pid() to pid.c
mm/madvise: pass mm to do_madvise
selftests/vm: 10x speedup for hmm-tests
binfmt_elf: take the mmap lock around find_extend_vma()
mm/gup_benchmark: take the mmap lock around GUP
...
UBI:
- Correctly use kthread_should_stop in ubi worker
UBIFS:
- Fixes for memory leaks while iterating directory entries
- Fix for a user triggerable error message
- Fix for a space accounting bug in authenticated mode
-----BEGIN PGP SIGNATURE-----
iQJKBAABCAA0FiEEdgfidid8lnn52cLTZvlZhesYu8EFAl+LRl4WHHJpY2hhcmRA
c2lnbWEtc3Rhci5hdAAKCRBm+VmF6xi7wVvTD/0ffSK9y4oECcW1/+84oHb5515g
5/CxYf5zalruOXWzA55OxI55BL18e1tnS26jT1G79BpNoitPLbqh/OhvDVtHUqEI
4Chd9PdeFb33GFubElcTviIBsGKMD2eYK5AVTX6fXxxRG8+UxT0u9T+1GXZGzlHA
0N3qWBuDhAlrh65UtORulpBOAexLymbSeINS7ibTXKqo3+sc70xjFZYTyt9Tr9np
q2VVI+SS019A30RrIzfeaSpyfDZ1tdh5vhfsm6eGbearHpUrX6OZgFUQglRCF7DS
bMTTODH+feJAPyknd9T1EdXrVNjzX24i1V3/wM8hC7qUZfaf1ZHeCuu0t353iiGn
dXg+qA/v+AKTYh71MRfHd8GJvmKHospdhze1K5IIvA+lL6+bRwur88KVF8PwyaB7
KHRAghKLEdVBb68MzwF0ChbFSUDk6VTZdvj0FB1LO/h3YQ1I2Dp1Z946qjJm9bWF
biATqHmR1hSPAyAP/VUGCdRqlwpuox5cUgoa7SwO3zP1aqBRG29Wsg1JJuqy74GN
Dcov5vIhVly/zns6tKaHFleTeAMBO7e1fhEjg4TuMpzWt9v8eYnacWwR1Hfvm9VO
+0rox1g5mDX/ZzPE/Koj3LFnr8JV8g0DPs0Uemz3D89xt3ftO0Du3NZSd8WgOcbS
mZcpgTakNv2/DbTiGA==
=RD+h
-----END PGP SIGNATURE-----
Merge tag 'for-linus-5.10-rc1-part2' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs
Pull more ubi and ubifs updates from Richard Weinberger:
"UBI:
- Correctly use kthread_should_stop in ubi worker
UBIFS:
- Fixes for memory leaks while iterating directory entries
- Fix for a user triggerable error message
- Fix for a space accounting bug in authenticated mode"
* tag 'for-linus-5.10-rc1-part2' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs:
ubifs: journal: Make sure to not dirty twice for auth nodes
ubifs: setflags: Don't show error message when vfs_ioc_setflags_prepare() fails
ubifs: ubifs_jnl_change_xattr: Remove assertion 'nlink > 0' for host inode
ubi: check kthread_should_stop() after the setting of task state
ubifs: dent: Fix some potential memory leaks while iterating entries
ubifs: xattr: Fix some potential memory leaks while iterating entries
- Kernel-doc fixes
- Fixes for memory leaks in authentication option parsing
-----BEGIN PGP SIGNATURE-----
iQJKBAABCAA0FiEEdgfidid8lnn52cLTZvlZhesYu8EFAl+LRSgWHHJpY2hhcmRA
c2lnbWEtc3Rhci5hdAAKCRBm+VmF6xi7wZU7EACGIvNIUvPl35GhUxRP7EBGOVdz
LPonjtBiipeuvIXnwviqzYcCwFJ2a5xFel4pEs3J58AYb9D6++qShgrSw7/g815Z
aEcLDAuSTkYIj9HzmI03xObe4UHtGoIXpHTj2e8vzq/doJMEXYtZeWzKyMW9IvgJ
dU4Lq1ZvNNaYpiaglE9ThJUG6oXT3rnHkWR/uFf0MOh2kZ6pH6U8QAGUTZp9ZdCF
Tzy1ruDS7+ZEox7j7cIBhrmN6NyaSk/I4C9tBisRxDRNZ/ku9b6URd384DKxvli1
8otLLYBxWypqx4FF0qF2Gyp7ZYGWDcnxLHgrmpafmsbwMqnOrOwjSD+whVKLdJqm
B2nXzxRw6BNDqViAxQ+K/KISoZq4w3/dREPgvBzjypiDKkYtJ6MksTj4Mbw514b6
iUjaKpGVMZF8PIjTxn/YWrbxR6UV/On+mbHs94FTyanldRw1GG9UcrcL5TyhaXj0
WeVhBiU4Aui+vwtBtdPCY1sf6iBE3x40P7tTWe6/CPgc7pE1InVU3ex0517gyEls
yY0JFVQPNeMIc9C/g12GbDiMA328FSxcCLKd+2+h3PAgcMd+RPvynHxIL5g9hxMC
GRFUGERxCVHU2VkrOpXA8P6Z3u3Lu9kZkz4UP1dMU+5tiBcw0xN0Ks9TvZm8s/WV
lGyhX9r+kpfNUiq4hQ==
=/Spl
-----END PGP SIGNATURE-----
Merge tag 'for-linus-5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs
Pull ubifs updates from Richard Weinberger:
- Kernel-doc fixes
- Fixes for memory leaks in authentication option parsing
* tag 'for-linus-5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs:
ubifs: mount_ubifs: Release authentication resource in error handling path
ubifs: Don't parse authentication mount options in remount process
ubifs: Fix a memleak after dumping authentication mount options
ubifs: Fix some kernel-doc warnings in tnc.c
ubifs: Fix some kernel-doc warnings in replay.c
ubifs: Fix some kernel-doc warnings in gc.c
ubifs: Fix 'hash' kernel-doc warning in auth.c
Patch series "introduce memory hinting API for external process", v9.
Now, we have MADV_PAGEOUT and MADV_COLD as madvise hinting API. With
that, application could give hints to kernel what memory range are
preferred to be reclaimed. However, in some platform(e.g., Android), the
information required to make the hinting decision is not known to the app.
Instead, it is known to a centralized userspace daemon(e.g.,
ActivityManagerService), and that daemon must be able to initiate reclaim
on its own without any app involvement.
To solve the concern, this patch introduces new syscall -
process_madvise(2). Bascially, it's same with madvise(2) syscall but it
has some differences.
1. It needs pidfd of target process to provide the hint
2. It supports only MADV_{COLD|PAGEOUT|MERGEABLE|UNMEREABLE} at this
moment. Other hints in madvise will be opened when there are explicit
requests from community to prevent unexpected bugs we couldn't support.
3. Only privileged processes can do something for other process's
address space.
For more detail of the new API, please see "mm: introduce external memory
hinting API" description in this patchset.
This patch (of 3):
In upcoming patches, do_madvise will be called from external process
context so we shouldn't asssume "current" is always hinted process's
task_struct.
Furthermore, we must not access mm_struct via task->mm, but obtain it via
access_mm() once (in the following patch) and only use that pointer [1],
so pass it to do_madvise() as well. Note the vma->vm_mm pointers are
safe, so we can use them further down the call stack.
And let's pass current->mm as arguments of do_madvise so it shouldn't
change existing behavior but prepare next patch to make review easy.
[vbabka@suse.cz: changelog tweak]
[minchan@kernel.org: use current->mm for io_uring]
Link: http://lkml.kernel.org/r/20200423145215.72666-1-minchan@kernel.org
[akpm@linux-foundation.org: fix it for upstream changes]
[akpm@linux-foundation.org: whoops]
[rdunlap@infradead.org: add missing includes]
Signed-off-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jann Horn <jannh@google.com>
Cc: Tim Murray <timmurray@google.com>
Cc: Daniel Colascione <dancol@google.com>
Cc: Sandeep Patil <sspatil@google.com>
Cc: Sonny Rao <sonnyrao@google.com>
Cc: Brian Geffon <bgeffon@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: John Dias <joaodias@google.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: SeongJae Park <sj38.park@gmail.com>
Cc: Christian Brauner <christian@brauner.io>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Oleksandr Natalenko <oleksandr@redhat.com>
Cc: SeongJae Park <sjpark@amazon.de>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Florian Weimer <fw@deneb.enyo.de>
Cc: <linux-man@vger.kernel.org>
Link: https://lkml.kernel.org/r/20200901000633.1920247-1-minchan@kernel.org
Link: http://lkml.kernel.org/r/20200622192900.22757-1-minchan@kernel.org
Link: http://lkml.kernel.org/r/20200302193630.68771-2-minchan@kernel.org
Link: http://lkml.kernel.org/r/20200622192900.22757-2-minchan@kernel.org
Link: https://lkml.kernel.org/r/20200901000633.1920247-2-minchan@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
create_elf_tables() runs after setup_new_exec(), so other tasks can
already access our new mm and do things like process_madvise() on it. (At
the time I'm writing this commit, process_madvise() is not in mainline
yet, but has been in akpm's tree for some time.)
While I believe that there are currently no APIs that would actually allow
another process to mess up our VMA tree (process_madvise() is limited to
MADV_COLD and MADV_PAGEOUT, and uring and userfaultfd cannot reach an mm
under which no syscalls have been executed yet), this seems like an
accident waiting to happen.
Let's make sure that we always take the mmap lock around GUP paths as long
as another process might be able to see the mm.
(Yes, this diff looks suspicious because we drop the lock before doing
anything with `vma`, but that's because we actually don't do anything with
it apart from the NULL check.)
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Michel Lespinasse <walken@google.com>
Cc: "Eric W . Biederman" <ebiederm@xmission.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Sakari Ailus <sakari.ailus@linux.intel.com>
Link: https://lkml.kernel.org/r/CAG48ez1-PBCdv3y8pn-Ty-b+FmBSLwDuVKFSt8h7wARLy0dF-Q@mail.gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently the remote memcg charging API consists of two functions:
memalloc_use_memcg() and memalloc_unuse_memcg(), which set and clear the
memcg value, which overwrites the memcg of the current task.
memalloc_use_memcg(target_memcg);
<...>
memalloc_unuse_memcg();
It works perfectly for allocations performed from a normal context,
however an attempt to call it from an interrupt context or just nest two
remote charging blocks will lead to an incorrect accounting. On exit from
the inner block the active memcg will be cleared instead of being
restored.
memalloc_use_memcg(target_memcg);
memalloc_use_memcg(target_memcg_2);
<...>
memalloc_unuse_memcg();
Error: allocation here are charged to the memcg of the current
process instead of target_memcg.
memalloc_unuse_memcg();
This patch extends the remote charging API by switching to a single
function: struct mem_cgroup *set_active_memcg(struct mem_cgroup *memcg),
which sets the new value and returns the old one. So a remote charging
block will look like:
old_memcg = set_active_memcg(target_memcg);
<...>
set_active_memcg(old_memcg);
This patch is heavily based on the patch by Johannes Weiner, which can be
found here: https://lkml.org/lkml/2020/5/28/806 .
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Dan Schatzberg <dschatzberg@fb.com>
Link: https://lkml.kernel.org/r/20200821212056.3769116-1-guro@fb.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When we try to use file already used as a quota file again (for the same
or different quota type), strange things can happen. At the very least
lockdep annotations may be wrong but also inode flags may be wrongly set
/ reset. When the file is used for two quota types at once we can even
corrupt the file and likely crash the kernel. Catch all these cases by
checking whether passed file is already used as quota file and bail
early in that case.
This fixes occasional generic/219 failure due to lockdep complaint.
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Reported-by: Ritesh Harjani <riteshh@linux.ibm.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20201015110330.28716-1-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
When ext4 is formatted with lazy_journal_init=1 and transactions from
the previous filesystem are still on disk, it is possible that they are
considered during a recovery after a crash. Because the checksum seed
has changed, the CRC check will fail, and the journal recovery fails
with checksum error although the journal is otherwise perfectly valid.
Fix the problem by checking commit block time stamps to determine
whether the data in the journal block is just stale or whether it is
indeed corrupt.
Reported-by: kernel test robot <lkp@intel.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Fengnan Chang <changfengnan@hikvision.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20201012164900.20197-1-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Here we use the READ_ONCE to fix race conditions in ->d_compare() and
->d_hash() when they are called in RCU-walk mode, seems we can use
the normal helper d_inode_rcu() to get the actual inode.
Signed-off-by: Kaixu Xia <kaixuxia@tencent.com>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Link: https://lore.kernel.org/r/1602317416-1260-1-git-send-email-kaixuxia@tencent.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
left shifting m_lblk by blkbits was causing value overflow and hence
it was not able to convert unwritten to written extent.
So, make sure we typecast it to loff_t before do left shift operation.
Also in func ext4_convert_unwritten_io_end_vec(), make sure to initialize
ret variable to avoid accidentally returning an uninitialized ret.
This patch fixes the issue reported in ext4 for bs < ps with
dioread_nolock mount option.
Fixes: c8cc88163f ("ext4: Add support for blocksize < pagesize in dioread_nolock")
Cc: stable@kernel.org
Reported-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/af902b5db99e8b73980c795d84ad7bb417487e76.1602168865.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
This implements journal callbacks j_submit|finish_inode_data_buffers()
with different behavior for data=journal: to write-protect pages under
commit, preventing changes to buffers writeably mapped to userspace.
If a buffer's content changes between commit's checksum calculation
and write-out to disk, it can cause journal recovery/mount failures
upon a kernel crash or power loss.
[ 27.334874] EXT4-fs: Warning: mounting with data=journal disables delayed allocation, dioread_nolock, and O_DIRECT support!
[ 27.339492] JBD2: Invalid checksum recovering data block 8705 in log
[ 27.342716] JBD2: recovery failed
[ 27.343316] EXT4-fs (loop0): error loading journal
mount: /ext4: can't read superblock on /dev/loop0.
In j_submit_inode_data_buffers() we write-protect the inode's pages
with write_cache_pages() and redirty w/ writepage callback if needed.
In j_finish_inode_data_buffers() there is nothing do to.
And in order to use the callbacks, inodes are added to the inode list
in transaction in __ext4_journalled_writepage() and ext4_page_mkwrite().
In ext4_page_mkwrite() we must make sure that the buffers are attached
to the transaction as jbddirty with write_end_fn(), as already done in
__ext4_journalled_writepage().
Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com>
Reported-by: Dann Frazier <dann.frazier@canonical.com>
Reported-by: kernel test robot <lkp@intel.com> # wbc.nr_to_write
Suggested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20201006004841.600488-5-mfo@canonical.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
These are two fixes for data journalling required by
the next patch, discovered while testing it.
First, the optimization to return early if all buffers
are mapped is not appropriate for the next patch:
The inode _must_ be added to the transaction's list in
data=journal mode (so to write-protect pages on commit)
thus we cannot return early there.
Second, once that optimization to reduce transactions
was disabled for data=journal mode, more transactions
happened, and occasionally hit this warning message:
'JBD2: Spotted dirty metadata buffer'.
Reason is, block_page_mkwrite() will set_buffer_dirty()
before do_journal_get_write_access() that is there to
prevent it. This issue was masked by the optimization.
So, on data=journal use __block_write_begin() instead.
This also requires page locking and len recalculation.
(see block_page_mkwrite() for implementation details.)
Finally, as Jan noted there is little sharing between
data=journal and other modes in ext4_page_mkwrite().
However, a prototype of ext4_journalled_page_mkwrite()
showed there still would be lots of duplicated lines
(tens of) that didn't seem worth it.
Thus this patch ends up with an ugly goto to skip all
non-data journalling code (to avoid long indentations,
but that can be changed..) in the beginning, and just
a conditional in the transaction section.
Well, we skip a common part to data journalling which
is the page truncated check, but we do it again after
ext4_journal_start() when we re-acquire the page lock
(so not to acquire the page lock twice needlessly for
data journalling.)
Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com>
Suggested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/20201006004841.600488-4-mfo@canonical.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Introduce journal callbacks to allow different behaviors
for an inode in journal_submit|finish_inode_data_buffers().
The existing users of the current behavior (ext4, ocfs2)
are adapted to use the previously exported functions
that implement the current behavior.
Users are callers of jbd2_journal_inode_ranged_write|wait(),
which adds the inode to the transaction's inode list with
the JI_WRITE|WAIT_DATA flags. Only ext4 and ocfs2 in-tree.
Both CONFIG_EXT4_FS and CONFIG_OCSFS2_FS select CONFIG_JBD2,
which builds fs/jbd2/commit.c and journal.c that define and
export the functions, so we can call directly in ext4/ocfs2.
Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com>
Suggested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/20201006004841.600488-3-mfo@canonical.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Export functions that implement the current behavior done
for an inode in journal_submit|finish_inode_data_buffers().
No functional change.
Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com>
Suggested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/20201006004841.600488-2-mfo@canonical.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Now we only use sb_bread_unmovable() to read superblock and descriptor
block at mount time, so there is no opportunity that we need to clear
buffer verified bit and also handle buffer write_io error bit. But for
the sake of unification, let's introduce ext4_sb_bread_unmovable() to
replace all sb_bread_unmovable(). After this patch, we stop using read
helpers in fs/buffer.c.
Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Link: https://lore.kernel.org/r/20200924073337.861472-8-yi.zhang@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
We have already remove open codes that invoke helpers provide by
fs/buffer.c in all places reading metadata buffers. This patch switch to
use ext4_sb_bread() to replace all sb_bread() helpers, which is
ext4_read_bh() helper back end.
Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Link: https://lore.kernel.org/r/20200924073337.861472-7-yi.zhang@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
If we readahead inode tables in __ext4_get_inode_loc(), it may bypass
buffer_write_io_error() check, so introduce ext4_sb_breadahead_unmovable()
to handle this special case.
This patch also replace sb_breadahead_unmovable() in ext4_fill_super()
for the sake of unification.
Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Link: https://lore.kernel.org/r/20200924073337.861472-6-yi.zhang@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
We have already introduced ext4_buffer_uptodate() to re-set the uptodate
bit on buffer which has been failed to write out to disk. Just remove
the redundant codes and switch to use ext4_buffer_uptodate() in
__ext4_get_inode_loc().
Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Link: https://lore.kernel.org/r/20200924073337.861472-5-yi.zhang@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Revome all open codes that read metadata buffers, switch to use
ext4_read_bh_*() common helpers.
Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Suggested-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20200924073337.861472-4-yi.zhang@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
The previous patch add clear_buffer_verified() before we read metadata
block from disk again, but it's rather easy to miss clearing of this bit
because currently we read metadata buffer through different open codes
(e.g. ll_rw_block(), bh_submit_read() and invoke submit_bh() directly).
So, it's time to add common helpers to unify in all the places reading
metadata buffers instead. This patch add 3 helpers:
- ext4_read_bh_nowait(): async read metadata buffer if it's actually
not uptodate, clear buffer_verified bit before read from disk.
- ext4_read_bh(): sync version of read metadata buffer, it will wait
until the read operation return and check the return status.
- ext4_read_bh_lock(): try to lock the buffer before read buffer, it
will skip reading if the buffer is already locked.
After this patch, we need to use these helpers in all the places reading
metadata buffer instead of different open codes.
Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Suggested-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20200924073337.861472-3-yi.zhang@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
The metadata buffer is no longer trusted after we read it from disk
again because it is not uptodate for some reasons (e.g. failed to write
back). Otherwise we may get below memory corruption problem in
ext4_ext_split()->memset() if we read stale data from the newly
allocated extent block on disk which has been failed to async write
out but miss verify again since the verified bit has already been set
on the buffer.
[ 29.774674] BUG: unable to handle kernel paging request at ffff88841949d000
...
[ 29.783317] Oops: 0002 [#2] SMP
[ 29.784219] R10: 00000000000f4240 R11: 0000000000002e28 R12: ffff88842fa1c800
[ 29.784627] CPU: 1 PID: 126 Comm: kworker/u4:3 Tainted: G D W
[ 29.785546] R13: ffffffff9cddcc20 R14: ffffffff9cddd420 R15: ffff88842fa1c2f8
[ 29.786679] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),BIOS ?-20190727_0738364
[ 29.787588] FS: 0000000000000000(0000) GS:ffff88842fa00000(0000) knlGS:0000000000000000
[ 29.789288] Workqueue: writeback wb_workfn
[ 29.790319] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 29.790321] (flush-8:0)
[ 29.790844] CR2: 0000000000000008 CR3: 00000004234f2000 CR4: 00000000000006f0
[ 29.791924] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 29.792839] RIP: 0010:__memset+0x24/0x30
[ 29.793739] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 29.794256] Code: 90 90 90 90 90 90 0f 1f 44 00 00 49 89 f9 48 89 d1 83 e2 07 48 c1 e9 033
[ 29.795161] Kernel panic - not syncing: Fatal exception in interrupt
...
[ 29.808149] Call Trace:
[ 29.808475] ext4_ext_insert_extent+0x102e/0x1be0
[ 29.809085] ext4_ext_map_blocks+0xa89/0x1bb0
[ 29.809652] ext4_map_blocks+0x290/0x8a0
[ 29.809085] ext4_ext_map_blocks+0xa89/0x1bb0
[ 29.809652] ext4_map_blocks+0x290/0x8a0
[ 29.810161] ext4_writepages+0xc85/0x17c0
...
Fix this by clearing buffer's verified bit if we read meta block from
disk again.
Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20200924073337.861472-2-yi.zhang@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
If userspace asked fsmap to try to count the number of entries, we cannot
return more than UINT_MAX entries because fmh_entries is u32.
Therefore, stop counting if we hit this limit or else we will waste time
to return truncated results.
Fixes: 0c9ec4beec ("ext4: support GETFSMAP ioctls")
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Link: https://lore.kernel.org/r/20201001222148.GA49520@magnolia
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Consider a situation when a filesystem was uncleanly shutdown and the
orphan list is not empty and a read-only mount is attempted. The orphan
list cleanup during mount will fail with:
ext4_check_bdev_write_error:193: comm mount: Error while async write back metadata
This happens because sbi->s_bdev_wb_err is not initialized when mounting
the filesystem in read only mode and so ext4_check_bdev_write_error()
falsely triggers.
Initialize sbi->s_bdev_wb_err unconditionally to avoid this problem.
Fixes: bc71726c72 ("ext4: abort the filesystem if failed to async write metadata buffer")
Cc: stable@kernel.org
Signed-off-by: Zhang Xiaoxu <zhangxiaoxu5@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20200928020556.710971-1-zhangxiaoxu5@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Rename system_blks to s_system_blks inside ext4_sb_info, keep
the naming rules consistent with other variables, which is
convenient for code reading and writing.
Signed-off-by: Chunguang Xu <brookxu@tencent.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/1600916623-544-2-git-send-email-brookxu@tencent.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Rename journal_dev to s_journal_dev inside ext4_sb_info, keep
the naming rules consistent with other variables, which is
convenient for code reading and writing.
Signed-off-by: Chunguang Xu <brookxu@tencent.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/1600916623-544-1-git-send-email-brookxu@tencent.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
In case if the file already has underlying blocks/extents allocated
then we don't need to start a journal txn and can directly return
the underlying mapping. Currently ext4_iomap_begin() is used by
both DAX & DIO path. We can check if the write request is an
overwrite & then directly return the mapping information.
This could give a significant perf boost for multi-threaded writes
specially random overwrites.
On PPC64 VM with simulated pmem(DAX) device, ~10x perf improvement
could be seen in random writes (overwrite). Also bcoz this optimizes
away the spinlock contention during jbd2 slab cache allocation
(jbd2_journal_handle). On x86 VM, ~2x perf improvement was observed.
Reported-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/88e795d8a4d5cd22165c7ebe857ba91d68d8813e.1600401668.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
The race condition could cause the persisted superblock checksum
to not match the contents of the superblock, causing the
superblock to be considered corrupt.
An example of the race follows. A first thread is interrupted in the
middle of a checksum calculation. Then, another thread changes the
superblock, calculates a new checksum, and sets it. Then, the first
thread resumes and sets the checksum based on the older superblock.
To fix, serialize the superblock checksum calculation using the buffer
header lock. While a spinlock is sufficient, the buffer header is
already there and there is precedent for locking it (e.g. in
ext4_commit_super).
Tested the patch by booting up a kernel with the patch, creating
a filesystem and some files (including some orphans), and then
unmounting and remounting the file system.
Cc: stable@kernel.org
Signed-off-by: Constantine Sapuntzakis <costa@purestorage.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Suggested-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20200914161014.22275-1-costa@purestorage.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
When ext4_journal_get_write_access() fails, we should
terminate the execution flow and release n_group_desc,
iloc.bh, dind and gdb_bh.
Cc: stable@kernel.org
Signed-off-by: Dinghao Liu <dinghao.liu@zju.edu.cn>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/20200829025403.3139-1-dinghao.liu@zju.edu.cn
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
The 'handle' argument is not used for anything so simply remove it.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/20200826133116.11592-1-nborisov@suse.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Fields s_free_blocks_count_hi, s_r_blocks_count_hi and s_blocks_count_hi
are not valid if EXT4_FEATURE_INCOMPAT_64BIT is not enabled and should be
treated as zeroes.
Signed-off-by: Petr Malat <oss@malat.biz>
Link: https://lore.kernel.org/r/20200825150016.3363-1-oss@malat.biz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Delete repeated words in fs/ext4/.
{the, this, of, we, after}
Also change spelling of "xttr" in inline.c to "xattr" in 2 places.
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20200805024850.12129-1-rdunlap@infradead.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
ext4_mb_discard_group_preallocations() can be releasing group lock with
preallocations accumulated on its local list. Thus although
discard_pa_seq was incremented and concurrent allocating processes will
be retrying allocations, it can happen that premature ENOSPC error is
returned because blocks used for preallocations are not available for
reuse yet. Make sure we always free locally accumulated preallocations
before releasing group lock.
Fixes: 07b5b8e1ac ("ext4: mballoc: introduce pcpu seqcnt for freeing PA to improve ENOSPC handling")
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20200924150959.4335-1-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
As we test disk offline/online with running fsstress, we find fsstress
process is keeping running state.
kworker/u32:3-262 [004] ...1 140.787471: ext4_mb_discard_preallocations: dev 8,32 needed 114
....
kworker/u32:3-262 [004] ...1 140.787471: ext4_mb_discard_preallocations: dev 8,32 needed 114
ext4_mb_new_blocks
repeat:
ext4_mb_discard_preallocations_should_retry(sb, ac, &seq)
freed = ext4_mb_discard_preallocations
ext4_mb_discard_group_preallocations
this_cpu_inc(discard_pa_seq);
---> freed == 0
seq_retry = ext4_get_discard_pa_seq_sum
for_each_possible_cpu(__cpu)
__seq += per_cpu(discard_pa_seq, __cpu);
if (seq_retry != *seq) {
*seq = seq_retry;
ret = true;
}
As we see seq_retry is sum of discard_pa_seq every cpu, if
ext4_mb_discard_group_preallocations return zero discard_pa_seq in this
cpu maybe increase one, so condition "seq_retry != *seq" have always
been met.
Ritesh Harjani suggest to in ext4_mb_discard_group_preallocations function we
only increase discard_pa_seq when there is some PA to free.
Fixes: 07b5b8e1ac ("ext4: mballoc: introduce pcpu seqcnt for freeing PA to improve ENOSPC handling")
Signed-off-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/20200916113859.1556397-3-yebin10@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
After moving ext4's bmap to iomap interface, swapon functionality
on files created using fallocate (which creates unwritten extents) are
failing. This is since iomap_bmap interface returns 0 for unwritten
extents and thus generic_swapfile_activate considers this as holes
and hence bail out with below kernel msg :-
[340.915835] swapon: swapfile has holes
To fix this we need to implement ->swap_activate aops in ext4
which will use ext4_iomap_report_ops. Since we only need to return
the list of extents so ext4_iomap_report_ops should be enough.
Cc: stable@kernel.org
Reported-by: Yuxuan Shui <yshuiv7@gmail.com>
Fixes: ac58e4fb03 ("ext4: move ext4 bmap to use iomap infrastructure")
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/20200904091653.1014334-1-riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
A previous commit changed the notification mode from true/false to an
int, allowing notify-no, notify-yes, or signal-notify. This was
backwards compatible in the sense that any existing true/false user
would translate to either 0 (on notification sent) or 1, the latter
which mapped to TWA_RESUME. TWA_SIGNAL was assigned a value of 2.
Clean this up properly, and define a proper enum for the notification
mode. Now we have:
- TWA_NONE. This is 0, same as before the original change, meaning no
notification requested.
- TWA_RESUME. This is 1, same as before the original change, meaning
that we use TIF_NOTIFY_RESUME.
- TWA_SIGNAL. This uses TIF_SIGPENDING/JOBCTL_TASK_WORK for the
notification.
Clean up all the callers, switching their 0/1/false/true to using the
appropriate TWA_* mode for notifications.
Fixes: e91b481623 ("task_work: teach task_work_add() to do signal_wake_up()")
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
__io_queue_proc() is used by both, poll reqs and apoll. Don't use
req->poll.events to copy poll mask because for apoll it aliases with
private data of the request.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>