-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAlwbm5YACgkQnJ2qBz9k
QNnF5wf+L9lWe4LzGf3sCc45JFtcsvLtcYd0XSn93ATRPSIxNYCCDEdvJ9UGfGhG
+2y4Z42KBWkqg2Dx8W2yiQjP1RxJTbWIAEE7uKKyWIojNMbtpjSjRw/n7VmCG7m/
ZHhO3XITHYRCVruIKVfwC0a9zPNWD6RD4+IhHRrZomC1ZrfTzs0OBc6PADXLchdR
0QRkpWlomD3CASGVioYbOFqX3pcDJWUdoFvOyMu2sELkJzAJjC7spB+MfoFeRJsA
QEUoRAdWCSFtxcuM6bkNMc1zXig3niLBka/ozXvwA0192rbOcyGvua3PwqYTdqnu
Bea36CD2G+ta0cBv1M5s7e87surttw==
=wdFw
-----END PGP SIGNATURE-----
Merge tag 'fsnotify_for_v4.21-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull fsnotify updates from Jan Kara:
"Support for new FAN_OPEN_EXEC event and couple of cleanups around
fsnotify"
* tag 'fsnotify_for_v4.21-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
fanotify: Use inode_is_open_for_write
fanotify: Make sure to check event_len when copying
fsnotify/fdinfo: include fdinfo.h for inotify_show_fdinfo()
fanotify: introduce new event mask FAN_OPEN_EXEC_PERM
fsnotify: refactor fsnotify_parent()/fsnotify() paired calls when event is on path
fanotify: introduce new event mask FAN_OPEN_EXEC
fanotify: return only user requested event types in event mask
This set is entirely trivial fixes, mainly around correct cleanup
on error paths and improved error checks. One patch adds scheduling
in a potentially long recovery loop.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJcGnvKAAoJEDgbc8f8gGmq4hcP/0vhF9y+SLXDkvm989n9QBMe
elJ5KhxeHQgYDScv2Nbth9heTGMw2OFzDAUSMRXD45eYy8EjM12Gl8Dz4y8pQBne
5ENeibzPYmD2Smvtyq5T/45HpErlM5NmcefICumaC9WNRQBei/6dD3UEteRnAZbC
qyNPr9/bqFSkMCzJzKRsjfduk5GTYB5FQzIOeHvwn3dShPtTETn+zF5MH/mpswBP
LEtimOZdXVUZzdmbAuOMOfAMLoRdpzLFNxC10g//JyZAbG3dQZdpEdn98iu0oR3c
Y7JUs6JLtkB/FP46v0Ay7xJ7amx8dNnKbEj7UzqveSgo3PxPbIZYa3VUdXfdP/9T
J/+k88behzQ5NG+pETKA51+DaVJ50nAVeQqX001dgUi095esjqmyQDPbDwohsRvf
z+gb0vqwtsbS58/1tRcfRdrzYl/FSHuEcif0rwOwTnqAjOjz+V89wNZo2HbLbOeK
8yyeeGd+jMxom9xwHV79/IPo3U2P5iU/40AfJXW2MMnudbq+kfE7GdNXu2MWPQ1P
yhmwvELoNG8aD93nX14HMhHFUl+Lv9p5VW/RhySFevt/9TKRiwlG1tw7e/XhNJjs
P8K8ry394FlPOPYeAsiabOI9dUi3Divm7/7ChUFSqQUc7wAHjpLbUlFczYCnVz3W
rb/GfmgBFcJQtzURXDnx
=p4Yr
-----END PGP SIGNATURE-----
Merge tag 'dlm-4.21' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm
Pull dlm updates from David Teigland:
"This set is entirely trivial fixes, mainly around correct cleanup on
error paths and improved error checks. One patch adds scheduling in a
potentially long recovery loop"
* tag 'dlm-4.21' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm:
dlm: fix invalid cluster name warning
dlm: NULL check before some freeing functions is not needed
dlm: NULL check before kmem_cache_destroy is not needed
dlm: fix missing idr_destroy for recover_idr
dlm: memory leaks on error path in dlm_user_request()
dlm: lost put_lkb on error path in receive_convert() and receive_unlock()
dlm: possible memory leak on error path in create_lkb()
dlm: fixed memory leaks after failed ls_remove_names allocation
dlm: fix possible call to kfree() for non-initialized pointer
dlm: Don't swamp the CPU with callbacks queued during recovery
dlm: don't leak kernel pointer to userspace
dlm: don't allow zero length names
dlm: fix invalid free
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAlwaO6gACgkQxWXV+ddt
WDuiuRAAktVw4AEogVKW8J9Fx7UCW8hJ0C2NFWSlcvaNfKi7ccMBsINqQIH6bvmv
lO2KUuaKRK+FID3R3Er9Z4GZmvg7nXAEp/+PQgYMYYOFOmsdr/YCZCjBn9ACDZ/K
XJD/ESpEq9FORMj1yHre3zKgCPwkOK66Z8XFRMPaC5ANiLNohuE25FkwTTva0Bse
TyOc8aSKmfXBqoNvlKS0qc52FUoS6aebrYfTA7l9FfmUJ0szZeHSCtRyY4EzlQu7
VLojNq7GkSXpGGQHTTSp73Bj1E2ZM6fwTgpC1NHAFnTCTFSr/GswdfMdmxMkNuqm
zsrioqIzh6kX8RXVvB81Vb1aCRXELhcpBDWK/5QbEPyQvfKxQJLoHBt3QqewwtnT
lClhNi4tx3ItfWddz24kq/E8aaIOQtECq6OQXpyRPc47TmKwY8m8V41LP/ECNBQ2
InIuAsk0abkQvemc98SNxDNV0hiMJHJnR93wY8MEs6JTY3PuVE6ELrMTB7GRsl6M
zHHFD1kSBVNMxWr+pwnDe5NxQ3qP4SB1RZvTlN7FdzLAhWIFEN7foHoXI9h2G0kx
+H/cVr2OiXdxUd19IuAF9kxVkfrJ7A3JEopVdaI57swyu/IK2xGsPd5nDkf4kOXU
OpcnnbDxg6v0NL/Vr/LxWSa6qgVqM2D4JyxjMTMPQOhK7Y6GP4M=
=rO1q
-----END PGP SIGNATURE-----
Merge tag 'for-4.21-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"New features:
- swapfile support - after a long time it's here, with some
limitations where COW design does not work well with the swap
implementation (nodatacow file, no compression, cannot be
snapshotted, not possible on multiple devices, ...), as this is the
most restricted but working setup, we'll try to improve that in the
future
- metadata uuid - an optional incompat feature to assign a new
filesystem UUID without overwriting all metadata blocks, stored
only in superblock
- more balance messages are printed to system log, initial is in the
format of the command line that would be used to start it
Fixes:
- tag pages of a snapshot to better separate pages that are involved
in the snapshot (and need to get synced) from newly dirtied pages
that could slow down or even livelock the snapshot operation
- improved check of filesystem id associated with a device during
scan to detect duplicate devices that could be mixed up during
mount
- fix device replace state transitions, eg. when it ends up
interrupted and reboot tries to restart balance too, or when
start/cancel ioctls race
- fix a crash due to a race when quotas are enabled during snapshot
creation
- GFP_NOFS/memalloc_nofs_* fixes due to GFP_KERNEL allocations in
transaction context
- fix fsync of files with multiple hard links in new directories
- fix race of send with transaction commits that create snapshots
Core changes:
- cleanups:
* further removals of now-dead fsync code
* core function for finding free extent has been split and
provides a base for further cleanups to make the logic more
understandable
* removed lot of indirect callbacks for data and metadata inodes
* simplified refcounting and locking for cloned extent buffers
* removed redundant function arguments
* defines converted to enums where appropriate
- separate reserve for delayed refs from global reserve, update logic
to do less trickery and ad-hoc heuristics, move out some related
expensive operations from transaction commit or file truncate
- dev-replace switched from custom locking scheme to semaphore
- remove first phase of balance that tried to make some space for the
relocation by calling shrink and grow, this did not work as
expected and only introduced more error states due to potential
resize failures, slightly improves the runtime as the chunks on all
devices are not needlessly enumerated
- clone and deduplication now use generic helper that adds a few more
checks that were missing from the original btrfs implementation of
the ioctls"
* tag 'for-4.21-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (125 commits)
btrfs: Fix typos in comments and strings
btrfs: improve error handling of btrfs_add_link
Btrfs: use generic_remap_file_range_prep() for cloning and deduplication
btrfs: Refactor main loop in extent_readpages
btrfs: Remove 1st shrink/grow phase from balance
Btrfs: send, fix race with transaction commits that create snapshots
Btrfs: use nofs context when initializing security xattrs to avoid deadlock
btrfs: run delayed items before dropping the snapshot
btrfs: catch cow on deleting snapshots
btrfs: extent-tree: cleanup one-shot usage of @blocksize in do_walk_down
Btrfs: scrub, move setup of nofs contexts higher in the stack
btrfs: scrub: move scrub_setup_ctx allocation out of device_list_mutex
btrfs: scrub: pass fs_info to scrub_setup_ctx
btrfs: fix truncate throttling
btrfs: don't run delayed refs in the end transaction logic
btrfs: rework btrfs_check_space_for_delayed_refs
btrfs: add new flushing states for the delayed refs rsv
btrfs: update may_commit_transaction to use the delayed refs rsv
btrfs: introduce delayed_refs_rsv
btrfs: only track ref_heads in delayed_ref_updates
...
- Enhancements and performance improvements to journal replay (Abhi Das)
- Cleanup of gfs2_is_ordered and gfs2_is_writeback (Andreas Gruenbacher)
- Fix a potential double-free in inode creation (Andreas Gruenbacher)
- Fix the bitmap search loop that was searching too far (Andreas Gruenbacher)
- Various cleanups (Andreas Gruenbacher, Bob Peterson)
- Implement Steve Whitehouse's patch to dump nrpages for inodes (Bob Peterson)
- Fix a withdraw bug where stuffed journaled data files didn't allocate
enough journal space to be grown (Bob Peterson)
-----BEGIN PGP SIGNATURE-----
iQEcBAABAgAGBQJcGlTwAAoJENeLYdPf93o7oGIIAIC5OfmIU2rz789L1s50m6Xg
fAJ4X+m3d1jbHZDAwd5BXYcCaIMQJpaL89JWZSyuN5vG4hmNq/Q1V0Ho4nU5onkY
sIQ5PlOf0I2uEbe2k9lAyWr2ro+S7QSfW+ogVIE7didOxYYRY825lxgMpJEChW+r
n0a7SqzPBHJglzNUVASrebD7VA7afzJ/Lb4YcXc7p6ewb7++/Kc3W0sewwy8kKER
29W21bxAhzm4vLMyU7h0g51AuLxBljijSZ7J1IHf0UPwQ/KC4QFuWVyufuH9ePuY
5yJxa25HTeZboIeeeuF01UDv6V9TOu83ztrGlI0uoR9rWPxPEA3Y1Spy6wQ/2w8=
=k9oZ
-----END PGP SIGNATURE-----
Merge tag 'gfs2-4.21.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2
Pull gfs2 updates from Bob Peterson:
- Enhancements and performance improvements to journal replay (Abhi
Das)
- Cleanup of gfs2_is_ordered and gfs2_is_writeback (Andreas
Gruenbacher)
- Fix a potential double-free in inode creation (Andreas Gruenbacher)
- Fix the bitmap search loop that was searching too far (Andreas
Gruenbacher)
- Various cleanups (Andreas Gruenbacher, Bob Peterson)
- Implement Steve Whitehouse's patch to dump nrpages for inodes (Bob
Peterson)
- Fix a withdraw bug where stuffed journaled data files didn't allocate
enough journal space to be grown (Bob Peterson)
* tag 'gfs2-4.21.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
gfs2: take jdata unstuff into account in do_grow
gfs2: Dump nrpages for inodes and their glocks
gfs2: Fix loop in gfs2_rbm_find
gfs2: Get rid of potential double-freeing in gfs2_create_inode
gfs2: Remove vestigial bd_ops
gfs2: read journal in large chunks to locate the head
gfs2: add a helper function to get_log_header that can be used elsewhere
gfs2: changes to gfs2_log_XXX_bio
gfs2: add more timing info to journal recovery process
gfs2: Fix the gfs2_invalidatepage description
gfs2: Clean up gfs2_is_{ordered,writeback}
Pull crypto updates from Herbert Xu:
"API:
- Add 1472-byte test to tcrypt for IPsec
- Reintroduced crypto stats interface with numerous changes
- Support incremental algorithm dumps
Algorithms:
- Add xchacha12/20
- Add nhpoly1305
- Add adiantum
- Add streebog hash
- Mark cts(cbc(aes)) as FIPS allowed
Drivers:
- Improve performance of arm64/chacha20
- Improve performance of x86/chacha20
- Add NEON-accelerated nhpoly1305
- Add SSE2 accelerated nhpoly1305
- Add AVX2 accelerated nhpoly1305
- Add support for 192/256-bit keys in gcmaes AVX
- Add SG support in gcmaes AVX
- ESN for inline IPsec tx in chcr
- Add support for CryptoCell 703 in ccree
- Add support for CryptoCell 713 in ccree
- Add SM4 support in ccree
- Add SM3 support in ccree
- Add support for chacha20 in caam/qi2
- Add support for chacha20 + poly1305 in caam/jr
- Add support for chacha20 + poly1305 in caam/qi2
- Add AEAD cipher support in cavium/nitrox"
* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (130 commits)
crypto: skcipher - remove remnants of internal IV generators
crypto: cavium/nitrox - Fix build with !CONFIG_DEBUG_FS
crypto: salsa20-generic - don't unnecessarily use atomic walk
crypto: skcipher - add might_sleep() to skcipher_walk_virt()
crypto: x86/chacha - avoid sleeping under kernel_fpu_begin()
crypto: cavium/nitrox - Added AEAD cipher support
crypto: mxc-scc - fix build warnings on ARM64
crypto: api - document missing stats member
crypto: user - remove unused dump functions
crypto: chelsio - Fix wrong error counter increments
crypto: chelsio - Reset counters on cxgb4 Detach
crypto: chelsio - Handle PCI shutdown event
crypto: chelsio - cleanup:send addr as value in function argument
crypto: chelsio - Use same value for both channel in single WR
crypto: chelsio - Swap location of AAD and IV sent in WR
crypto: chelsio - remove set but not used variable 'kctx_len'
crypto: ux500 - Use proper enum in hash_set_dma_transfer
crypto: ux500 - Use proper enum in cryp_set_dma_transfer
crypto: aesni - Add scatter/gather avx stubs, and use them in C
crypto: aesni - Introduce partial block macro
..
There is a security report where f2fs_getxattr() has a hole to expose wrong
memory region when the image is malformed like this.
f2fs_getxattr: entry->e_name_len: 4, size: 12288, buffer_size: 16384, len: 4
Cc: <stable@vger.kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
iput() on sbi->node_inode can update sbi->stat_info
in the below context, if the f2fs_write_checkpoint()
has failed with error.
f2fs_balance_fs_bg+0x1ac/0x1ec
f2fs_write_node_pages+0x4c/0x260
do_writepages+0x80/0xbc
__writeback_single_inode+0xdc/0x4ac
writeback_single_inode+0x9c/0x144
write_inode_now+0xc4/0xec
iput+0x194/0x22c
f2fs_put_super+0x11c/0x1e8
generic_shutdown_super+0x70/0xf4
kill_block_super+0x2c/0x5c
kill_f2fs_super+0x44/0x50
deactivate_locked_super+0x60/0x8c
deactivate_super+0x68/0x74
cleanup_mnt+0x40/0x78
Fix this by moving f2fs_destroy_stats() further below iput() in
both f2fs_put_super() and f2fs_fill_super() paths.
Signed-off-by: Sahitya Tummala <stummala@codeaurora.org>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
For all ordered cases in f2fs_wait_on_page_writeback(), we need to
check PageWriteback status, so let's clean up to relocate the check
into f2fs_wait_on_page_writeback().
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Treat "block_count" from struct f2fs_super_block as 64-bit little endian
value in sanity_check_raw_super() because struct f2fs_super_block
declares "block_count" as "__le64".
This fixes a bug where the superblock validation fails on big endian
devices with the following error:
F2FS-fs (sda1): Wrong segment_count / block_count (61439 > 0)
F2FS-fs (sda1): Can't find valid F2FS filesystem in 1th superblock
F2FS-fs (sda1): Wrong segment_count / block_count (61439 > 0)
F2FS-fs (sda1): Can't find valid F2FS filesystem in 2th superblock
As result of this the partition cannot be mounted.
With this patch applied the superblock validation works fine and the
partition can be mounted again:
F2FS-fs (sda1): Mounted with checkpoint version = 7c84
My little endian x86-64 hardware was able to mount the partition without
this fix.
To confirm that mounting f2fs filesystems works on big endian machines
again I tested this on a 32-bit MIPS big endian (lantiq) device.
Fixes: 0cfe75c5b0 ("f2fs: enhance sanity_check_raw_super() to avoid potential overflows")
Cc: stable@vger.kernel.org
Signed-off-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If user change inode's i_flags via ioctl, let's add it into global
dirty list, so that checkpoint can guarantee its persistence before
fsync, it can make checkpoint keeping strong consistency.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The union in struct extent_node wass only to indicate below fields
struct rb_node rb_node;
union {
struct {
unsigned int fofs;
unsigned int len;
...
...
can be parsed as fields in struct rb_entry, but they were never be
used explicitly before, so let's remove them for cleanup.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Should use lstart (logical start address) instead of start (in dev) here.
This fixes a bug in multi-device scenarios.
Signed-off-by: Qiuyang Sun <sunqiuyang@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
When there is a failure in f2fs_fill_super() after/during
the recovery of fsync'd nodes, it frees the current sbi and
retries again. This time the mount is successful, but the files
that got recovered before retry, still holds the extent tree,
whose extent nodes list is corrupted since sbi and sbi->extent_list
is freed up. The list_del corruption issue is observed when the
file system is getting unmounted and when those recoverd files extent
node is being freed up in the below context.
list_del corruption. prev->next should be fffffff1e1ef5480, but was (null)
<...>
kernel BUG at kernel/msm-4.14/lib/list_debug.c:53!
lr : __list_del_entry_valid+0x94/0xb4
pc : __list_del_entry_valid+0x94/0xb4
<...>
Call trace:
__list_del_entry_valid+0x94/0xb4
__release_extent_node+0xb0/0x114
__free_extent_tree+0x58/0x7c
f2fs_shrink_extent_tree+0xdc/0x3b0
f2fs_leave_shrinker+0x28/0x7c
f2fs_put_super+0xfc/0x1e0
generic_shutdown_super+0x70/0xf4
kill_block_super+0x2c/0x5c
kill_f2fs_super+0x44/0x50
deactivate_locked_super+0x60/0x8c
deactivate_super+0x68/0x74
cleanup_mnt+0x40/0x78
__cleanup_mnt+0x1c/0x28
task_work_run+0x48/0xd0
do_notify_resume+0x678/0xe98
work_pending+0x8/0x14
Fix this by not creating extents for those recovered files if shrinker is
not registered yet. Once mount is successful and shrinker is registered,
those files can have extents again.
Signed-off-by: Sahitya Tummala <stummala@codeaurora.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch cleans up checkpoint flow a bit:
- remove unneeded circulation of flushing meta pages.
- don't flush nat_bits pages in prior to other checkpoint pages.
- add bug_on to check remained meta pages after flushing.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Sometimes, I could observe # of issuing_discard to be 1 which blocks background
jobs due to is_idle()=false.
The only way to get out of it was to trigger gc_urgent. This patch avoids that
by checking any candidates as done in the list.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
One report says memalloc failure during mount.
(unwind_backtrace) from [<c010cd4c>] (show_stack+0x10/0x14)
(show_stack) from [<c049c6b8>] (dump_stack+0x8c/0xa0)
(dump_stack) from [<c024fcf0>] (warn_alloc+0xc4/0x160)
(warn_alloc) from [<c0250218>] (__alloc_pages_nodemask+0x3f4/0x10d0)
(__alloc_pages_nodemask) from [<c0270450>] (kmalloc_order_trace+0x2c/0x120)
(kmalloc_order_trace) from [<c03fa748>] (build_node_manager+0x35c/0x688)
(build_node_manager) from [<c03de494>] (f2fs_fill_super+0xf0c/0x16cc)
(f2fs_fill_super) from [<c02a5864>] (mount_bdev+0x15c/0x188)
(mount_bdev) from [<c03da624>] (f2fs_mount+0x18/0x20)
(f2fs_mount) from [<c02a68b8>] (mount_fs+0x158/0x19c)
(mount_fs) from [<c02c3c9c>] (vfs_kern_mount+0x78/0x134)
(vfs_kern_mount) from [<c02c76ac>] (do_mount+0x474/0xca4)
(do_mount) from [<c02c8264>] (SyS_mount+0x94/0xbc)
(SyS_mount) from [<c0108180>] (ret_fast_syscall+0x0/0x48)
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Commit 089842de ("f2fs: remove codes of unused wio_mutex") removes codes
of unused wio_mutex, but missing the comment, so delete it.
Signed-off-by: Yunlong Song <yunlong.song@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Pull RCU updates from Ingo Molnar:
"The biggest RCU changes in this cycle were:
- Convert RCU's BUG_ON() and similar calls to WARN_ON() and similar.
- Replace calls of RCU-bh and RCU-sched update-side functions to
their vanilla RCU counterparts. This series is a step towards
complete removal of the RCU-bh and RCU-sched update-side functions.
( Note that some of these conversions are going upstream via their
respective maintainers. )
- Documentation updates, including a number of flavor-consolidation
updates from Joel Fernandes.
- Miscellaneous fixes.
- Automate generation of the initrd filesystem used for rcutorture
testing.
- Convert spin_is_locked() assertions to instead use lockdep.
( Note that some of these conversions are going upstream via their
respective maintainers. )
- SRCU updates, especially including a fix from Dennis Krein for a
bag-on-head-class bug.
- RCU torture-test updates"
* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (112 commits)
rcutorture: Don't do busted forward-progress testing
rcutorture: Use 100ms buckets for forward-progress callback histograms
rcutorture: Recover from OOM during forward-progress tests
rcutorture: Print forward-progress test age upon failure
rcutorture: Print time since GP end upon forward-progress failure
rcutorture: Print histogram of CB invocation at OOM time
rcutorture: Print GP age upon forward-progress failure
rcu: Print per-CPU callback counts for forward-progress failures
rcu: Account for nocb-CPU callback counts in RCU CPU stall warnings
rcutorture: Dump grace-period diagnostics upon forward-progress OOM
rcutorture: Prepare for asynchronous access to rcu_fwd_startat
torture: Remove unnecessary "ret" variables
rcutorture: Affinity forward-progress test to avoid housekeeping CPUs
rcutorture: Break up too-long rcu_torture_fwd_prog() function
rcutorture: Remove cbflood facility
torture: Bring any extra CPUs online during kernel startup
rcutorture: Add call_rcu() flooding forward-progress tests
rcutorture/formal: Replace synchronize_sched() with synchronize_rcu()
tools/kernel.h: Replace synchronize_sched() with synchronize_rcu()
net/decnet: Replace rcu_barrier_bh() with rcu_barrier()
...
Pull sparc updates from David Miller:
- Automatic system call table generation, from Firoz Khan.
- Clean up accesses to the OF device names by using full_name instead
of path_component_name.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc-next:
ALSA: sparc: Use of_node_name_eq for node name comparisons
sbus: Use of_node_name_eq for node name comparisons
sparc: generate uapi header and system call table files
sparc: add system call table generation support
sparc: add __NR_syscalls along with NR_syscalls
sparc: move __IGNORE* entries to non uapi header
sparc: Use DT node full_name instead of name for resources
sparc: Remove unused leon_trans_init
sparc: Use device_type helpers to access the node type
sparc: Use of_node_name_eq for node name comparisons
sparc: Convert to using %pOFn instead of device_node.name
sparc: prom: use property "name" directly to construct node names
of: Drop full path from full_name for PDT systems
sparc: Convert to using %pOF instead of full_name
fs/openpromfs: Use of_node_name_eq for node name comparisons
fs/openpromfs: use full_name instead of path_component_name
mds contains an optimization, it does not re-issue stale caps if
client does not want any cap.
A special case of the optimization is that client wants some caps,
but skipped updating 'wanted'. For this case, client needs to update
'wanted' when stale session get renewed.
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
When reading cached inode that already has Fscr caps, this can avoid
two cap messages (one updats 'wanted' caps, one clears 'wanted' caps).
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Updating mseq makes client think importer mds has accepted all prior
cap messages and importer mds knows what caps client wants. Actually
some cap messages may have been dropped because of mseq mismatch.
If mseq is left untouched, importing cap's mds_wanted later will get
reset by cap import message.
Cc: stable@vger.kernel.org
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
There is redundant assighment of variable i in
ceph_mdsmap_get_random_mds(), just remvoe it.
Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
splice_dentry() may drop the original dentry and return other
dentry. It relies on its caller to update pointer that points
to the dropped dentry. This is error-prone.
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Core changes:
- Parse the 4BAIT SFDP section
- Add a bunch of SPI NOR entries to the flash_info table
- Add the concept of SFDP fixups and use it to fix a bug on MX25L25635F
- A bunch of minor cleanups/comestic changes
NAND changes:
NAND core changes:
- kernel-doc miscellaneous fixes.
- Third batch of fixes/cleanup to the raw NAND core impacting various
controller drivers (ams-delta, marvell, fsmc, denali, tegra, vf610):
* Stopping to pass mtd_info objects to internal functions
* Reorganizing code to avoid forward declarations
* Dropping useless test in nand_legacy_set_defaults()
* Moving nand_exec_op() to internal.h
* Adding nand_[de]select_target() helpers
* Passing the CS line to be selected in struct nand_operation
* Making ->select_chip() optional when ->exec_op() is implemented
* Deprecating the ->select_chip() hook
* Moving the ->exec_op() method to nand_controller_ops
* Moving ->setup_data_interface() to nand_controller_ops
* Deprecating the dummy_controller field
* Fixing JEDEC detection
* Providing a helper for polling GPIO R/B pin
Raw NAND chip drivers changes:
- Macronix:
* Flagging 1.8V AC chips with a broken GET_FEATURES(TIMINGS)
Raw NAND controllers drivers changes:
- Ams-delta:
* Fixing the error path
* SPDX tag added
* May be compiled with COMPILE_TEST=y
* Conversion to ->exec_op() interface
* Dropping .IOADDR_R/W use
* Use GPIO API for data I/O
- Denali:
* Removing denali_reset_banks()
* Removing ->dev_ready() hook
* Including <linux/bits.h> instead of <linux/bitops.h>
* Changes to comply with the above fixes/cleanup done in the core.
- FSMC:
* Adding an SPDX tag to replace the license text
* Making conversion from chip to fsmc consistent
* Fixing unchecked return value in fsmc_read_page_hwecc
* Changes to comply with the above fixes/cleanup done in the core.
- Marvell:
* Preventing timeouts on a loaded machine (fix)
* Changes to comply with the above fixes/cleanup done in the core.
- OMAP2:
* Pass the parent of pdev to dma_request_chan() (fix)
- R852:
* Use generic DMA API
- sh_flctl:
* Converting to SPDX identifiers
- Sunxi:
* Write pageprog related opcodes to the right register: WCMD_SET (fix)
- Tegra:
* Stop implementing ->select_chip()
- VF610:
* Adding an SPDX tag to replace the license text
* Changes to comply with the above fixes/cleanup done in the core.
- Various trivial/spelling/coding style fixes.
SPI-NAND drivers changes:
- Removing the depreacated mt29f_spinand driver from staging.
- Adding support for:
* Toshiba TC58CVG2S0H
* GigaDevice GD5FxGQ4xA
* Winbond W25N01GV
JFFS2 changes:
- Fix a lockdep issue
MTD changes:
- Rework the physmap driver to merge gpio-addr-flash and physmap_of
in it
- Add a new compatible for RedBoot partitions
- Make sub-partitions RW if the parent partition was RO because of a
mis-alignment
- Add pinctrl support to the
- Addition of /* fall-through */ comments where appropriate
- Various minor fixes and cleanups
Other changes:
- Update my email address
-----BEGIN PGP SIGNATURE-----
iQJQBAABCgA6FiEEKmCqpbOU668PNA69Ze02AX4ItwAFAlwZRoocHGJvcmlzLmJy
ZXppbGxvbkBib290bGluLmNvbQAKCRBl7TYBfgi3ALhmEACGoyJLVxRsE0yXOex7
JdPhZxnOkwnzaxQPQBQ7pFrWvRinV1Th8JenCCLiyM6DBcIDRo/87etC6cnjfnH/
eJ8e7xPLiESCsAB8xixP1YvLaZQzjBKOH/+qNR6anp505ZNKhgsuUDATLSFfaEvz
bZv3f1V70dPXdEK/3QXDZakqVtfvVBMSdyJMWdKSdVn70yA72wCPP4igcGdfYfoZ
KKaanhS1EdxA++WFrRqocQay9rtjnFGONHLfrefop9YKSevLp1UDZSAYSk4CHcEf
yaMFD+qlxc0wHlOk4XyDY3vREcdO50r2vXfN/Hxf0D6JeC/6RT0Hrm0YyY8X/u7l
xLhSv+DKGyft3SnQ4MDdwg57bvDSOkPryI3cxBul3S6I00pCdo8l5zQskezDbk0E
CkUuB+f7Wn3lmV7W8ZNbYHx0ljVJEXMxEQ6m/6ZLizIr/aC7m5ncgv8a3mL7QQQA
KXuLamw9pPlf/tAOQxB1PTiE1n8ECYqcInJXzoxaRzUdn8TlIYjxUb7GIDoOFpzm
6a3oCx6tUZq4j/GAoyYgo12NhH0aYW3X/V8N7SfPjmIofvFiUL1y1VQ2ghNDLCqH
2cz6WsFGoA0oSVnxl/TSSM6/ZXbvnLgM4zq6+wZITyCFNwyC956MddZrjL6lVkMT
s6kvh/h3wBWN+GxwlDeY+KZMlQ==
=myqK
-----END PGP SIGNATURE-----
Merge tag 'mtd/for-4.21' of git://git.infradead.org/linux-mtd
Pull mtd updates from Boris Brezillon:
"SPI NOR Core changes:
- Parse the 4BAIT SFDP section
- Add a bunch of SPI NOR entries to the flash_info table
- Add the concept of SFDP fixups and use it to fix a bug on MX25L25635F
- A bunch of minor cleanups/comestic changes
NAND core changes:
- kernel-doc miscellaneous fixes.
- Third batch of fixes/cleanup to the raw NAND core impacting various
controller drivers (ams-delta, marvell, fsmc, denali, tegra,
vf610):
* Stop to pass mtd_info objects to internal functions
* Reorganize code to avoid forward declarations
* Drop useless test in nand_legacy_set_defaults()
* Move nand_exec_op() to internal.h
* Add nand_[de]select_target() helpers
* Pass the CS line to be selected in struct nand_operation
* Make ->select_chip() optional when ->exec_op() is implemented
* Deprecate the ->select_chip() hook
* Move the ->exec_op() method to nand_controller_ops
* Move ->setup_data_interface() to nand_controller_ops
* Deprecate the dummy_controller field
* Fix JEDEC detection
* Provide a helper for polling GPIO R/B pin
Raw NAND chip drivers changes:
- Macronix:
* Flag 1.8V AC chips with a broken GET_FEATURES(TIMINGS)
Raw NAND controllers drivers changes:
- Ams-delta:
* Fix the error path
* SPDX tag added
* May be compiled with COMPILE_TEST=y
* Conversion to ->exec_op() interface
* Drop .IOADDR_R/W use
* Use GPIO API for data I/O
- Denali:
* Remove denali_reset_banks()
* Remove ->dev_ready() hook
* Include <linux/bits.h> instead of <linux/bitops.h>
* Changes to comply with the above fixes/cleanup done in the core.
- FSMC:
* Add an SPDX tag to replace the license text
* Make conversion from chip to fsmc consistent
* Fix unchecked return value in fsmc_read_page_hwecc
* Changes to comply with the above fixes/cleanup done in the core.
- Marvell:
* Prevent timeouts on a loaded machine (fix)
* Changes to comply with the above fixes/cleanup done in the core.
- OMAP2:
* Pass the parent of pdev to dma_request_chan() (fix)
- R852:
* Use generic DMA API
- sh_flctl:
* Convert to SPDX identifiers
- Sunxi:
* Write pageprog related opcodes to the right register: WCMD_SET (fix)
- Tegra:
* Stop implementing ->select_chip()
- VF610:
* Add an SPDX tag to replace the license text
* Changes to comply with the above fixes/cleanup done in the core.
- Various trivial/spelling/coding style fixes.
SPI-NAND drivers changes:
- Remove the depreacated mt29f_spinand driver from staging.
- Add support for:
* Toshiba TC58CVG2S0H
* GigaDevice GD5FxGQ4xA
* Winbond W25N01GV
JFFS2 changes:
- Fix a lockdep issue
MTD changes:
- Rework the physmap driver to merge gpio-addr-flash and physmap_of
in it
- Add a new compatible for RedBoot partitions
- Make sub-partitions RW if the parent partition was RO because of a
mis-alignment
- Add pinctrl support to the
- Addition of /* fall-through */ comments where appropriate
- Various minor fixes and cleanups
Other changes:
- Update my email address"
* tag 'mtd/for-4.21' of git://git.infradead.org/linux-mtd: (108 commits)
mtd: rawnand: sunxi: Write pageprog related opcodes to WCMD_SET
MAINTAINERS: Update my email address
mtd: rawnand: marvell: prevent timeouts on a loaded machine
mtd: rawnand: omap2: Pass the parent of pdev to dma_request_chan()
mtd: rawnand: Fix JEDEC detection
mtd: spi-nor: Add support for is25lp016d
mtd: spi-nor: parse SFDP 4-byte Address Instruction Table
mtd: spi-nor: Add 4B_OPCODES flag to is25lp256
mtd: spi-nor: Add an SPDX tag to spi-nor.{c,h}
mtd: spi-nor: Make the enable argument passed to set_byte() a bool
mtd: spi-nor: Stop passing flash_info around
mtd: spi-nor: Avoid forward declaration of internal functions
mtd: spi-nor: Drop inline on all internal helpers
mtd: spi-nor: Add a post BFPT fixup for MX25L25635E
mtd: spi-nor: Add a post BFPT parsing fixup hook
mtd: spi-nor: Add the SNOR_F_4B_OPCODES flag
mtd: spi-nor: cast to u64 to avoid uint overflows
mtd: spi-nor: Add support for IS25LP032/064
mtd: spi-nor: add entry for mt35xu512aba flash
mtd: spi-nor: add macros related to MICRON flash
...
The ext4_inline_data_fiemap() function calls fiemap_fill_next_extent()
while still holding the xattr semaphore. This is not necessary and it
triggers a circular lockdep warning. This is because
fiemap_fill_next_extent() could trigger a page fault when it writes
into page which triggers a page fault. If that page is mmaped from
the inline file in question, this could very well result in a
deadlock.
This problem can be reproduced using generic/519 with a file system
configuration which has the inline_data feature enabled.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
There are enough credits reserved for most dioread_nolock writes;
however, if the extent tree is sufficiently deep, and/or quota is
enabled, the code was not allowing for all eventualities when
reserving journal credits for the unwritten extent conversion.
This problem can be seen using xfstests ext4/034:
WARNING: CPU: 1 PID: 257 at fs/ext4/ext4_jbd2.c:271 __ext4_handle_dirty_metadata+0x10c/0x180
Workqueue: ext4-rsv-conversion ext4_end_io_rsv_work
RIP: 0010:__ext4_handle_dirty_metadata+0x10c/0x180
...
EXT4-fs: ext4_free_blocks:4938: aborting transaction: error 28 in __ext4_handle_dirty_metadata
EXT4: jbd2_journal_dirty_metadata failed: handle type 11 started at line 4921, credits 4/0, errcode -28
EXT4-fs error (device dm-1) in ext4_free_blocks:4950: error 28
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
This will be needed by DFS cache.
Signed-off-by: Paulo Alcantara <palcantara@suse.de>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Different servers have different set of file ids.
After failover, unique IDs will be different so we can't validate
them.
Signed-off-by: Aurelien Aptel <aaptel@suse.com>
Reviewed-by: Paulo Alcantara <palcantara@suse.de>
Signed-off-by: Steve French <stfrench@microsoft.com>
If we only want to get the mount options strings, do not return the
devname.
For DFS failover, we'll be passing the DFS full path down to
cifs_mount() rather than the devname.
Signed-off-by: Paulo Alcantara <palcantara@suse.de>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
When extracting hostname from UNC, check for leading backslashes
before trying to remove them.
Signed-off-by: Paulo Alcantara <palcantara@suse.de>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
* Split and refactor the very large function cifs_mount() in multiple
functions:
- tcp, ses and tcon setup to mount_get_conns()
- tcp, ses and tcon cleanup in mount_put_conns()
- tcon tlink setup to mount_setup_tlink()
- remote path checking to is_path_remote()
* Implement 2 version of cifs_mount() for DFS-enabled builds and
non-DFS-enabled builds (CONFIG_CIFS_DFS_UPCALL).
In preparation for DFS failover support.
Signed-off-by: Paulo Alcantara <palcantara@suse.de>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
While resolving a bug with locks on samba shares found a strange behavior.
When a file locked by one node and we trying to lock it from another node
it fail with errno 5 (EIO) but in that case errno must be set to
(EACCES | EAGAIN).
This isn't happening when we try to lock file second time on same node.
In this case it returns EACCES as expected.
Also this issue not reproduces when we use SMB1 protocol (vers=1.0 in
mount options).
Further investigation showed that the mapping from status_to_posix_error
is different for SMB1 and SMB2+ implementations.
For SMB1 mapping is [NT_STATUS_LOCK_NOT_GRANTED to ERRlock]
(See fs/cifs/netmisc.c line 66)
but for SMB2+ mapping is [STATUS_LOCK_NOT_GRANTED to -EIO]
(see fs/cifs/smb2maperror.c line 383)
Quick changes in SMB2+ mapping from EIO to EACCES has fixed issue.
BUG: https://bugzilla.kernel.org/show_bug.cgi?id=201971
Signed-off-by: Georgy A Bystrenin <gkot@altlinux.org>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
CC: Stable <stable@vger.kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
When pinning memory failed, we should return the correct error code and
rewind the SMB credits.
Reported-by: Murphy Zhou <jencce.kernel@gmail.com>
Signed-off-by: Long Li <longli@microsoft.com>
Cc: stable@vger.kernel.org
Cc: Murphy Zhou <jencce.kernel@gmail.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
The current code attempts to pin memory using the largest possible wsize
based on the currect SMB credits. This doesn't cause kernel oops but this
is not optimal as we may pin more pages then actually needed.
Fix this by only pinning what are needed for doing this write I/O.
Signed-off-by: Long Li <longli@microsoft.com>
Cc: stable@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Joey Pabalinas <joeypabalinas@gmail.com>
RHBZ: 1021460
There is an issue where when multiple threads open/close the same directory
ntwrk_buf_start might end up being NULL, causing the call to smbCalcSize
later to oops with a NULL deref.
The real bug is why this happens and why this can become NULL for an
open cfile, which should not be allowed.
This patch tries to avoid a oops until the time when we fix the underlying
issue.
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
password_with_pad is a fixed size buffer of 16 bytes, it contains a
password string, to be padded with \0 if shorter than 16 bytes
but is just truncated if longer.
It is not, and we do not depend on it to be, nul terminated.
As such, do not use strncpy() to populate this buffer since
the str* prefix suggests that this is a string, which it is not,
and it also confuses coverity causing a false warning.
Detected by CoverityScan CID#113743 ("Buffer not null terminated")
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Fixes gcc '-Wunused-but-set-variable' warning:
fs/cifs/sess.c: In function '_sess_auth_rawntlmssp_assemble_req':
fs/cifs/sess.c:1157:18: warning:
variable 'smb_buf' set but not used [-Wunused-but-set-variable]
It never used since commit cc87c47d9d ("cifs: Separate rawntlmssp auth
from CIFS_SessSetup()")
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
To avoid the warning:
warning: this statement may fall through [-Wimplicit-fallthrough=]
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Reviewed-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reducing the number of network roundtrips improves the performance
of query xattrs
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Technically 3.02 is not the dialect name although that is more familiar to
many, so we should also accept the official dialect name (3.0.2 vs. 3.02)
in vers=
Signed-off-by: Kenneth D'souza <kdsouza@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
This is not actually a bug but as Coverity points out we shouldn't
be doing an "|=" on a value which hasn't been set (although technically
it was memset to zero so isn't a bug) and so might as well change
"|=" to "=" in this line
Detected by CoverityScan, CID#728535 ("Unitialized scalar variable")
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
As Coverity points out le16_to_cpu(midEntry->Command) can not be
less than zero.
Detected by CoverityScan, CID#1438650 ("Macro compares unsigned to 0")
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Improve performance by reducing number of network round trips
for set xattr.
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Trivial fix to clean up indentation, replace spaces with tab
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Pull vfs fixes from Al Viro:
"A couple of fixes - no common topic ;-)"
[ The aio spectre patch also came in from Jens, so now we have that
doubly fixed .. ]
* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
proc/sysctl: don't return ENOMEM on lookup when a table is unregistering
aio: fix spectre gadget in lookup_ioctx
This reverts commit 55956b59df.
commit 55956b59df ("vfs: Allow userns root to call mknod on owned filesystems.")
enabled mknod() in user namespaces for userns root if CAP_MKNOD is
available. However, these device nodes are useless since any filesystem
mounted from a non-initial user namespace will set the SB_I_NODEV flag on
the filesystem. Now, when a device node s created in a non-initial user
namespace a call to open() on said device node will fail due to:
bool may_open_dev(const struct path *path)
{
return !(path->mnt->mnt_flags & MNT_NODEV) &&
!(path->mnt->mnt_sb->s_iflags & SB_I_NODEV);
}
The problem with this is that as of the aforementioned commit mknod()
creates partially functional device nodes in non-initial user namespaces.
In particular, it has the consequence that as of the aforementioned commit
open() will be more privileged with respect to device nodes than mknod().
Before it was the other way around. Specifically, if mknod() succeeded
then it was transparent for any userspace application that a fatal error
must have occured when open() failed.
All of this breaks multiple userspace workloads and a widespread assumption
about how to handle mknod(). Basically, all container runtimes and systemd
live by the slogan "ask for forgiveness not permission" when running user
namespace workloads. For mknod() the assumption is that if the syscall
succeeds the device nodes are useable irrespective of whether it succeeds
in a non-initial user namespace or not. This logic was chosen explicitly
to allow for the glorious day when mknod() will actually be able to create
fully functional device nodes in user namespaces.
A specific problem people are already running into when running 4.18 rc
kernels are failing systemd services. For any distro that is run in a
container systemd services started with the PrivateDevices= property set
will fail to start since the device nodes in question cannot be
opened (cf. the arguments in [1]).
Full disclosure, Seth made the very sound argument that it is already
possible to end up with partially functional device nodes. Any filesystem
mounted with MS_NODEV set will allow mknod() to succeed but will not allow
open() to succeed. The difference to the case here is that the MS_NODEV
case is transparent to userspace since it is an explicitly set mount option
while the SB_I_NODEV case is an implicit property enforced by the kernel
and hence opaque to userspace.
[1]: https://github.com/systemd/systemd/pull/9483
Signed-off-by: Christian Brauner <christian@brauner.io>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Seth Forshee <seth.forshee@canonical.com>
Cc: Serge Hallyn <serge@hallyn.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
At mount time, we allocate m_rsum_cache with the number of realtime
bitmap blocks. However, xfs_growfs_rt() can increase the number of
realtime bitmap blocks. Using the cache after this happens may access
out of the bounds of the cache. Fix it by reallocating the cache in this
case.
Fixes: 355e353213 ("xfs: cache minimum realtime summary level")
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
get_unlocked_entry() uses an exclusive wait because it is guaranteed to
eventually obtain the lock and follow on with an unlock+wakeup cycle.
The wait_entry_unlocked() path does not have the same guarantee. Rather
than open-code an extra wakeup, just switch to a non-exclusive wait.
Cc: Jan Kara <jack@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
This patch removes the check from nfs_compare_mount_options to see if a
`sec' option was passed for the current mount before comparing auth
flavors and instead just always compares auth flavors.
Consider the following scenario:
You have a server with the address 192.168.1.1 and two exports /export/a
and /export/b. The first export supports `sys' and `krb5' security, the
second just `sys'.
Assume you start with no mounts from the server.
The following results in EIOs being returned as the kernel nfs client
incorrectly thinks it can share the underlying `struct nfs_server's:
$ mkdir /tmp/{a,b}
$ sudo mount -t nfs -o vers=3,sec=krb5 192.168.1.1:/export/a /tmp/a
$ sudo mount -t nfs -o vers=3 192.168.1.1:/export/b /tmp/b
$ df >/dev/null
df: ‘/tmp/b’: Input/output error
Signed-off-by: Chris Perl <cperl@janestreet.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAlwcW2IACgkQiiy9cAdy
T1FaLAv/Vs0QhYATaJHkvmjk7EdFoXTZST2wYxPcPSrOfWbCaMjsKREwPVUjOCiW
W//zL/Qbi0FSVMWdoskJP0KQC5rAhUSxlvtDxrpzqszfeJprIs0oL5IRlOwxNdtT
F9/i8+s/AuYq0TTn15UtqZHS6wZdWcerdttWV1V/97hEwcO5Xg0pEtyCmLPf7k8W
wNdxCAQFZ9j2pDVyuJO3a0+Tas34dc2t/cac12h0qeXbrE6e88bQWa6bpEKSvCKr
cMU94pkajCKeelZOhq+ga7cCmlBJs6gt4sgsKEsoDn72tQCsWVH6p1N4+AxmLsZU
bR65XodusR1WHMesSth7QraUk0pIQ4ZzMRPZJCkh9bSjaa+fxX1Up/sjn74q1prf
DHJ/52rQrWK3hvETUZD2B6N9AEDN0swbqeCJRlUYlzG5OEdfit9qSgfTaRzYxVnX
+tct7j+8mjzk+rsGuTXrQupPXUndPTcpUrKFp5db9Sejcx/Gw/atKhW6mVgL3LiR
8kVIraSV
=+8yd
-----END PGP SIGNATURE-----
Merge tag '4.20-rc7-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull smb3 fix from Steve French:
"An important smb3 fix for an regression to some servers introduced by
compounding optimization to rmdir.
This fix has been tested by multiple developers (including me) with
the usual private xfstesting, but also by the new cifs/smb3 "buildbot"
xfstest VMs (thank you Ronnie and Aurelien for good work on this
automation). The automated testing has been updated so that it will
catch problems like this in the future.
Note that Pavel discovered (very recently) some unrelated but
extremely important bugs in credit handling (smb3 flow control problem
that can lead to disconnects/reconnects) when compounding, that I
would have liked to send in ASAP but the complete testing of those two
fixes may not be done in time and have to wait for 4.21"
* tag '4.20-rc7-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
smb3: Fix rmdir compounding regression to strict servers
Adding options to growing mnt_opts. NFS kludge with passing
context= down into non-text-options mount switched to it, and
with that the last use of ->sb_parse_opts_str() is gone.
Reviewed-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Keep void * instead, allocate on demand (in parse_str_opts, at the
moment). Eventually both selinux and smack will be better off
with private structures with several strings in those, rather than
this "counter and two pointers to dynamically allocated arrays"
ugliness. This commit allows to do that at leisure, without
disrupting anything outside of given module.
Changes:
* instead of struct security_mnt_opt use an opaque pointer
initialized to NULL.
* security_sb_eat_lsm_opts(), security_sb_parse_opts_str() and
security_free_mnt_opts() take it as var argument (i.e. as void **);
call sites are unchanged.
* security_sb_set_mnt_opts() and security_sb_remount() take
it by value (i.e. as void *).
* new method: ->sb_free_mnt_opts(). Takes void *, does
whatever freeing that needs to be done.
* ->sb_set_mnt_opts() and ->sb_remount() might get NULL as
mnt_opts argument, meaning "empty".
Reviewed-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* if mount(2) passes something like "context=foo" with MS_REMOUNT
in flags (/sbin/mount.nfs will _not_ do that - you need to issue
the syscall manually), you'll get leaked copies for LSM options.
The reason is that instead of nfs_{alloc,free}_parsed_mount_data()
nfs_remount() uses kzalloc/kfree, which lacks the needed cleanup.
* selinux options are not changed on remount (as for any other
fs), but in case of NFS the failure is quiet - they are not compared
to what we used to have, with complaint in case of attempted changes.
Trivially fixed by converting to use of security_sb_remount().
Reviewed-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
1) keeping a copy in btrfs_fs_info is completely pointless - we never
use it for anything. Getting rid of that allows for simpler calling
conventions for setup_security_options() (caller is responsible for
freeing mnt_opts in all cases).
2) on remount we want to use ->sb_remount(), not ->sb_set_mnt_opts(),
same as we would if not for FS_BINARY_MOUNTDATA. Behaviours *are*
close (in fact, selinux sb_set_mnt_opts() ought to punt to
sb_remount() in "already initialized" case), but let's handle
that uniformly. And the only reason why the original btrfs changes
didn't go for security_sb_remount() in btrfs_remount() case is that
it hadn't been exported. Let's export it for a while - it'll be
going away soon anyway.
Reviewed-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
... leaving the "is it kernel-internal" logics in the caller.
Reviewed-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
combination of alloc_secdata(), security_sb_copy_data(),
security_sb_parse_opt_str() and free_secdata().
Reviewed-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This paves the way for retaining the LSM options from a common filesystem
mount context during a mount parameter parsing phase to be instituted prior
to actual mount/reconfiguration actions.
Reviewed-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This paves the way for retaining the LSM options from a common filesystem
mount context during a mount parameter parsing phase to be instituted prior
to actual mount/reconfiguration actions.
Reviewed-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
iomap_is_partially_uptodate() is intended to check wither blocks within
the selected range of a not-uptodate page are uptodate; if the range we
care about is up to date, it's an optimization.
However, the iomap implementation continues to check all blocks up to
from+count, which is beyond the page, and can even be well beyond the
iop->uptodate bitmap.
I think the worst that will happen is that we may eventually find a zero
bit and return "not partially uptodate" when it would have otherwise
returned true, and skip the optimization. Still, it's clearly an invalid
memory access that must be fixed.
So: fix this by limiting the search to within the page as is done in the
non-iomap variant, block_is_partially_uptodate().
Zorro noticed thiswhen KASAN went off for 512 byte blocks on a 64k
page system:
BUG: KASAN: slab-out-of-bounds in iomap_is_partially_uptodate+0x1a0/0x1e0
Read of size 8 at addr ffff800120c3a318 by task fsstress/22337
Reported-by: Zorro Lang <zlang@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
- Kconfig dependency fixes for our new auth feature
- Fix for selecting the right compressor when creating a fs
- Bugfix for a bug in UBIFS's O_TMPFILE implementation
- Refcounting fixes for UBI
-----BEGIN PGP SIGNATURE-----
iQJKBAABCAA0FiEEdgfidid8lnn52cLTZvlZhesYu8EFAlwb+MAWHHJpY2hhcmRA
c2lnbWEtc3Rhci5hdAAKCRBm+VmF6xi7wSIjD/9zyAQUaMXJzI/ct2SXXikyTgRi
hmLbUhuIEZD/c3tDQMgb3XqL7yYkOt/QnfeMXpi/Fy9wAdKcOU9QCcL6TPfR9B+B
v22VdDcsU5pp2Hcinm2Yz6v2OyYwchb2YShfMkiZSf0CKAhcsKZhaeqhug/Zq+w5
xZ6fMPvu8HhinBavq9OgSpPx892vlF1ypXHNeQNxwslz/gWKso47jtmXpz4LDX7+
s4+p+lagVJ0MHQ3fVhBvEplWr9zPKHvX8uGohADLsDp5bAqRyGNxBwqgGm6J6H7B
UHuqUjUf19hUZEauwKugmPTM3Y0TkZjV472qsp9omx8w/XozVqfb4YgMBBE3Qwuy
ZS7GcaGxf54WJbOJFQv7pS3DUmfOjQBnU7YAR7phMBDeOUfxRXWkuzIB+RkED+0d
7JjTGWQItdbTfXl2QWfSr/1rj45LNKmVPMAI2pKUBnC9dD9pktCcB90vTcrg2vjE
3U9GbqOVORZvHh2bcR0Y8uDUBOp4v7ybCc+OOTNVWuWQpckp79E/7dkVLbpoI12R
xEgMEaCSXlwPUxQ/FrxKBWhOZprPXNMo4xw547SDBxMq0IKUniA+YoUxtN3FxhQ8
OiLkfcQ+5bove9qYfARaRC/hk5lpgMWebAGou/CkfTIPJsw3ZdlmL6HH6ag5i6fS
fBHsVBdtv0ADSLXylg==
=nZML
-----END PGP SIGNATURE-----
Merge tag 'upstream-4.20-rc7' of git://git.infradead.org/linux-ubifs
Pull UBI/UBIFS fixes from Richard Weinberger:
- Kconfig dependency fixes for our new auth feature
- Fix for selecting the right compressor when creating a fs
- Bugfix for a bug in UBIFS's O_TMPFILE implementation
- Refcounting fixes for UBI
* tag 'upstream-4.20-rc7' of git://git.infradead.org/linux-ubifs:
ubifs: Handle re-linking of inodes correctly while recovery
ubi: Do not drop UBI device reference before using
ubi: Put MTD device after it is not used
ubifs: Fix default compression selection in ubifs
ubifs: Fix memory leak on error condition
ubifs: auth: Add CONFIG_KEYS dependency
ubifs: CONFIG_UBIFS_FS_AUTHENTICATION should depend on UBIFS_FS
ubifs: replay: Fix high stack usage
Separate just the changing of mount flags (MS_REMOUNT|MS_BIND) from full
remount because the mount data will get parsed with the new fs_context
stuff prior to doing a remount - and this causes the syscall to fail under
some circumstances.
To quote Eric's explanation:
[...] mount(..., MS_REMOUNT|MS_BIND, ...) now validates the mount options
string, which breaks systemd unit files with ProtectControlGroups=yes
(e.g. systemd-networkd.service) when systemd does the following to
change a cgroup (v1) mount to read-only:
mount(NULL, "/run/systemd/unit-root/sys/fs/cgroup/systemd", NULL,
MS_RDONLY|MS_NOSUID|MS_NODEV|MS_NOEXEC|MS_REMOUNT|MS_BIND, NULL)
... when the kernel has CONFIG_CGROUPS=y but no cgroup subsystems
enabled, since in that case the error "cgroup1: Need name or subsystem
set" is hit when the mount options string is empty.
Probably it doesn't make sense to validate the mount options string at
all in the MS_REMOUNT|MS_BIND case, though maybe you had something else
in mind.
This is also worthwhile doing because we will need to add a mount_setattr()
syscall to take over the remount-bind function.
Reported-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Reviewed-by: David Howells <dhowells@redhat.com>
Only the mount namespace code that implements mount(2) should be using the
MS_* flags. Suppress them inside the kernel unless uapi/linux/mount.h is
included.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Reviewed-by: David Howells <dhowells@redhat.com>
This reverts commit 61c6de6672.
The reverted commit added page reference counting to iomap page
structures that are used to track block size < page size state. This
was supposed to align the code with page migration page accounting
assumptions, but what it has done instead is break XFS filesystems.
Every fstests run I've done on sub-page block size XFS filesystems
has since picking up this commit 2 days ago has failed with bad page
state errors such as:
# ./run_check.sh "-m rmapbt=1,reflink=1 -i sparse=1 -b size=1k" "generic/038"
....
SECTION -- xfs
FSTYP -- xfs (debug)
PLATFORM -- Linux/x86_64 test1 4.20.0-rc6-dgc+
MKFS_OPTIONS -- -f -m rmapbt=1,reflink=1 -i sparse=1 -b size=1k /dev/sdc
MOUNT_OPTIONS -- /dev/sdc /mnt/scratch
generic/038 454s ...
run fstests generic/038 at 2018-12-20 18:43:05
XFS (sdc): Unmounting Filesystem
XFS (sdc): Mounting V5 Filesystem
XFS (sdc): Ending clean mount
BUG: Bad page state in process kswapd0 pfn:3a7fa
page:ffffea0000ccbeb0 count:0 mapcount:0 mapping:ffff88800d9b6360 index:0x1
flags: 0xfffffc0000000()
raw: 000fffffc0000000 dead000000000100 dead000000000200 ffff88800d9b6360
raw: 0000000000000001 0000000000000000 00000000ffffffff
page dumped because: non-NULL mapping
CPU: 0 PID: 676 Comm: kswapd0 Not tainted 4.20.0-rc6-dgc+ #915
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.1-1 04/01/2014
Call Trace:
dump_stack+0x67/0x90
bad_page.cold.116+0x8a/0xbd
free_pcppages_bulk+0x4bf/0x6a0
free_unref_page_list+0x10f/0x1f0
shrink_page_list+0x49d/0xf50
shrink_inactive_list+0x19d/0x3b0
shrink_node_memcg.constprop.77+0x398/0x690
? shrink_slab.constprop.81+0x278/0x3f0
shrink_node+0x7a/0x2f0
kswapd+0x34b/0x6d0
? node_reclaim+0x240/0x240
kthread+0x11f/0x140
? __kthread_bind_mask+0x60/0x60
ret_from_fork+0x24/0x30
Disabling lock debugging due to kernel taint
....
The failures are from anyway that frees pages and empties the
per-cpu page magazines, so it's not a predictable failure or an easy
to debug failure.
generic/038 is a reliable reproducer of this problem - it has a 9 in
10 failure rate on one of my test machines. Failure on other
machines have been at random points in fstests runs but every run
has ended up tripping this problem. Hence generic/038 was used to
bisect the failure because it was the most reliable failure.
It is too close to the 4.20 release (not to mention holidays) to
try to diagnose, fix and test the underlying cause of the problem,
so reverting the commit is the only option we have right now. The
revert has been tested against a current tot 4.20-rc7+ kernel across
multiple machines running sub-page block size XFs filesystems and
none of the bad page state failures have been seen.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Cc: Piotr Jaroszynski <pjaroszynski@nvidia.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: William Kucharski <william.kucharski@oracle.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Brian Foster <bfoster@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use __print_symbolic to print the scrub type in ftrace output.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Use __print_symbolic to print the btree type in ftrace output.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Move XFS_INODE_FORMAT_STR to libxfs so that we don't forget to keep it
updated, and add necessary TRACE_DEFINE_ENUM.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Move XFS_AG_BTREE_CMP_FORMAT_STR to libxfs so that we don't forget to
keep it updated, and TRACE_DEFINE_ENUM the values while we're at it.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
ftrace's __print_symbolic() has a (very poorly documented) requirement
that any enum values used in the symbol to string translation table be
wrapped in a TRACE_DEFINE_ENUM so that the enum value can be encoded in
the ftrace ring buffer. Fix this unsatisfied requirement.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Use %pS instead of %pF in ftrace strings so that we record the actual
function address instead of the function descriptor.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
If the file system has been shut down or is read-only, then
ext4_write_inode() needs to bail out early.
Also use jbd2_complete_transaction() instead of ext4_force_commit() so
we only force a commit if it is needed.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Some time back, nfsd switched from calling vfs_fsync() to using a new
commit_metadata() hook in export_operations(). If the file system did
not provide a commit_metadata() hook, it fell back to using
sync_inode_metadata(). Unfortunately doesn't work on all file
systems. In particular, it doesn't work on ext4 due to how the inode
gets journalled --- the VFS writeback code will not always call
ext4_write_inode().
So we need to provide our own ext4_nfs_commit_metdata() method which
calls ext4_write_inode() directly.
Google-Bug-Id: 121195940
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
SUNRPC has two sorts of credentials, both of which appear as
"struct rpc_cred".
There are "generic credentials" which are supplied by clients
such as NFS and passed in 'struct rpc_message' to indicate
which user should be used to authorize the request, and there
are low-level credentials such as AUTH_NULL, AUTH_UNIX, AUTH_GSS
which describe the credential to be sent over the wires.
This patch replaces all the generic credentials by 'struct cred'
pointers - the credential structure used throughout Linux.
For machine credentials, there is a special 'struct cred *' pointer
which is statically allocated and recognized where needed as
having a special meaning. A look-up of a low-level cred will
map this to a machine credential.
Signed-off-by: NeilBrown <neilb@suse.com>
Acked-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Use the common 'struct cred' to pass credentials for readdir.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Rather than keying the access cache with 'struct rpc_cred',
use 'struct cred'. Then use cred_fscmp() to compare
credentials rather than comparing the raw pointer.
A benefit of this approach is that in the common case we avoid the
rpc_lookup_cred_nonblock() call which can be slow when the cred cache is large.
This also keeps many fewer items pinned in the rpc cred cache, so the
cred cache is less likely to get large.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
NFS needs to know when a credential is about to expire so that
it can modify write-back behaviour to finish the write inside the
expiry time.
It currently uses functions in SUNRPC code which make use of a
fairly complex callback scheme and flags in the generic credientials.
As I am working to discard the generic credentials, this has to change.
This patch moves the logic into NFS, in part by finding and caching
the low-level credential in the open_context. We then make direct
cred-api calls on that.
This makes the code much simpler and removes a dependency on generic
rpc credentials.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
When NFS creates a machine credential, it is a "generic" credential,
not tied to any auth protocol, and is really just a container for
the princpal name.
This doesn't get linked to a genuine credential until rpcauth_bindcred()
is called.
The lookup always succeeds, so various places that test if the machine
credential is NULL, are pointless.
As a step towards getting rid of generic credentials, this patch gets
rid of generic machine credentials. The nfs_client and rpc_client
just hold a pointer to a constant principal name.
When a machine credential is wanted, a special static 'struct rpc_cred'
pointer is used. rpcauth_bindcred() recognizes this, finds the
principal from the client, and binds the correct credential.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
This lock is no longer necessary.
If nfs4_get_renew_cred() needs to hunt through the open-state
creds for a user cred, it still takes the lock to stablize
the rbtree, but otherwise there are no races.
Note that this completely removes the lock from nfs4_renew_state().
It appears that the original need for the locking here was removed
long ago, and there is no longer anything to protect.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
NFSv4 state management tries a root credential when no machine
credential is available, as can happen with kerberos.
It does this by replacing the cl_machine_cred with a root credential.
This means that any user of the machine credential needs to take
a lock while getting a reference to the machine credential, which is
a little cumbersome.
So introduce an explicit cl_root_cred, and never free either
credential until client shutdown. This means that no locking
is needed to reference these credentials. Future patches
will make use of this.
This is only a temporary addition. both cl_machine_cred and
cl_root_cred will disappear later in the series.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
We can use cred->groupinfo (from the 'struct cred') instead.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
The SUNRPC credential framework was put together before
Linux has 'struct cred'. Now that we have it, it makes sense to
use it.
This first step just includes a suitable 'struct cred *' pointer
in every 'struct auth_cred' and almost every 'struct rpc_cred'.
The rpc_cred used for auth_null has a NULL 'struct cred *' as nothing
else really makes sense.
For rpc_cred, the pointer is reference counted.
For auth_cred it isn't. struct auth_cred are either allocated on
the stack, in which case the thread owns a reference to the auth,
or are part of 'struct generic_cred' in which case gc_base owns the
reference, and "acred" shares it.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Please see comment to filelayout_pg_test for reference.
To: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Anna Schumaker <anna.schumaker@netapp.com>
Cc: linux-nfs@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
commit e8f25e6d6d "NFS: Remove the NFS v4 xdev mount function"
removed the last use of this.
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
If we receive a file handle, either from NFS or open_by_handle_at(2),
and it points at an inode which has not been initialized, and the file
system has metadata checksums enabled, we shouldn't try to get the
inode, discover the checksum is invalid, and then declare the file
system as being inconsistent.
This can be reproduced by creating a test file system via "mke2fs -t
ext4 -O metadata_csum /tmp/foo.img 8M", mounting it, cd'ing into that
directory, and then running the following program.
#define _GNU_SOURCE
#include <fcntl.h>
struct handle {
struct file_handle fh;
unsigned char fid[MAX_HANDLE_SZ];
};
int main(int argc, char **argv)
{
struct handle h = {{8, 1 }, { 12, }};
open_by_handle_at(AT_FDCWD, &h.fh, O_RDONLY);
return 0;
}
Google-Bug-Id: 120690101
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
In ext4_expand_extra_isize_ea(), we calculate the total size of the
xattr header, plus the xattr entries so we know how much of the
beginning part of the xattrs to move when expanding the inode extra
size. We need to include the terminating u32 at the end of the xattr
entries, or else if there is uninitialized, non-zero bytes after the
xattr entries and before the xattr values, the list of xattr entries
won't be properly terminated.
Reported-by: Steve Graham <stgraham2000@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Some servers require that the setinfo matches the exact size,
and in this case compounding changes introduced by
commit c2e0fe3f5a ("cifs: make rmdir() use compounding")
caused us to send 8 bytes (padded length) instead of 1 byte
(the size of the structure). See MS-FSCC section 2.4.11.
Fixing this when we send a SET_INFO command for delete file
disposition, then ends up as an iov of a single byte but this
causes problems with SMB3 and encryption.
To avoid this, instead of creating a one byte iov for the disposition value
and then appending an additional iov with a 7 byte padding we now handle
this as a single 8 byte iov containing both the disposition byte as well as
the padding in one single buffer.
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Acked-by: Paulo Alcantara <palcantara@suse.de>
44d8047f1d ("binder: use standard functions to allocate fds")
exposed a pre-existing issue in the binder driver.
fdget() is used in ksys_ioctl() as a performance optimization.
One of the rules associated with fdget() is that ksys_close() must
not be called between the fdget() and the fdput(). There is a case
where this requirement is not met in the binder driver which results
in the reference count dropping to 0 when the device is still in
use. This can result in use-after-free or other issues.
If userpace has passed a file-descriptor for the binder driver using
a BINDER_TYPE_FDA object, then kys_close() is called on it when
handling a binder_ioctl(BC_FREE_BUFFER) command. This violates
the assumptions for using fdget().
The problem is fixed by deferring the close using task_work_add(). A
new variant of __close_fd() was created that returns a struct file
with a reference. The fput() is deferred instead of using ksys_close().
Fixes: 44d8047f1d ("binder: use standard functions to allocate fds")
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Todd Kjos <tkjos@google.com>
Cc: stable <stable@vger.kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
- Parse the 4BAIT SFDP section
- Add a bunch of SPI NOR entries to the flash_info table
- Add the concept of SFDP fixups and use it to fix a bug on MX25L25635F
- A bunch of minor cleanups/comestic changes
-----BEGIN PGP SIGNATURE-----
iQJQBAABCgA6FiEEKmCqpbOU668PNA69Ze02AX4ItwAFAlwVQUYcHGJvcmlzLmJy
ZXppbGxvbkBib290bGluLmNvbQAKCRBl7TYBfgi3ACIYD/sG0+vRIKK8+NgNUHYy
nzKICKvdnBrm2RWi+6va5n2pYggyNy1VhWEjmqWLupjxn7NGkjZiBilhfj8Iv6YN
HScNy7FLHM6pxTKpsZKQLLGKvaUXODgvZwiw3L6T5T3JaJ5nlpE5g8jQy8sCzfjK
pwKdrOw17caZgoY0bMe2ppCObIDLd+mY+WSHbo6tb4/fohpTX1l9QZYHjfgHU9vP
CG0z3sU0JCNGXsbQMngfeuyXFjJ4OKdnklbVTeZl673AYtQMBhQEIGNVkVefuBP3
p8hU0CWRn0Yikc1HGTENvYCnQ7ju3z+16rnLxy3A5CPHhCDrTgUmM8HabYbh+0Si
0Y1wXpEOZ0OZQ7uMs2Q8SK0GLyxqdxkE0ibHgb7K/aLb+yg8oB7DB4Uenb06CiaQ
KAZWGgWZlSondX+/GI7YcQozslFFCfixAw6H0kCpQW0/2piXNqsN5BOROEqjueXW
xeMG0DNnzrQ6/vB9ukLESSB/YvVwfUvt6GSjqMSDdXwx6zyKSvHJ+chCxlK46+Hm
zIVcvNT3mpwcuVnQOZZeCaiIrDUAySEq/8Ztp9O5/CfkzdQyMWxDPoY9A0HjL2p3
FmRN7aAB9jJcdHc2tLwcKPRepjliIUMLf0NXdTSizzQz8WJqULZRBN0VWW4sCbLc
+tTisYjX8fYUz9+kHUcQ4XY16g==
=JXhP
-----END PGP SIGNATURE-----
Merge tag 'spi-nor/for-4.21' of git://git.infradead.org/linux-mtd into mtd/next
Core changes:
- Parse the 4BAIT SFDP section
- Add a bunch of SPI NOR entries to the flash_info table
- Add the concept of SFDP fixups and use it to fix a bug on MX25L25635F
- A bunch of minor cleanups/comestic changes
NAND core changes:
- kernel-doc miscellaneous fixes.
- Third batch of fixes/cleanup to the raw NAND core impacting various
controller drivers (ams-delta, marvell, fsmc, denali, tegra, vf610):
* Stopping to pass mtd_info objects to internal functions
* Reorganizing code to avoid forward declarations
* Dropping useless test in nand_legacy_set_defaults()
* Moving nand_exec_op() to internal.h
* Adding nand_[de]select_target() helpers
* Passing the CS line to be selected in struct nand_operation
* Making ->select_chip() optional when ->exec_op() is implemented
* Deprecating the ->select_chip() hook
* Moving the ->exec_op() method to nand_controller_ops
* Moving ->setup_data_interface() to nand_controller_ops
* Deprecating the dummy_controller field
* Fixing JEDEC detection
* Providing a helper for polling GPIO R/B pin
Raw NAND chip drivers changes:
- Macronix:
* Flagging 1.8V AC chips with a broken GET_FEATURES(TIMINGS)
Raw NAND controllers drivers changes:
- Ams-delta:
* Fixing the error path
* SPDX tag added
* May be compiled with COMPILE_TEST=y
* Conversion to ->exec_op() interface
* Dropping .IOADDR_R/W use
* Use GPIO API for data I/O
- Denali:
* Removing denali_reset_banks()
* Removing ->dev_ready() hook
* Including <linux/bits.h> instead of <linux/bitops.h>
* Changes to comply with the above fixes/cleanup done in the core.
- FSMC:
* Adding an SPDX tag to replace the license text
* Making conversion from chip to fsmc consistent
* Fixing unchecked return value in fsmc_read_page_hwecc
* Changes to comply with the above fixes/cleanup done in the core.
- Marvell:
* Preventing timeouts on a loaded machine (fix)
* Changes to comply with the above fixes/cleanup done in the core.
- OMAP2:
* Pass the parent of pdev to dma_request_chan() (fix)
- R852:
* Use generic DMA API
- sh_flctl:
* Converting to SPDX identifiers
- Sunxi:
* Write pageprog related opcodes to the right register: WCMD_SET (fix)
- Tegra:
* Stop implementing ->select_chip()
- VF610:
* Adding an SPDX tag to replace the license text
* Changes to comply with the above fixes/cleanup done in the core.
- Various trivial/spelling/coding style fixes.
SPI-NAND drivers changes:
- Removing the depreacated mt29f_spinand driver from staging.
- Adding support for:
* Toshiba TC58CVG2S0H
* GigaDevice GD5FxGQ4xA
* Winbond W25N01GV
Several ioctl structs change size between native 32-bit (ia32) and x32
applications, because x32 follows the native 64-bit (amd64) integer
alignment rules and uses 64-bit time_t. In these instances, the ioctl
number changes so userspace simply gets -ENOTTY. This scenario can be
handled by simply adding more cases.
Looking at the different ioctls implemented here:
- All the ones marked 'No size or alignment issue on any arch' should
presumably all be fine.
- All the ones under BROKEN_X86_ALIGNMENT are different under integer
alignment rules. Since x32 matches amd64 here, we just need both
sets of cases handled.
- XFS_IOC_SWAPEXT has both integer alignment differences and time_t
differences. Since x32 matches amd64 here, we need to add a case
which calls the native implementation.
- The remaining ioctls have neither 64-bit integers nor time_t, so
x32 matches ia32 here and no change is required at this level. The
bulkstat ioctl implementations have some pointer chasing which is
handled separately.
Signed-off-by: Nick Bowler <nbowler@draconx.ca>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The bulkstat family of ioctls are problematic on x32, because there is
a mixup of native 32-bit and 64-bit conventions. The xfs_fsop_bulkreq
struct contains pointers and 32-bit integers so that matches the native
32-bit layout, and that means the ioctl implementation goes into the
regular compat path on x32.
However, the 'ubuffer' member of that struct in turn refers to either
struct xfs_inogrp or xfs_bstat (or an array of these). On x32, those
structures match the native 64-bit layout. The compat implementation
writes out the 32-bit version of these structures. This is not the
expected format for x32 userspace, causing problems.
Fortunately the functions which actually output these xfs_inogrp and
xfs_bstat structures have an easy way to select which output format
is required, so we just need a little tweak to select the right format
on x32.
Signed-off-by: Nick Bowler <nbowler@draconx.ca>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
While inspecting the ioctl implementations, I noticed that the compat
implementation of XFS_IOC_ATTRLIST_BY_HANDLE does not do exactly the
same thing as the native implementation. Specifically, the "cursor"
does not appear to be written out to userspace on the compat path,
like it is on the native path.
This adjusts the compat implementation to copy out the cursor just
like the native implementation does. The attrlist cursor does not
require any special compat handling. This fixes xfstests xfs/269
on both IA-32 and x32 userspace, when running on an amd64 kernel.
Signed-off-by: Nick Bowler <nbowler@draconx.ca>
Fixes: 0facef7fb0 ("xfs: in _attrlist_by_handle, copy the cursor back to userspace")
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Before this patch, function do_grow would not reserve enough journal
blocks in the transaction to unstuff jdata files while growing them.
This patch adds the logic to add one more block if the file to grow
is jdata.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
In preparation of handing in iocbs in a different fashion as well. Also
make it clear that the iocb being passed in isn't modified, by marking
it const throughout.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Replace the percpu_ref_put() + kmem_cache_free() with a call to
iocb_put() instead.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Plugging is meant to optimize submission of a string of IOs, if we don't
have more than 2 being submitted, don't bother setting up a plug.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
It's 192 bytes, fairly substantial. Most items don't need to be cleared,
especially not upfront. Clear the ones we do need to clear, and leave
the other ones for setup when the iocb is prepared and submitted.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This is in preparation for certain types of IO not needing a ring
reserveration.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We know this is a read/write request, but in preparation for
having different kinds of those, ensure that we call the assigned
handler instead of assuming it's aio_complete_rq().
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
* for-4.21/block: (351 commits)
blk-mq: enable IO poll if .nr_queues of type poll > 0
blk-mq: change blk_mq_queue_busy() to blk_mq_queue_inflight()
blk-mq: skip zero-queue maps in blk_mq_map_swqueue
block: fix blk-iolatency accounting underflow
blk-mq: fix dispatch from sw queue
block: mq-deadline: Fix write completion handling
nvme-pci: don't share queue maps
blk-mq: only dispatch to non-defauly queue maps if they have queues
blk-mq: export hctx->type in debugfs instead of sysfs
blk-mq: fix allocation for queue mapping table
blk-wbt: export internal state via debugfs
blk-mq-debugfs: support rq_qos
block: update sysfs documentation
block: loop: check error using IS_ERR instead of IS_ERR_OR_NULL in loop_add()
aoe: add __exit annotation
block: clear REQ_HIPRI if polling is not supported
blk-mq: replace and kill blk_mq_request_issue_directly
blk-mq: issue directly with bypass 'false' in blk_mq_sched_insert_requests
blk-mq: refactor the code of issue request directly
block: remove the bio_integrity_advance export
...
current_time is the last remaining caller of current_kernel_time64(),
which is a wrapper around ktime_get_coarse_real_ts64(). This calls the
latter directly for consistency with the rest of the kernel that is moving
to the ktime_get_ family of time accessors, as now documented in
Documentation/core-api/timekeeping.rst.
An open questions is whether we may want to actually call the more
accurate ktime_get_real_ts64() for file systems that save high-resolution
timestamps in their on-disk format. This would add a small overhead to
each update of the inode stamps but lead to inode timestamps to actually
have a usable resolution better than one jiffy (1 to 10 milliseconds
normally). Experiments on a variety of hardware platforms show a typical
time of around 100 CPU cycles to read the cycle counter and calculate the
accurate time from that. On old platforms without a cycle counter, this
can be signiciantly higher, up to several microseconds to access a
hardware clock, but those have become very rare by now.
I traced the original addition of the current_kernel_time() call to set
the nanosecond fields back to linux-2.5.48, where Andi Kleen added a patch
with subject "nanosecond stat timefields". Andi explains that the
motivation was to introduce as little overhead as possible back then. At
this time, reading the clock hardware was also more expensive when most
architectures did not have a cycle counter.
One side effect of having more accurate inode timestamp would be having to
write out the inode every time that mtime/ctime/atime get touched on most
systems, whereas many file systems today only write it when the timestamps
have changed, i.e. at most once per jiffy unless something else changes
as well. That change would certainly be noticed in some workloads, which
is enough reason to not do it without a good reason, regardless of the
cost of reading the time.
One thing we could still consider however would be to round the timestamps
from current_time() to multiples of NSEC_PER_JIFFY, e.g. full
milliseconds rather than having six or seven meaningless but confusing
digits at the end of the timestamp.
Link: http://lkml.kernel.org/r/20180726130820.4174359-1-arnd@arndb.de
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
... and don't abuse mount_nodev(), while we are at it.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Reviewed-by: David Howells <dhowells@redhat.com>
The typos accumulate over time so once in a while time they get fixed in
a large patch.
Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In the error handling block, err holds the return value of either
btrfs_del_root_ref() or btrfs_del_inode_ref() but it hasn't been checked
since it's introduction with commit fe66a05a06 (Btrfs: improve error
handling for btrfs_insert_dir_item callers) in 2012.
If the error handling in the error handling fails, there's not much left
to do and the abort either happened earlier in the callees or is
necessary here.
So if one of btrfs_del_root_ref() or btrfs_del_inode_ref() failed, abort
the transaction, but still return the original code of the failure
stored in 'ret' as this will be reported to the user.
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since cloning and deduplication are no longer Btrfs specific operations, we
now have generic code to handle parameter validation, compare file ranges
used for deduplication, clear capabilities when cloning, etc. This change
makes Btrfs use it, eliminating a lot of code in Btrfs and also fixing a
few bugs, such as:
1) When cloning, the destination file's capabilities were not dropped
(the fstest generic/513 tests this);
2) We were not checking if the destination file is immutable;
3) Not checking if either the source or destination files are swap
files (swap file support is coming soon for Btrfs);
4) System limits were not checked (resource limits and O_LARGEFILE).
Note that the generic helper generic_remap_file_range_prep() does start
and waits for writeback by calling filemap_write_and_wait_range(), however
that is not enough for Btrfs for two reasons:
1) With compression, we need to start writeback twice in order to get the
pages marked for writeback and ordered extents created;
2) filemap_write_and_wait_range() (and all its other variants) only waits
for the IO to complete, but we need to wait for the ordered extents to
finish, so that when we do the actual reflinking operations the file
extent items are in the fs tree. This is also important due to the fact
that the generic helper, for the deduplication case, compares the
contents of the pages in the requested range, which might require
reading extents from disk in the very unlikely case that pages get
invalidated after writeback finishes (so the file extent items must be
up to date in the fs tree).
Since these reasons are specific to Btrfs we have to do it in the Btrfs
code before calling generic_remap_file_range_prep(). This also results
in a simpler way of dealing with existing delalloc in the source/target
ranges, specially for the deduplication case where we used to lock all
the pages first and then if we found any dealloc for the range, or
ordered extent, we would unlock the pages trigger writeback and wait for
ordered extents to complete, then lock all the pages again and check if
deduplication can be done. So now we get a simpler approach: lock the
inodes, then trigger writeback and then wait for ordered extents to
complete.
So make btrfs use generic_remap_file_range_prep() (XFS and OCFS2 use it)
to eliminate duplicated code, fix a few bugs and benefit from future bug
fixes done there - for example the recent clone and dedupe bugs involving
reflinking a partial EOF block got a counterpart fix in the generic
helper, since it affected all filesystems supporting these operations,
so we no longer need special checks in Btrfs for them.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
extent_readpages processes all pages in the readlist in batches of 16,
this is implemented by a single for loop but thanks to an if condition
the loop does 2 things based on whether we've filled the batch or not.
Additionally due to the structure of the code there is an additional
check which deals with partial batches.
Streamline all of this by explicitly using two loops. The outter one is
used to process all pages while the inner one just fills in the batch
of 16 (currently). Due to this new structure the code guarantees that
all pages are processed in the loop hence the code to deal with any
leftovers is eliminated.
This also enable the compiler to inline __extent_readpages:
./scripts/bloat-o-meter fs/btrfs/extent_io.o extent_io.for
add/remove: 0/1 grow/shrink: 1/0 up/down: 660/-820 (-160)
Function old new delta
extent_readpages 476 1136 +660
__extent_readpages 820 - -820
Total: Before=44315, After=44155, chg -0.36%
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The first step of the rebalance process ensures there is 1MiB free on
each device. This number seems rather small. And in fact when talking to
the original authors their opinions were:
"man that's a little bonkers"
"i don't think we even need that code anymore"
"I think it was there to make sure we had room for the blank 1M at the
beginning. I bet it goes all the way back to v0"
"we just don't need any of that tho, i say we just delete it"
Clearly, this piece of code has lost its original intent throughout the
years. It doesn't really bring any real practical benefits to the
relocation process.
Additionally, this patch makes the balance process more lightweight by
removing a pair of shrink/grow operations which are rather expensive for
heavily populated filesystems. This is mainly due to shrink requiring
relocating block groups, involving heavy use of the btree.
The intermediate shrink/grow can fail and leave the filesystem in a
middle state that would need to be changed back by the user.
Suggested-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
If we create a snapshot of a snapshot currently being used by a send
operation, we can end up with send failing unexpectedly (returning
-ENOENT error to user space for example). The following diagram shows
how this happens.
CPU 1 CPU2 CPU3
btrfs_ioctl_send()
(...)
create_snapshot()
-> creates snapshot of a
root used by the send
task
btrfs_commit_transaction()
create_pending_snapshot()
__get_inode_info()
btrfs_search_slot()
btrfs_search_slot_get_root()
down_read commit_root_sem
get reference on eb of the
commit root
-> eb with bytenr == X
up_read commit_root_sem
btrfs_cow_block(root node)
btrfs_free_tree_block()
-> creates delayed ref to
free the extent
btrfs_run_delayed_refs()
-> runs the delayed ref,
adds extent to
fs_info->pinned_extents
btrfs_finish_extent_commit()
unpin_extent_range()
-> marks extent as free
in the free space cache
transaction commit finishes
btrfs_start_transaction()
(...)
btrfs_cow_block()
btrfs_alloc_tree_block()
btrfs_reserve_extent()
-> allocates extent at
bytenr == X
btrfs_init_new_buffer(bytenr X)
btrfs_find_create_tree_block()
alloc_extent_buffer(bytenr X)
find_extent_buffer(bytenr X)
-> returns existing eb,
which the send task got
(...)
-> modifies content of the
eb with bytenr == X
-> uses an eb that now
belongs to some other
tree and no more matches
the commit root of the
snapshot, resuts will be
unpredictable
The consequences of this race can be various, and can lead to searches in
the commit root performed by the send task failing unexpectedly (unable to
find inode items, returning -ENOENT to user space, for example) or not
failing because an inode item with the same number was added to the tree
that reused the metadata extent, in which case send can behave incorrectly
in the worst case or just fail later for some reason.
Fix this by performing a copy of the commit root's extent buffer when doing
a search in the context of a send operation.
CC: stable@vger.kernel.org # 4.4.x: 1fc28d8e2e: Btrfs: move get root out of btrfs_search_slot to a helper
CC: stable@vger.kernel.org # 4.4.x: f9ddfd0592: Btrfs: remove unused check of skip_locking
CC: stable@vger.kernel.org # 4.4.x
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When initializing the security xattrs, we are holding a transaction handle
therefore we need to use a GFP_NOFS context in order to avoid a deadlock
with reclaim in case it's triggered.
Fixes: 39a27ec100 ("btrfs: use GFP_KERNEL for xattr and acl allocations")
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
With my delayed refs patches in place we started seeing a large amount
of aborts in __btrfs_free_extent:
BTRFS error (device sdb1): unable to find ref byte nr 91947008 parent 0 root 35964 owner 1 offset 0
Call Trace:
? btrfs_merge_delayed_refs+0xaf/0x340
__btrfs_run_delayed_refs+0x6ea/0xfc0
? btrfs_set_path_blocking+0x31/0x60
btrfs_run_delayed_refs+0xeb/0x180
btrfs_commit_transaction+0x179/0x7f0
? btrfs_check_space_for_delayed_refs+0x30/0x50
? should_end_transaction.isra.19+0xe/0x40
btrfs_drop_snapshot+0x41c/0x7c0
btrfs_clean_one_deleted_snapshot+0xb5/0xd0
cleaner_kthread+0xf6/0x120
kthread+0xf8/0x130
? btree_invalidatepage+0x90/0x90
? kthread_bind+0x10/0x10
ret_from_fork+0x35/0x40
This was because btrfs_drop_snapshot depends on the root not being
modified while it's dropping the snapshot. It will unlock the root node
(and really every node) as it walks down the tree, only to re-lock it
when it needs to do something. This is a problem because if we modify
the tree we could cow a block in our path, which frees our reference to
that block. Then once we get back to that shared block we'll free our
reference to it again, and get ENOENT when trying to lookup our extent
reference to that block in __btrfs_free_extent.
This is ultimately happening because we have delayed items left to be
processed for our deleted snapshot _after_ all of the inodes are closed
for the snapshot. We only run the delayed inode item if we're deleting
the inode, and even then we do not run the delayed insertions or delayed
removals. These can be run at any point after our final inode does its
last iput, which is what triggers the snapshot deletion. We can end up
with the snapshot deletion happening and then have the delayed items run
on that file system, resulting in the above problem.
This problem has existed forever, however my patches made it much easier
to hit as I wake up the cleaner much more often to deal with delayed
iputs, which made us more likely to start the snapshot dropping work
before the transaction commits, which is when the delayed items would
generally be run. Before, generally speaking, we would run the delayed
items, commit the transaction, and wakeup the cleaner thread to start
deleting snapshots, which means we were less likely to hit this problem.
You could still hit it if you had multiple snapshots to be deleted and
ended up with lots of delayed items, but it was definitely harder.
Fix for now by simply running all the delayed items before starting to
drop the snapshot. We could make this smarter in the future by making
the delayed items per-root, and then simply drop any delayed items for
roots that we are going to delete. But for now just a quick and easy
solution is the safest.
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When debugging some weird extent reference bug I suspected that we were
changing a snapshot while we were deleting it, which could explain my
bug. This was indeed what was happening, and this patch helped me
verify my theory. It is never correct to modify the snapshot once it's
being deleted, so mark the root when we are deleting it and make sure we
complain about it when it happens.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
@blocksize variable in do_walk_down() is only used once, really no need
to declare it.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since scrub workers only do memory allocation with GFP_KERNEL when they
need to perform repair, we can move the recent setup of the nofs context
up to scrub_handle_errored_block() instead of setting it up down the call
chain at insert_full_stripe_lock() and scrub_add_page_to_wr_bio(),
removing some duplicate code and comment. So the only paths for which a
scrub worker can do memory allocations using GFP_KERNEL are the following:
scrub_bio_end_io_worker()
scrub_block_complete()
scrub_handle_errored_block()
lock_full_stripe()
insert_full_stripe_lock()
-> kmalloc with GFP_KERNEL
scrub_bio_end_io_worker()
scrub_block_complete()
scrub_handle_errored_block()
scrub_write_page_to_dev_replace()
scrub_add_page_to_wr_bio()
-> kzalloc with GFP_KERNEL
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The scrub context is allocated with GFP_KERNEL and called from
btrfs_scrub_dev under the fs_info::device_list_mutex. This is not safe
regarding reclaim that could try to flush filesystem data in order to
get the memory. And the device_list_mutex is held during superblock
commit, so this would cause a lockup.
Move the alocation and initialization before any changes that require
the mutex.
Signed-off-by: David Sterba <dsterba@suse.com>
We can pass fs_info directly as this is the only member of btrfs_device
that's bing used inside scrub_setup_ctx.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have a bunch of magic to make sure we're throttling delayed refs when
truncating a file. Now that we have a delayed refs rsv and a mechanism
for refilling that reserve simply use that instead of all of this magic.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Over the years we have built up a lot of infrastructure to keep delayed
refs in check, mostly by running them at btrfs_end_transaction() time.
We have a lot of different maths we do to figure out how much, if we
should do it inline or async, etc. This existed because we had no
feedback mechanism to force the flushing of delayed refs when they
became a problem. However with the enospc flushing infrastructure in
place for flushing delayed refs when they put too much pressure on the
enospc system we have this problem solved. Rip out all of this code as
it is no longer needed.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now with the delayed_refs_rsv we can now know exactly how much pending
delayed refs space we need. This means we can drastically simplify
btrfs_check_space_for_delayed_refs by simply checking how much space we
have reserved for the global rsv (which acts as a spill over buffer) and
the delayed refs rsv. If our total size is beyond that amount then we
know it's time to commit the transaction and stop any more delayed refs
from being generated.
With the introduction of dealyed_refs_rsv infrastructure, namely
btrfs_update_delayed_refs_rsv we now know exactly how much pending
delayed refs space is required.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
A nice thing we gain with the delayed refs rsv is the ability to flush
the delayed refs on demand to deal with enospc pressure. Add states to
flush delayed refs on demand, and this will allow us to remove a lot of
ad-hoc work around checking to see if we should commit the transaction
to run our delayed refs.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Any space used in the delayed_refs_rsv will be freed up by a transaction
commit, so instead of just counting the pinned space we also need to
account for any space in the delayed_refs_rsv when deciding if it will
make a different to commit the transaction to satisfy our space
reservation. If we have enough bytes to satisfy our reservation ticket
then we are good to go, otherwise subtract out what space we would gain
back by committing the transaction and compare that against the pinned
space to make our decision.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Traditionally we've had voodoo in btrfs to account for the space that
delayed refs may take up by having a global_block_rsv. This works most
of the time, except when it doesn't. We've had issues reported and seen
in production where sometimes the global reserve is exhausted during
transaction commit before we can run all of our delayed refs, resulting
in an aborted transaction. Because of this voodoo we have equally
dubious flushing semantics around throttling delayed refs which we often
get wrong.
So instead give them their own block_rsv. This way we can always know
exactly how much outstanding space we need for delayed refs. This
allows us to make sure we are constantly filling that reservation up
with space, and allows us to put more precise pressure on the enospc
system. Instead of doing math to see if its a good time to throttle,
the normal enospc code will be invoked if we have a lot of delayed refs
pending, and they will be run via the normal flushing mechanism.
For now the delayed_refs_rsv will hold the reservations for the delayed
refs, the block group updates, and deleting csums. We could have a
separate rsv for the block group updates, but the csum deletion stuff is
still handled via the delayed_refs so that will stay there.
Historical background:
The global reserve has grown to cover everything we don't reserve space
explicitly for, and we've grown a lot of weird ad-hoc heuristics to know
if we're running short on space and when it's time to force a commit. A
failure rate of 20-40 file systems when we run hundreds of thousands of
them isn't super high, but cleaning up this code will make things less
ugly and more predictible.
Thus the delayed refs rsv. We always know how many delayed refs we have
outstanding, and although running them generates more we can use the
global reserve for that spill over, which fits better into it's desired
use than a full blown reservation. This first approach is to simply
take how many times we're reserving space for and multiply that by 2 in
order to save enough space for the delayed refs that could be generated.
This is a niave approach and will probably evolve, but for now it works.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com> # high-level review
[ added background notes from the cover letter ]
Signed-off-by: David Sterba <dsterba@suse.com>
We use this number to figure out how many delayed refs to run, but
__btrfs_run_delayed_refs really only checks every time we need a new
delayed ref head, so we always run at least one ref head completely no
matter what the number of items on it. Fix the accounting to only be
adjusted when we add/remove a ref head.
In addition to using this number to limit the number of delayed refs
run, a future patch is also going to use it to calculate the amount of
space required for delayed refs space reservation.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The cleanup_extent_op function actually would run the extent_op if it
needed running, which made the name sort of a misnomer. Change it to
run_and_cleanup_extent_op, and move the actual cleanup work to
cleanup_extent_op so it can be used by check_ref_cleanup() in order to
unify the extent op handling.
Reviewed-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We were missing some quota cleanups in check_ref_cleanup, so break the
ref head accounting cleanup into a helper and call that from both
check_ref_cleanup and cleanup_ref_head. This will hopefully ensure that
we don't screw up accounting in the future for other things that we add.
Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We do this dance in cleanup_ref_head and check_ref_cleanup, unify it
into a helper and cleanup the calling functions.
Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When using a 'var & (PAGE_SIZE - 1)' construct one is checking for a page
alignment and thus should use the PAGE_ALIGNED() macro instead of
open-coding it.
Convert all open-coded occurrences of PAGE_ALIGNED().
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Constructs like 'var & (PAGE_SIZE - 1)' or 'var & ~PAGE_MASK' can denote an
offset into a page.
So replace them by the offset_in_page() macro instead of open-coding it if
they're not used as an alignment check.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The dev-replace locking functions are now trivial wrappers around rw
semaphore that can be used directly everywhere. No functional change.
Signed-off-by: David Sterba <dsterba@suse.com>
After the rw semaphore has been added, the custom blocking using
::blocking_readers and ::read_lock_wq is redundant.
The blocking logic in __btrfs_map_block is replaced by extending the
time the semaphore is held, that has the same blocking effect on writes
as the previous custom scheme that waited until ::blocking_readers was
zero.
Signed-off-by: David Sterba <dsterba@suse.com>
This is the first part of removing the custom locking and waiting scheme
used for device replace. It was probably copied from extent buffer
locking, but there's nothing that would require more than is provided by
the common locking primitives.
The rw spinlock protects waiting tasks counter in case of incompatible
locks and the waitqueue. Same as rw semaphore.
This patch only switches the locking primitive, for better
bisectability. There should be no functional change other than the
overhead of the locking and potential sleeping instead of spinning when
the lock is contended.
Signed-off-by: David Sterba <dsterba@suse.com>
The device-replace read lock is going to use rw semaphore in followup
commits. The semaphore might sleep which is not possible in the radix
tree preload section. The lock nesting is now:
* device replace
* radix tree preload
* readahead spinlock
Signed-off-by: David Sterba <dsterba@suse.com>
Running btrfs/124 in a loop hung up on me sporadically with the
following call trace:
btrfs D 0 5760 5324 0x00000000
Call Trace:
? __schedule+0x243/0x800
schedule+0x33/0x90
btrfs_start_ordered_extent+0x10c/0x1b0 [btrfs]
? wait_woken+0xa0/0xa0
btrfs_wait_ordered_range+0xbb/0x100 [btrfs]
btrfs_relocate_block_group+0x1ff/0x230 [btrfs]
btrfs_relocate_chunk+0x49/0x100 [btrfs]
btrfs_balance+0xbeb/0x1740 [btrfs]
btrfs_ioctl_balance+0x2ee/0x380 [btrfs]
btrfs_ioctl+0x1691/0x3110 [btrfs]
? lockdep_hardirqs_on+0xed/0x180
? __handle_mm_fault+0x8e7/0xfb0
? _raw_spin_unlock+0x24/0x30
? __handle_mm_fault+0x8e7/0xfb0
? do_vfs_ioctl+0xa5/0x6e0
? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
do_vfs_ioctl+0xa5/0x6e0
? entry_SYSCALL_64_after_hwframe+0x3e/0xbe
ksys_ioctl+0x3a/0x70
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x60/0x1b0
entry_SYSCALL_64_after_hwframe+0x49/0xbe
This happens because during page writeback it's valid for
writepage_delalloc to instantiate a delalloc range which doesn't belong
to the page currently being written back.
The reason this case is valid is due to find_lock_delalloc_range
returning any available range after the passed delalloc_start and
ignoring whether the page under writeback is within that range.
In turn ordered extents (OE) are always created for the returned range
from find_lock_delalloc_range. If, however, a failure occurs while OE
are being created then the clean up code in btrfs_cleanup_ordered_extents
will be called.
Unfortunately the code in btrfs_cleanup_ordered_extents doesn't consider
the case of such 'foreign' range being processed and instead it always
assumes that the range OE are created for belongs to the page. This
leads to the first page of such foregin range to not be cleaned up since
it's deliberately missed and skipped by the current cleaning up code.
Fix this by correctly checking whether the current page belongs to the
range being instantiated and if so adjsut the range parameters passed
for cleaning up. If it doesn't, then just clean the whole OE range
directly.
Fixes: 524272607e ("btrfs: Handle delalloc error correctly to avoid ordered extent hang")
CC: stable@vger.kernel.org # 4.14+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The @found is always false when it comes to the if branch. Besides, the
bool type is more suitable for @found. Change the return value of the
function and its caller to bool as well.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The test case btrfs/001 with inode_cache mount option will encounter the
following warning:
WARNING: CPU: 1 PID: 23700 at fs/btrfs/inode.c:956 cow_file_range.isra.19+0x32b/0x430 [btrfs]
CPU: 1 PID: 23700 Comm: btrfs Kdump: loaded Tainted: G W O 4.20.0-rc4-custom+ #30
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
RIP: 0010:cow_file_range.isra.19+0x32b/0x430 [btrfs]
Call Trace:
? free_extent_buffer+0x46/0x90 [btrfs]
run_delalloc_nocow+0x455/0x900 [btrfs]
btrfs_run_delalloc_range+0x1a7/0x360 [btrfs]
writepage_delalloc+0xf9/0x150 [btrfs]
__extent_writepage+0x125/0x3e0 [btrfs]
extent_write_cache_pages+0x1b6/0x3e0 [btrfs]
? __wake_up_common_lock+0x63/0xc0
extent_writepages+0x50/0x80 [btrfs]
do_writepages+0x41/0xd0
? __filemap_fdatawrite_range+0x9e/0xf0
__filemap_fdatawrite_range+0xbe/0xf0
btrfs_fdatawrite_range+0x1b/0x50 [btrfs]
__btrfs_write_out_cache+0x42c/0x480 [btrfs]
btrfs_write_out_ino_cache+0x84/0xd0 [btrfs]
btrfs_save_ino_cache+0x551/0x660 [btrfs]
commit_fs_roots+0xc5/0x190 [btrfs]
btrfs_commit_transaction+0x2bf/0x8d0 [btrfs]
btrfs_mksubvol+0x48d/0x4d0 [btrfs]
btrfs_ioctl_snap_create_transid+0x170/0x180 [btrfs]
btrfs_ioctl_snap_create_v2+0x124/0x180 [btrfs]
btrfs_ioctl+0x123f/0x3030 [btrfs]
The file extent generation of the free space inode is equal to the last
snapshot of the file root, so the inode will be passed to cow_file_rage.
But the inode was created and its extents were preallocated in
btrfs_save_ino_cache, there are no cow copies on disk.
The preallocated extent is not yet in the extent tree, and
btrfs_cross_ref_exist will ignore the -ENOENT returned by
check_committed_ref, so we can directly write the inode to the disk.
Fixes: 78d4295b1e ("btrfs: lift some btrfs_cross_ref_exist checks in nocow path")
CC: stable@vger.kernel.org # 4.18+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The log tree has a long standing problem that when a file is fsync'ed we
only check for new ancestors, created in the current transaction, by
following only the hard link for which the fsync was issued. We follow the
ancestors using the VFS' dget_parent() API. This means that if we create a
new link for a file in a directory that is new (or in an any other new
ancestor directory) and then fsync the file using an old hard link, we end
up not logging the new ancestor, and on log replay that new hard link and
ancestor do not exist. In some cases, involving renames, the file will not
exist at all.
Example:
mkfs.btrfs -f /dev/sdb
mount /dev/sdb /mnt
mkdir /mnt/A
touch /mnt/foo
ln /mnt/foo /mnt/A/bar
xfs_io -c fsync /mnt/foo
<power failure>
In this example after log replay only the hard link named 'foo' exists
and directory A does not exist, which is unexpected. In other major linux
filesystems, such as ext4, xfs and f2fs for example, both hard links exist
and so does directory A after mounting again the filesystem.
Checking if any new ancestors are new and need to be logged was added in
2009 by commit 12fcfd22fe ("Btrfs: tree logging unlink/rename fixes"),
however only for the ancestors of the hard link (dentry) for which the
fsync was issued, instead of checking for all ancestors for all of the
inode's hard links.
So fix this by tracking the id of the last transaction where a hard link
was created for an inode and then on fsync fallback to a full transaction
commit when an inode has more than one hard link and at least one new hard
link was created in the current transaction. This is the simplest solution
since this is not a common use case (adding frequently hard links for
which there's an ancestor created in the current transaction and then
fsync the file). In case it ever becomes a common use case, a solution
that consists of iterating the fs/subvol btree for each hard link and
check if any ancestor is new, could be implemented.
This solves many unexpected scenarios reported by Jayashree Mohan and
Vijay Chidambaram, and for which there is a new test case for fstests
under review.
Fixes: 12fcfd22fe ("Btrfs: tree logging unlink/rename fixes")
CC: stable@vger.kernel.org # 4.4+
Reported-by: Vijay Chidambaram <vvijay03@gmail.com>
Reported-by: Jayashree Mohan <jayashree2912@gmail.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The first auto-assigned value to enum is 0, we can use that and not
initialize all members where the auto-increment does the same. This is
used for values that are not part of on-disk format.
Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
We can use simple enum for values that are not part of on-disk format:
ordered extent flags.
Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
We can use simple enum for values that are not part of on-disk format:
extent map flags.
Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
We can use simple enum for values that are not part of on-disk format:
extent buffer flags;
Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
We can use simple enum for values that are not part of on-disk format:
root tree flags.
Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
We can use simple enum for values that are not part of on-disk format:
internal filesystem states.
Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We can use simple enum for values that are not part of on-disk format:
block reserve types.
Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
We can use simple enum for values that are not part of on-disk format:
global filesystem states.
Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
This function really checks whether adding more data to the bio will
straddle a stripe/chunk. So first let's give it a more appropraite name
- btrfs_bio_fits_in_stripe. Secondly, the offset parameter was never
used to just remove it. Thirdly, pages are submitted to either btree or
data inodes so it's guaranteed that tree->ops is set so replace the
check with an ASSERT. Finally, document the parameters of the function.
No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When it was introduced in commit f094ac32ab ("Btrfs: fix NULL pointer
after aborting a transaction"), it was not used.
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
Document why map_private_extent_buffer() cannot return '1' (i.e. the map
spans two pages) for the csum_tree_block() case.
The current algorithm for detecting a page boundary crossing in
map_private_extent_buffer() will return a '1' *IFF* the extent buffer's
offset in the page + the offset passed in by csum_tree_block() and the
minimal length passed in by csum_tree_block() - 1 are bigger than
PAGE_SIZE.
We always pass BTRFS_CSUM_SIZE (32) as offset and a minimal length of 32
and the current extent buffer allocator always guarantees page aligned
extends, so the above condition can't be true.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
In map_private_extent_buffer() the 'offset' variable is initialized to a
page aligned version of the 'start' parameter.
But later on it is overwritten with either the offset from the extent
buffer's start or 0.
So get rid of the initial initialization.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
When a transaction commit starts, it attempts to pause scrub and it blocks
until the scrub is paused. So while the transaction is blocked waiting for
scrub to pause, we can not do memory allocation with GFP_KERNEL from scrub,
otherwise we risk getting into a deadlock with reclaim.
Checking for scrub pause requests is done early at the beginning of the
while loop of scrub_stripe() and later in the loop, scrub_extent() and
scrub_raid56_parity() are called, which in turn call scrub_pages() and
scrub_pages_for_parity() respectively. These last two functions do memory
allocations using GFP_KERNEL. Same problem could happen while scrubbing
the super blocks, since it calls scrub_pages().
We also can not have any of the worker tasks, created by the scrub task,
doing GFP_KERNEL allocations, because before pausing, the scrub task waits
for all the worker tasks to complete (also done at scrub_stripe()).
So make sure GFP_NOFS is used for the memory allocations because at any
time a scrub pause request can happen from another task that started to
commit a transaction.
Fixes: 58c4e17384 ("btrfs: scrub: use GFP_KERNEL on the submission path")
CC: stable@vger.kernel.org # 4.6+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For data inodes this hook does nothing but to return -EAGAIN which is
used to signal to the endio routines that this bio belongs to a data
inode. If this is the case the actual retrying is handled by
bio_readpage_error. Alternatively, if this bio belongs to the btree
inode then btree_io_failed_hook just does some cleanup and doesn't retry
anything.
This patch simplifies the code flow by eliminating
readpage_io_failed_hook and instead open-coding btree_io_failed_hook in
end_bio_extent_readpage. Also eliminate some needless checks since IO is
always performed on either data inode or btree inode, both of which are
guaranteed to have their extent_io_tree::ops set.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The btrfs_bio_end_io_t typedef was introduced with commit
a1d3c4786a ("btrfs: btrfs_multi_bio replaced with btrfs_bio")
but never used anywhere. This commit also introduced a forward declaration
of 'struct btrfs_bio' which is only needed for btrfs_bio_end_io_t.
Remove both as they're not needed anywhere.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
The end_io callback implemented as btrfs_io_bio_endio_readpage only
calls kfree. Also the callback is set only in case the csum buffer is
allocated and not pointing to the inline buffer. We can use that
information to drop the indirection and call a helper that will free the
csums only in the right case.
This shrinks struct btrfs_io_bio by 8 bytes.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
The io_bio tracks checksums and has an inline buffer or an allocated
one. And there's a third member that points to the right one, but we
don't need to use an extra pointer for that. Let btrfs_io_bio::csum
point to the right buffer and check that the inline buffer is not
accidentally freed.
This shrinks struct btrfs_io_bio by 8 bytes.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
The async_cow::root is used to propagate fs_info to async_cow_submit.
We can't use inode to reach it because it could become NULL after
write without compression in async_cow_start.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
There's one caller and its code is simple, we can open code it in
run_one_async_done. The errors are passed through bio.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
Print a kernel log message when the balance ends, either for cancel or
completed or if it is paused.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The information about balance arguments is important for system audit,
this patch prints the textual representation when balance starts or is
resumed.
Example command:
$ btrfs balance start -f -mprofiles=raid1,convert=single,soft -dlimit=10..20,usage=50 /btrfs
Example kernel log output:
BTRFS info (device sdb): balance: start -f -dusage=50,limit=10..20 -mconvert=single,soft,profiles=raid1 -sconvert=single,soft,profiles=raid1
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog, simplify code ]
Signed-off-by: David Sterba <dsterba@suse.com>
Factor out helper that describes block group flags from
describe_relocation. The result will not be longer than the given size.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add comments ]
Signed-off-by: David Sterba <dsterba@suse.com>
If the quota enable and snapshot creation ioctls are called concurrently
we can get into a deadlock where the task enabling quotas will deadlock
on the fs_info->qgroup_ioctl_lock mutex because it attempts to lock it
twice, or the task creating a snapshot tries to commit the transaction
while the task enabling quota waits for the former task to commit the
transaction while holding the mutex. The following time diagrams show how
both cases happen.
First scenario:
CPU 0 CPU 1
btrfs_ioctl()
btrfs_ioctl_quota_ctl()
btrfs_quota_enable()
mutex_lock(fs_info->qgroup_ioctl_lock)
btrfs_start_transaction()
btrfs_ioctl()
btrfs_ioctl_snap_create_v2
create_snapshot()
--> adds snapshot to the
list pending_snapshots
of the current
transaction
btrfs_commit_transaction()
create_pending_snapshots()
create_pending_snapshot()
qgroup_account_snapshot()
btrfs_qgroup_inherit()
mutex_lock(fs_info->qgroup_ioctl_lock)
--> deadlock, mutex already locked
by this task at
btrfs_quota_enable()
Second scenario:
CPU 0 CPU 1
btrfs_ioctl()
btrfs_ioctl_quota_ctl()
btrfs_quota_enable()
mutex_lock(fs_info->qgroup_ioctl_lock)
btrfs_start_transaction()
btrfs_ioctl()
btrfs_ioctl_snap_create_v2
create_snapshot()
--> adds snapshot to the
list pending_snapshots
of the current
transaction
btrfs_commit_transaction()
--> waits for task at
CPU 0 to release
its transaction
handle
btrfs_commit_transaction()
--> sees another task started
the transaction commit first
--> releases its transaction
handle
--> waits for the transaction
commit to be completed by
the task at CPU 1
create_pending_snapshot()
qgroup_account_snapshot()
btrfs_qgroup_inherit()
mutex_lock(fs_info->qgroup_ioctl_lock)
--> deadlock, task at CPU 0
has the mutex locked but
it is waiting for us to
finish the transaction
commit
So fix this by setting the quota enabled flag in fs_info after committing
the transaction at btrfs_quota_enable(). This ends up serializing quota
enable and snapshot creation as if the snapshot creation happened just
before the quota enable request. The quota rescan task, scheduled after
committing the transaction in btrfs_quote_enable(), will do the accounting.
Fixes: 6426c7ad69 ("btrfs: qgroup: Fix qgroup accounting when creating snapshot")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The available allocation bits members from struct btrfs_fs_info are
protected by a sequence lock, and when starting balance we access them
incorrectly in two different ways:
1) In the read sequence lock loop at btrfs_balance() we use the values we
read from fs_info->avail_*_alloc_bits and we can immediately do actions
that have side effects and can not be undone (printing a message and
jumping to a label). This is wrong because a retry might be needed, so
our actions must not have side effects and must be repeatable as long
as read_seqretry() returns a non-zero value. In other words, we were
essentially ignoring the sequence lock;
2) Right below the read sequence lock loop, we were reading the values
from avail_metadata_alloc_bits and avail_data_alloc_bits without any
protection from concurrent writers, that is, reading them outside of
the read sequence lock critical section.
So fix this by making sure we only read the available allocation bits
while in a read sequence lock critical section and that what we do in the
critical section is repeatable (has nothing that can not be undone) so
that any eventual retry that is needed is handled properly.
Fixes: de98ced9e7 ("Btrfs: use seqlock to protect fs_info->avail_{data, metadata, system}_alloc_bits")
Fixes: 1450612797 ("btrfs: fix a bogus warning when converting only data or metadata")
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We can have a lot freed extents during the life span of transaction, so
the red black tree that keeps track of the ranges of each freed extent
(fs_info->freed_extents[]) can get quite big. When finishing a
transaction commit we find each range, process it (discard the extents,
unpin them) and then remove it from the red black tree.
We can use an extent state record as a cache when searching for a range,
so that when we clean the range we can use the cached extent state we
passed to the search function instead of iterating the red black tree
again. Doing things as fast as possible when finishing a transaction (in
state TRANS_STATE_UNBLOCKED) is convenient as it reduces the time we
block another task that wants to commit the next transaction.
So change clear_extent_dirty() to allow an optional extent state record to
be passed as an argument, which will be passed down to __clear_extent_bit.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This patch lands the last case which needs to be handled by the fsid
change code. Namely, this is the case where a multidisk filesystem has
already undergone at least one successful fsid change i.e all disks
have the METADATA_UUID incompat bit and power failure occurs as another
fsid change is in progress. When such an event occurs, disks could be
split in 2 groups. One of the groups will have both METADATA_UUID and
CHANGING_FSID_V2 flags set coupled with old fsid/metadata_uuid pairs.
The other group of disks will have only METADATA_UUID bit set and their
fsid will be different than the one in disks in the first group. Here
we look at the following cases:
a) A disk from the first group is scanned first, so fs_devices is
created with stale fsid/metdata_uuid. Then when a disk from the
second group is scanned it needs to first check whether there exists
such an fs_devices that has fsid_change set to true (because it was
created with a disk having the CHANGING_FSID_V2 flag), the
metadata_uuid and fsid of the fs_devices will be different (since it was
created by a disk which already has had at least 1 successful fsid change)
and finally the metadata_uuid of the fs_devices will equal that of the
currently scanned disk (because metadata_uuid never really changes).
When the correct fs_devices is found the information from the scanned
disk will replace the current one in fs_devices since the scanned disk
will have higher generation number.
b) A disk from the second group is scanned so fs_devices is created
as usual with differing fsid/metdata_uid. Then when a disk from the
first group is scanned the code detects that it has both
CHANGING_FSID_V2 and METADATA_UUID flags set and will search for
fs_devices that has differing metadata_uuid/fsid and whose
metadata_uuid is the same as that of the scanned device.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This commit continues hardening the scanning code to handle cases where
power loss could have caused disks in a multi-disk filesystem to be
in inconsistent state. Namely handle the situation that can occur when
some of the disks in multi-disk fs have completed their fsid change i.e
they have METADATA_UUID incompat flag set, have cleared the
CHANGING_FSID_V2 flag and their fsid/metadata_uuid are different. At
the same time the other half of the disks will have their
fsid/metadata_uuid unchanged and will only have CHANGING_FSID_V2 flag.
This is handled by introducing code in the scan path which:
a) Handles the case when a device with CHANGING_FSID_V2 flag is
scanned and as a result btrfs_fs_devices is created with matching
fsid/metdata_uuid. Subsequently, when a device with completed fsid
change is scanned it will detect this via the new code in find_fsid
i.e that such an fs_devices exist that fsid_change flag is set to true,
it's metadata_uuid/fsid match and the metadata_uuid of the scanned
device matches that of the fs_devices. In this case, it's important to
note that the devices which has its fsid change completed will have a
higher generation number than the device with FSID_CHANGING_V2 flag
set, so its superblock block will be used during mount. To prevent an
assertion triggering because the sb used for mounting will have
differing fsid/metadata_uuid than the ones in the fs_devices struct
also add code in device_list_add which overwrites the values in
fs_devices.
b) Alternatively we can end up with a device that completed its
fsid change be scanned first which will create the respective
btrfs_fs_devices struct with differing fsid/metadata_uuid. In this
case when a device with FSID_CHANGING_V2 flag set is scanned it will
call the newly added find_fsid_inprogress function which will return
the correct fs_devices.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In order to gracefully handle split-brain scenario during fsid change
(which are very unlikely, yet possible), two more pieces of information
will be necessary:
1. The highest generation number among all devices registered to a
particular btrfs_fs_devices
2. A boolean flag whether a given btrfs_fs_devices was created by a
device which had the FSID_CHANGING_V2 flag set.
This is a preparatory patch and just introduces the variables as well
as code which sets them, their actual use is going to happen in a later
patch.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Even though fsid change without rewrite is a very quick operation it's
still possible to experience a split-brain scenario if power loss occurs
at the most inconvenient time. This patch handles the case where power
failure occurs while the first transaction (the one setting
CHANGING_FSID_V2) flag is being persisted on disk. This can cause the
btrfs_fs_devices of this filesystem to be created by a device which:
a) has the CHANGING_FSID_V2 flag set but its fsid value is intact
b) or a device which doesn't have CHANGING_FSID_V2 flag set and its
fsid value is intact
This situation is trivially handled by the current find_fsid code since
in both cases the devices are going to be treated like ordinary devices.
Since btrfs is always mounted using the superblock of the latest
device (the one with highest generation number), meaning it will have
the CHANGING_FSID_V2 flag set, ensure it's being cleared on mount. On
the first transaction commit following mount all disks will have it
cleared.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently btrfs_fs_info structure contains a copy of the
fsid/metadata_uuid fields. Same values are also contained in the
btrfs_fs_devices structure which fs_info has a reference to. Let's
reduce duplication by removing the fields from fs_info and always refer
to the ones in fs_devices. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since the metadata_uuid is a new incompat feature it requires the
respective sysfs hooks. This patch adds the 'metdata_uuid' feature to
be shown if it supported by the kernel. Additionally it adds
/sys/fs/btrfs/UUID/metadata_uuid attribute which allows one to read
the current metadata_uuid.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This field is going to be used when the user wants to change the UUID
of the filesystem without having to rewrite all metadata blocks. This
field adds another level of indirection such that when the FSID is
changed what really happens is the current UUID (the one with which the
fs was created) is copied to the 'metadata_uuid' field in the superblock
as well as a new incompat flag is set METADATA_UUID. When the kernel
detects this flag is set it knows that the superblock in fact has 2
UUIDs:
1. Is the UUID which is user-visible, currently known as FSID.
2. Metadata UUID - this is the UUID which is stamped into all on-disk
datastructures belonging to this file system.
When the new incompat flag is present device scanning checks whether
both fsid/metadata_uuid of the scanned device match any of the
registered filesystems. When the flag is not set then both UUIDs are
equal and only the FSID is retained on disk, metadata_uuid is set only
in-memory during mount.
Additionally a new metadata_uuid field is also added to the fs_info
struct. It's initialised either with the FSID in case METADATA_UUID
incompat flag is not set or with the metdata_uuid of the superblock
otherwise.
This commit introduces the new fields as well as the new incompat flag
and switches all users of the fsid to the new logic.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minor updates in comments ]
Signed-off-by: David Sterba <dsterba@suse.com>
Several functions in BTRFS are only used inside the source file they are
declared if CONFIG_BTRFS_FS_RUN_SANITY_TESTS is not defined. However if
CONFIG_BTRFS_FS_RUN_SANITY_TESTS is defined these functions are shared
with the unit tests code.
Before the introduction of the EXPORT_FOR_TESTS macro, these functions
could not be declared as static and the compiler had a harder task when
optimizing and inlining them.
As we have EXPORT_FOR_TESTS now, use it where appropriate to support the
compiler.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Depending on whether CONFIG_BTRFS_FS_RUN_SANITY_TESTS is set, some BTRFS
functions are either local to the file they are implemented in and thus
should be declared static or are called from within the test
implementation defined in a different file.
Introduce an EXPORT_FOR_TESTS macro which depending on
CONFIG_BTRFS_FS_RUN_SANITY_TESTS either adds the 'static' keyword to a
function or not.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
Up to commit 32955c5422 ("btrfs: switch to discard_new_inode()") the
drop_on_err variable in btrfs_mkdir() was used to check whether the
inode had to be dropped via iput().
After commit 32955c5422 ("btrfs: switch to discard_new_inode()")
discard_new_inode() is called when err is set and inode is non NULL.
Therefore drop_on_err is not used anymore and thus causes a warning when
building with -Wunused-but-set-variable.
Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
lock_delalloc_pages should only return 2 values - 0 in case of success
and -EAGAIN if the range of pages to be locked should be shrunk due to
some of gone. Manual inspections confirms that this is indeed the case
since __process_pages_contig is where lock_delalloc_pages gets its
return value. The latter always returns 0 or -EAGAIN so the invariant
holds. No functional changes.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All callers of this function pass BTRFS_MAX_EXTENT_SIZE (128M) so let's
reduce the argument count and make that a local variable. No functional
changes.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It's unnecessary to check map->stripes[i].dev for NULL given its value
is already set and dereferenced above the the check. No functional
changes.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
As of now only user requested replace cancel can cancel the
replace-scrub so no need to log the error.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When we successfully cancel the device replace, its scrub worker returns
-ECANCELED, which is then passed to btrfs_dev_replace_finishing.
It cleans up based on the returned status and propagates the same
-ECANCELED back the parent function. As of now only user can cancel the
replace-scrub, so its ok to silence the warning here.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We recast the replace return status
BTRFS_IOCTL_DEV_REPLACE_RESULT_SCRUB_INPROGRESS to 0, to indicate no
error.
And since BTRFS_IOCTL_DEV_REPLACE_RESULT_NO_ERROR should also return 0,
which is also declared as 0, so we just return. Instead add it to the if
statement so that there is enough clarity while reading the code.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When the replace state is in the suspended state, btrfs_scrub_cancel()
should fail with -ENOTCONN as there is no scrub running. As a safety
catch check if btrfs_scrub_cancel() returns -ENOTCONN and assert if it
doesn't.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The device-replace needs to check the result code of the scrub workers
in btrfs_dev_replace_cancel and distinguish if successful cancel
operation and when the there was no operation running.
If btrfs_scrub_cancel() fails, return
BTRFS_IOCTL_DEV_REPLACE_RESULT_NOT_STARTED so that user can try
to cancel the replace again.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
The device replace cancel thread can race with the replace start thread
and if fs_info::scrubs_running is not yet set, btrfs_scrub_cancel() will
fail to stop the scrub thread.
The scrub thread continues with the scrub for replace which then will
try to write to the target device and which is already freed by the
cancel thread.
scrub_setup_ctx() warns as tgtdev is NULL.
struct scrub_ctx *scrub_setup_ctx(struct btrfs_device *dev, int is_dev_replace)
{
...
if (is_dev_replace) {
WARN_ON(!fs_info->dev_replace.tgtdev); <===
sctx->pages_per_wr_bio = SCRUB_PAGES_PER_WR_BIO;
sctx->wr_tgtdev = fs_info->dev_replace.tgtdev;
sctx->flush_all_writes = false;
}
[ 6724.497655] BTRFS info (device sdb): dev_replace from /dev/sdb (devid 1) to /dev/sdc started
[ 6753.945017] BTRFS info (device sdb): dev_replace from /dev/sdb (devid 1) to /dev/sdc canceled
[ 6852.426700] WARNING: CPU: 0 PID: 4494 at fs/btrfs/scrub.c:622 scrub_setup_ctx.isra.19+0x220/0x230 [btrfs]
...
[ 6852.428928] RIP: 0010:scrub_setup_ctx.isra.19+0x220/0x230 [btrfs]
...
[ 6852.432970] Call Trace:
[ 6852.433202] btrfs_scrub_dev+0x19b/0x5c0 [btrfs]
[ 6852.433471] btrfs_dev_replace_start+0x48c/0x6a0 [btrfs]
[ 6852.433800] btrfs_dev_replace_by_ioctl+0x3a/0x60 [btrfs]
[ 6852.434097] btrfs_ioctl+0x2476/0x2d20 [btrfs]
[ 6852.434365] ? do_sigaction+0x7d/0x1e0
[ 6852.434623] do_vfs_ioctl+0xa9/0x6c0
[ 6852.434865] ? syscall_trace_enter+0x1c8/0x310
[ 6852.435124] ? syscall_trace_enter+0x1c8/0x310
[ 6852.435387] ksys_ioctl+0x60/0x90
[ 6852.435663] __x64_sys_ioctl+0x16/0x20
[ 6852.435907] do_syscall_64+0x50/0x180
[ 6852.436150] entry_SYSCALL_64_after_hwframe+0x49/0xbe
Further, as the replace thread enters scrub_write_page_to_dev_replace()
without the target device it panics:
static int scrub_add_page_to_wr_bio(struct scrub_ctx *sctx,
struct scrub_page *spage)
{
...
bio_set_dev(bio, sbio->dev->bdev); <======
[ 6929.715145] BUG: unable to handle kernel NULL pointer dereference at 00000000000000a0
..
[ 6929.717106] Workqueue: btrfs-scrub btrfs_scrub_helper [btrfs]
[ 6929.717420] RIP: 0010:scrub_write_page_to_dev_replace+0xb4/0x260
[btrfs]
..
[ 6929.721430] Call Trace:
[ 6929.721663] scrub_write_block_to_dev_replace+0x3f/0x60 [btrfs]
[ 6929.721975] scrub_bio_end_io_worker+0x1af/0x490 [btrfs]
[ 6929.722277] normal_work_helper+0xf0/0x4c0 [btrfs]
[ 6929.722552] process_one_work+0x1f4/0x520
[ 6929.722805] ? process_one_work+0x16e/0x520
[ 6929.723063] worker_thread+0x46/0x3d0
[ 6929.723313] kthread+0xf8/0x130
[ 6929.723544] ? process_one_work+0x520/0x520
[ 6929.723800] ? kthread_delayed_work_timer_fn+0x80/0x80
[ 6929.724081] ret_from_fork+0x3a/0x50
Fix this by letting the btrfs_dev_replace_finishing() to do the job of
cleaning after the cancel, including freeing of the target device.
btrfs_dev_replace_finishing() is called when btrfs_scub_dev() returns
along with the scrub return status.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In a secnario where balance and replace co-exists as below,
- start balance
- pause balance
- start replace
- reboot
and when system restarts, balance resumes first. Then the replace is
attempted to restart but will fail as the EXCL_OP lock is already held
by the balance. If so place the replace state back to
BTRFS_IOCTL_DEV_REPLACE_STATE_SUSPENDED state.
Fixes: 010a47bde9 ("btrfs: add proper safety check before resuming dev-replace")
CC: stable@vger.kernel.org # 4.18+
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At the time of forced unmount we place the running replace to
BTRFS_IOCTL_DEV_REPLACE_STATE_SUSPENDED state, so when the system comes
back and expect the target device is missing.
Then let the replace state continue to be in
BTRFS_IOCTL_DEV_REPLACE_STATE_SUSPENDED state instead of
BTRFS_IOCTL_DEV_REPLACE_STATE_STARTED as there isn't any matching scrub
running as part of replace.
Fixes: e93c89c1aa ("Btrfs: add new sources for device replace code")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There isn't any other consumer other than in its own file dev-replace.c.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It's not that impossible to imagine that a device OR a btrfs image is
copied just by using the dd or the cp command. Which in case both the
copies of the btrfs will have the same fsid. If on the system with
automount enabled, the copied FS gets scanned.
We have a known bug in btrfs, that we let the device path be changed
after the device has been mounted. So using this loop hole the new
copied device would appears as if its mounted immediately after it's
been copied.
For example:
Initially.. /dev/mmcblk0p4 is mounted as /
$ lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
mmcblk0 179:0 0 29.2G 0 disk
|-mmcblk0p4 179:4 0 4G 0 part /
|-mmcblk0p2 179:2 0 500M 0 part /boot
|-mmcblk0p3 179:3 0 256M 0 part [SWAP]
`-mmcblk0p1 179:1 0 256M 0 part /boot/efi
$ btrfs fi show
Label: none uuid: 07892354-ddaa-4443-90ea-f76a06accaba
Total devices 1 FS bytes used 1.40GiB
devid 1 size 4.00GiB used 3.00GiB path /dev/mmcblk0p4
Copy mmcblk0 to sda
$ dd if=/dev/mmcblk0 of=/dev/sda
And immediately after the copy completes the change in the device
superblock is notified which the automount scans using btrfs device scan
and the new device sda becomes the mounted root device.
$ lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 1 14.9G 0 disk
|-sda4 8:4 1 4G 0 part /
|-sda2 8:2 1 500M 0 part
|-sda3 8:3 1 256M 0 part
`-sda1 8:1 1 256M 0 part
mmcblk0 179:0 0 29.2G 0 disk
|-mmcblk0p4 179:4 0 4G 0 part
|-mmcblk0p2 179:2 0 500M 0 part /boot
|-mmcblk0p3 179:3 0 256M 0 part [SWAP]
`-mmcblk0p1 179:1 0 256M 0 part /boot/efi
$ btrfs fi show /
Label: none uuid: 07892354-ddaa-4443-90ea-f76a06accaba
Total devices 1 FS bytes used 1.40GiB
devid 1 size 4.00GiB used 3.00GiB path /dev/sda4
The bug is quite nasty that you can't either unmount /dev/sda4 or
/dev/mmcblk0p4. And the problem does not get solved until you take sda
out of the system on to another system to change its fsid using the
'btrfstune -u' command.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of hardcoding exceptions for RAID5 and RAID6 in the code, use an
nparity field in raid_attr.
Signed-off-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com>
Signed-off-by: David Sterba <dsterba@suse.com>
RAID5 and RAID6 profile store one copy of the data, not 2 or 3. These
values are not yet used anywhere so there's no change.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit 92e222df7b "btrfs: alloc_chunk: fix DUP stripe size handling"
fixed calculating the stripe_size for a new DUP chunk.
However, the same calculation reappears a bit later, and that one was
not changed yet. The resulting bug that is exposed is that the newly
allocated device extents ('stripes') can have a few MiB overlap with the
next thing stored after them, which is another device extent or the end
of the disk.
The scenario in which this can happen is:
* The block device for the filesystem is less than 10GiB in size.
* The amount of contiguous free unallocated disk space chosen to use for
chunk allocation is 20% of the total device size, or a few MiB more or
less.
An example:
- The filesystem device is 7880MiB (max_chunk_size gets set to 788MiB)
- There's 1578MiB unallocated raw disk space left in one contiguous
piece.
In this case stripe_size is first calculated as 789MiB, (half of
1578MiB).
Since 789MiB (stripe_size * data_stripes) > 788MiB (max_chunk_size), we
enter the if block. Now stripe_size value is immediately overwritten
while calculating an adjusted value based on max_chunk_size, which ends
up as 788MiB.
Next, the value is rounded up to a 16MiB boundary, 800MiB, which is
actually more than the value we had before. However, the last comparison
fails to detect this, because it's comparing the value with the total
amount of free space, which is about twice the size of stripe_size.
In the example above, this means that the resulting raw disk space being
allocated is 1600MiB, while only a gap of 1578MiB has been found. The
second device extent object for this DUP chunk will overlap for 22MiB
with whatever comes next.
The underlying problem here is that the stripe_size is reused all the
time for different things. So, when entering the code in the if block,
stripe_size is immediately overwritten with something else. If later we
decide we want to have the previous value back, then the logic to
compute it was copy pasted in again.
With this change, the value in stripe_size is not unnecessarily
destroyed, so the duplicated calculation is not needed any more.
Signed-off-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The variable num_bytes is really a way too generic name for a variable
in this function. There are a dozen other variables that hold a number
of bytes as value.
Give it a name that actually describes what it does, which is holding
the size of the chunk that we're allocating.
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The variable num_bytes is used to store the chunk length of the chunk
that we're allocating. Do not reuse it for something really different in
the same function.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Snapshot is expected to be fast. But if there are writers steadily
creating dirty pages in our subvolume, the snapshot may take a very long
time to complete. To fix the problem, we use tagged writepage for
snapshot flusher as we do in the generic write_cache_pages(), so we can
omit pages dirtied after the snapshot command.
This does not change the semantics regarding which data get to the
snapshot, if there are pages being dirtied during the snapshotting
operation. There's a sync called before snapshot is taken in old/new
case, any IO in flight just after that may be in the snapshot but this
depends on other system effects that might still sync the IO.
We do a simple snapshot speed test on a Intel D-1531 box:
fio --ioengine=libaio --iodepth=32 --bs=4k --rw=write --size=64G
--direct=0 --thread=1 --numjobs=1 --time_based --runtime=120
--filename=/mnt/sub/testfile --name=job1 --group_reporting & sleep 5;
time btrfs sub snap -r /mnt/sub /mnt/snap; killall fio
original: 1m58sec
patched: 6.54sec
This is the best case for this patch since for a sequential write case,
we omit nearly all pages dirtied after the snapshot command.
For a multi writers, random write test:
fio --ioengine=libaio --iodepth=32 --bs=4k --rw=randwrite --size=64G
--direct=0 --thread=1 --numjobs=4 --time_based --runtime=120
--filename=/mnt/sub/testfile --name=job1 --group_reporting & sleep 5;
time btrfs sub snap -r /mnt/sub /mnt/snap; killall fio
original: 15.83sec
patched: 10.35sec
The improvement is smaller compared to the sequential write case,
since we omit only half of the pages dirtied after snapshot command.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Ethan Lien <ethanlien@synology.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This parameter was never used, yet was part of the interface of the
function ever since its introduction as extent_io_ops::writepage_end_io_hook
in e6dcd2dc9c ("Btrfs: New data=ordered implementation"). Now that
NULL is passed everywhere as a value for this parameter let's remove it
for good. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The only remaining use of the 'epd' argument in writepage_delalloc is
to reference the extent_io_tree which was set in extent_writepages. Since
it is guaranteed that page->mapping of any page passed to
writepage_delalloc (and __extent_writepage as the sole caller) to be
equal to that passed in extent_writepages we can directly get the
io_tree via the already passed inode (which is also taken from
page->mapping->host). No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If epd::extent_locked is set then writepage_delalloc terminates. Make
this a bit more apparent in the caller by simply bubbling the check up.
This enables to remove epd as an argument to writepage_delalloc in a
future patch. No functional change.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Before btrfs_map_bio submits all stripe bios it does a number of checks
to ensure the device for every stripe is present. However, it doesn't do
a DEV_STATE_MISSING check, instead this is relegated to the lower level
btrfs_schedule_bio (in the async submission case, sync submission
doesn't check DEV_STATE_MISSING at all). Additionally
btrfs_schedule_bios does the duplicate device->bdev check which has
already been performed in btrfs_map_bio.
This patch moves the DEV_STATE_MISSING check in btrfs_map_bio and
removes the duplicate device->bdev check. Doing so ensures that no bio
cloning/submission happens for both async/sync requests in the face of
missing device. This makes the async io submission path slightly shorter
in terms of instruction count. No functional changes.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
dev_replace::replace_state has been set to
BTRFS_DEV_REPLACE_ITEM_STATE_NEVER_STARTED (0) in the same function,
So delete the line which sets replace_state = 0;
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The io_err field of struct btrfs_log_ctx is no longer used after the
recent simplification of the fast fsync path, where we now wait for
ordered extents to complete before logging the inode. We did this in
commit b5e6c3e170 ("btrfs: always wait on ordered extents at fsync
time") and commit a2120a473a ("btrfs: clean up the left over
logged_list usage") removed its last use.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We currently are in a loop finding each range (corresponding to a btree
node/leaf) in a log root's extent io tree and then clean it up. This is a
waste of time since we are traversing the extent io tree's rb_tree more
times then needed (one for a range lookup and another for cleaning it up)
without any good reason.
We free the log trees when we are in the critical section of a transaction
commit (the transaction state is set to TRANS_STATE_COMMIT_DOING), so it's
of great convenience to do everything as fast as possible in order to
reduce the time we block other tasks from starting a new transaction.
So fix this by traversing the extent io tree once and cleaning up all its
records in one go while traversing it.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The loop construct in free_extent_buffer was added in
242e18c7c1 ("Btrfs: reduce lock contention on extent buffer locks")
as means of reducing the times the eb lock is taken, the non-last ref
count is decremented and lock is released. As the special handling
of UNMAPPED extent buffers was removed now there is only one decrement
op which is happening for EXTENT_BUFFER_UNMAPPED case.
This commit modifies the loop condition so that in case of UNMAPPED
buffers the eb's lock is taken only if we are 100% sure the eb is going
to be freed by the current executor of the code. Additionally, remove
superfluous ref count ops in btrfs test.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that the whole of btrfs code has been audited for eb reference count
management it's time to remove the hunk in free_extent_buffer that
essentially considered the condition
"eb->ref == 2 && EXTENT_BUFFER_DUMMY"
to equal "eb->ref = 1". Also remove the last location
which takes an extra reference count in alloc_test_extent_buffer.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In qgroup_rescan_leaf a copy is made of the target leaf by calling
btrfs_clone_extent_buffer. The latter allocates a new buffer and
attaches a new set of pages and copies the content of the source buffer.
The new scratch buffer is only used to iterate it's items, it's not
published anywhere and cannot be accessed by a third party.
Hence, it's not necessary to perform any locking on it whatsoever.
Furthermore, remove the extra extent_buffer_get call since the new
buffer is always allocated with a reference count of 1 which is
sufficient here. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When the 2 comparison trees roots are initialised they are private to
the function and already have reference counts of 1 each. There is no
need to further increment the reference count since the cloned buffers
are already accessed via struct btrfs_path. Eventually the 2 paths used
for comparison are going to be released, effectively disposing of the
cloned buffers.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When a rewound buffer is created it already has a ref count of 1 and the
dummy flag set. Then another ref is taken bumping the count to 2.
Finally when this buffer is released from btrfs_release_path the extra
reference is decremented by the special handling code in
free_extent_buffer.
However, this special code is in fact redundant sinca ref count of 1 is
still correct since the buffer is only accessed via btrfs_path struct.
This paves the way forward of removing the special handling in
free_extent_buffer.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
get_old_root used used only by btrfs_search_old_slot to initialise the
path structure. The old root is always a cloned buffer (either via alloc
dummy or via btrfs_clone_extent_buffer) and its reference count is 2: 1
from allocation, 1 from extent_buffer_get call in get_old_root.
This latter explicit ref count acquire operation is in fact unnecessary
since the semantic is such that the newly allocated buffer is handed
over to the btrfs_path for lifetime management. Considering this just
remove the extra extent_buffer_get in get_old_root.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In iterate_inode_exrefs the eb is cloned via btrfs_clone_extent_buffer
which creates a private extent buffer with the dummy flag set and ref
count of 1. Then this buffer is locked for reading and its ref count is
incremented by 1. Finally it's fed to the passed iterate_irefs_t
function. The actual iterate call back is inode_to_path (coming from
paths_from_inode) which feeds the eb to btrfs_ref_to_path. In this final
function the passed eb is only read by first assigning it to the local
eb variable. This variable is only modified in the case another eb was
referenced from the passed path that is eb != eb_in check triggers.
Considering this there is no point in locking the cloned eb in
iterate_inode_refs since it's never being modified and is not published
anywhere. Furthermore the cloned eb is completely fine having its ref
count be 1.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In iterate_inode_refs the eb is cloned via btrfs_clone_extent_buffer
which creates a private extent buffer with the dummy flag set and ref
count of 1. Then this buffer is locked for reading and its ref count is
incremented by 1. Finally it's fed to the passed iterate_irefs_t
function. The actual iterate call back is inode_to_path (coming from
paths_from_inode) which feeds the eb to btrfs_ref_to_path. In this final
function the passed eb is only read by first assigning it to the local
eb variable. This variable is only modified in the case another eb was
referenced from the passed path that is eb != eb_in check triggers.
Considering this there is no point in locking the cloned eb in
iterate_inode_refs since it's never being modified and is not published
anywhere. Furthermore the cloned eb is completely fine having its ref
count be 1.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In extent-io self test, we need 2 ordered extents at its maximum size to
do the test.
Instead of using the intermediate numbers, use BTRFS_MAX_EXTENT_SIZE for
@max_bytes, and twice @max_bytes for @total_dirty. This should explain
why we need all these magic numbers and prevent people to modify them by
accident.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Btrfs has not allowed swap files since commit 35054394c4 ("Btrfs: stop
providing a bmap operation to avoid swapfile corruptions"). However, now
that the proper restrictions are in place, Btrfs can support swap files
through the swap file a_ops, similar to iomap in commit 67482129cd
("iomap: add a swapfile activation function").
For Btrfs, activation needs to make sure that the file can be used as a
swap file, which currently means that it must be fully allocated as
NOCOW with no compression on one device. It must also do the proper
tracking so that ioctls will not interfere with the swap file.
Deactivation clears this tracking.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The Btrfs swap code is going to need it, so give it a btrfs_ prefix and
make it non-static.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
A later patch will implement swap file support for Btrfs, but before we
do that, we need to make sure that the various Btrfs ioctls cannot
change a swap file.
When a swap file is active, we must make sure that the extents of the
file are not moved and that they don't become shared. That means that
the following are not safe:
- chattr +c (enable compression)
- reflink
- dedupe
- snapshot
- defrag
Don't allow those to happen on an active swap file.
Additionally, balance, resize, device remove, and device replace are
also unsafe if they affect an active swapfile. Add a red-black tree of
block groups and devices which contain an active swapfile. Relocation
checks each block group against this tree and skips it or errors out for
balance or resize, respectively. Device remove and device replace check
the tree for the device they will operate on.
Note that we don't have to worry about chattr -C (disable nocow), which
we ignore for non-empty files, because an active swapfile must be
non-empty and can't be truncated. We also don't have to worry about
autodefrag because it's only done on COW files. Truncate and fallocate
are already taken care of by the generic code. Device add doesn't do
relocation so it's not an issue, either.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is the counterpart to merge_extent_hook, similarly, it's used only
for data/freespace inodes so let's remove it, rename it and call it
directly where necessary. No functional changes.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This callback is used only for data and free space inodes. Such inodes
are guaranteed to have their extent_io_tree::private_data set to the
inode struct. Exploit this fact to directly call the function. Also give
it a more descriptive name. No functional changes.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is the counterpart to ex-set_bit_hook (now btrfs_set_delalloc_extent),
similar to what was done before remove clear_bit_hook and rename the
function. No functional changes.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This callback is used to properly account delalloc extents for data
inodes (ordinary file inodes and freespace v1 inodes). Those can be
easily identified since they have their extent_io trees ->private_data
member point to the inode. Let's exploit this fact to remove the
needless indirection through extent_io_hooks and directly call the
function. Also give the function a name which reflects its purpose -
btrfs_set_delalloc_extent.
This patch also modified test_find_delalloc so that the extent_io_tree
used for testing doesn't have its ->private_data set which would have
caused a crash in btrfs_set_delalloc_extent due to the btrfs_inode->root
member not being initialised. The old version of the code also didn't
call set_bit_hook since the extent_io ops weren't set for the inode. No
functional changes.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This callback was only used in debug builds by btrfs_leak_debug_check.
A better approach is to move its implementation in
btrfs_leak_debug_check and ensure the latter is only executed for extent
tree which have ->private_data set i.e. relate to a data node and not
the btree one. No functional changes.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This callback is ony ever called for data page writeout so there is no
need to actually abstract it via extent_io_ops. Lets just export it,
remove the definition of the callback and call it directly in the
functions that invoke the callback. Also rename the function to
btrfs_writepage_endio_finish_ordered since what it really does is
account finished io in the ordered extent data structures. No
functional changes.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This hook is called only from __extent_writepage_io which is already
called only from the data page writeout path. So there is no need to
make an indirect call via extent_io_ops. This patch just removes the
callback definition, exports the callback function and calls it directly
at the only call site. Also give the function a more descriptive name.
No functional changes.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This callback is called only from writepage_delalloc which in turn is
guaranteed to be called from the data page writeout path. In the end
there is no reason to have the call to this function to be indrected via
the extent_io_ops structure. This patch removes the callback definition,
exports the function and calls it directly. No functional changes.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ rename to btrfs_run_delalloc_range ]
Signed-off-by: David Sterba <dsterba@suse.com>
This will be used in future patches that remove the optional
extent_io_ops callbacks.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add extra dev extent end check against device boundary.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Enhance btrfs_verify_dev_extents() to remember previous checked dev
extents, so it can verify no dev extents can overlap.
Analysis from Hans:
"Imagine allocating a DATA|DUP chunk.
In the chunk allocator, we first set...
max_stripe_size = SZ_1G;
max_chunk_size = BTRFS_MAX_DATA_CHUNK_SIZE
... which is 10GiB.
Then...
/* we don't want a chunk larger than 10% of writeable space */
max_chunk_size = min(div_factor(fs_devices->total_rw_bytes, 1),
max_chunk_size);
Imagine we only have one 7880MiB block device in this filesystem. Now
max_chunk_size is down to 788MiB.
The next step in the code is to search for max_stripe_size * dev_stripes
amount of free space on the device, which is in our example 1GiB * 2 =
2GiB. Imagine the device has exactly 1578MiB free in one contiguous
piece. This amount of bytes will be put in devices_info[ndevs - 1].max_avail
Next we recalculate the stripe_size (which is actually the device extent
length), based on the actual maximum amount of available raw disk space:
stripe_size = div_u64(devices_info[ndevs - 1].max_avail, dev_stripes);
stripe_size is now 789MiB
Next we do...
data_stripes = num_stripes / ncopies
...where data_stripes ends up as 1, because num_stripes is 2 (the amount
of device extents we're going to have), and DUP has ncopies 2.
Next there's a check...
if (stripe_size * data_stripes > max_chunk_size)
...which matches because 789MiB * 1 > 788MiB.
We go into the if code, and next is...
stripe_size = div_u64(max_chunk_size, data_stripes);
...which resets stripe_size to max_chunk_size: 788MiB
Next is a fun one...
/* bump the answer up to a 16MB boundary */
stripe_size = round_up(stripe_size, SZ_16M);
...which changes stripe_size from 788MiB to 800MiB.
We're not done changing stripe_size yet...
/* But don't go higher than the limits we found while searching
* for free extents
*/
stripe_size = min(devices_info[ndevs - 1].max_avail,
stripe_size);
This is bad. max_avail is twice the stripe_size (we need to fit 2 device
extents on the same device for DUP).
The result here is that 800MiB < 1578MiB, so it's unchanged. However,
the resulting DUP chunk will need 1600MiB disk space, which isn't there,
and the second dev_extent might extend into the next thing (next
dev_extent? end of device?) for 22MiB.
The last shown line of code relies on a situation where there's twice
the value of stripe_size present as value for the variable stripe_size
when it's DUP. This was actually the case before commit 92e222df7b
"btrfs: alloc_chunk: fix DUP stripe size handling", from which I quote:
"[...] in the meantime there's a check to see if the stripe_size does
not exceed max_chunk_size. Since during this check stripe_size is twice
the amount as intended, the check will reduce the stripe_size to
max_chunk_size if the actual correct to be used stripe_size is more than
half the amount of max_chunk_size."
In the previous version of the code, the 16MiB alignment (why is this
done, by the way?) would result in a 50% chance that it would actually
do an 8MiB alignment for the individual dev_extents, since it was
operating on double the size. Does this matter?
Does it matter that stripe_size can be set to anything which is not
16MiB aligned because of the amount of remaining available disk space
which is just taken?
What is the main purpose of this round_up?
The most straightforward thing to do seems something like...
stripe_size = min(
div_u64(devices_info[ndevs - 1].max_avail, dev_stripes),
stripe_size
)
..just putting half of the max_avail into stripe_size."
Link: https://lore.kernel.org/linux-btrfs/b3461a38-e5f8-f41d-c67c-2efac8129054@mendix.com/
Reported-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
[ add analysis from report ]
Signed-off-by: David Sterba <dsterba@suse.com>
We have a complex loop design for find_free_extent(), that has different
behavior for each loop, some even includes new chunk allocation.
Instead of putting such a long code into find_free_extent() and makes it
harder to read, just extract them into find_free_extent_update_loop().
With all the cleanups, the main find_free_extent() should be pretty
barebone:
find_free_extent()
|- Iterate through all block groups
| |- Get a valid block group
| |- Try to do clustered allocation in that block group
| |- Try to do unclustered allocation in that block group
| |- Check if the result is valid
| | |- If valid, then exit
| |- Jump to next block group
|
|- Push harder to find free extents
|- If not found, re-iterate all block groups
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Su Yue <suy.fnst@cn.fujitsu.com>
[ copy callchain from changelog to function comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
This patch will extract unclsutered extent allocation code into
find_free_extent_unclustered().
And this helper function will use return value to indicate what to do
next.
This should make find_free_extent() a little easier to read.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Su Yue <suy.fnst@cn.fujitsu.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
[Update merge conflict with fb5c39d7a8 ("btrfs: don't use ctl->free_space for max_extent_size")]
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have two main methods to find free extents inside a block group:
1) clustered allocation
2) unclustered allocation
This patch will extract the clustered allocation into
find_free_extent_clustered() to make it a little easier to read.
Instead of jumping between different labels in find_free_extent(), the
helper function will use return value to indicate different behavior.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Su Yue <suy.fnst@cn.fujitsu.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of tons of different local variables in find_free_extent(),
extract them into find_free_extent_ctl structure, and add better
explanation for them.
Some modification may looks redundant, but will later greatly simplify
function parameter list during find_free_extent() refactor.
Also add two comments to co-operate with fb5c39d7a8 ("btrfs: don't use
ctl->free_space for max_extent_size"), to make ffe_ctl->max_extent_size
update more reader-friendly.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Su Yue <suy.fnst@cn.fujitsu.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Introduce a new wrapper update_bytes_pinned to replace open coded
bytes_pinned modifiers. Now the underflows of space_info::bytes_pinned
get detected and reported.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Although we have space_info::bytes_may_use underflow detection in
btrfs_free_reserved_data_space_noquota(), we have more callers who are
subtracting number from space_info::bytes_may_use.
So instead of doing underflow detection for every caller, introduce a
new wrapper update_bytes_may_use() to replace open coded bytes_may_use
modifiers.
This also introduce a macro to declare more wrappers, but currently
space_info::bytes_may_use is the mostly interesting one.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Tracking pending ordered extents per transaction was introduced in commit
50d9aa99bd ("Btrfs: make sure logged extents complete in the current
transaction V3") and later updated in commit 161c3549b4 ("Btrfs: change
how we wait for pending ordered extents").
However now that on fsync we always wait for ordered extents to complete
before logging, done in commit 5636cf7d6d ("btrfs: remove the logged
extents infrastructure"), we no longer need the stuff to track for pending
ordered extents, which was not completely removed in the mentioned commit.
So remove the remaining of the pending ordered extents infrastructure.
Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The logged_start and logged_end variables, at btrfs_log_changed_extents,
were added in commit 8c6c592831 ("btrfs: log csums for all modified
extents"). However since the recent simplification for fsync, which makes
us wait for all ordered extents to complete before logging extents, we
no longer need those variables. Commit a2120a473a ("btrfs: clean up the
left over logged_list usage") forgot to remove them.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Use the aptly named function rather than open coding it. No functional
changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Merge misc fixes from Andrew Morton:
"11 fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
scripts/spdxcheck.py: always open files in binary mode
checkstack.pl: fix for aarch64
userfaultfd: check VM_MAYWRITE was set after verifying the uffd is registered
fs/iomap.c: get/put the page in iomap_page_create/release()
hugetlbfs: call VM_BUG_ON_PAGE earlier in free_huge_page()
memblock: annotate memblock_is_reserved() with __init_memblock
psi: fix reference to kernel commandline enable
arch/sh/include/asm/io.h: provide prototypes for PCI I/O mapping in asm/io.h
mm/sparse: add common helper to mark all memblocks present
mm: introduce common STRUCT_PAGE_MAX_SHIFT define
alpha: fix hang caused by the bootmem removal
Calling UFFDIO_UNREGISTER on virtual ranges not yet registered in uffd
could trigger an harmless false positive WARN_ON. Check the vma is
already registered before checking VM_MAYWRITE to shut off the false
positive warning.
Link: http://lkml.kernel.org/r/20181206212028.18726-2-aarcange@redhat.com
Cc: <stable@vger.kernel.org>
Fixes: 29ec90660d ("userfaultfd: shmem/hugetlbfs: only allow to register VM_MAYWRITE vmas")
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: syzbot+06c7092e7d71218a2c16@syzkaller.appspotmail.com
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Hugh Dickins <hughd@google.com>
Acked-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
migrate_page_move_mapping() expects pages with private data set to have
a page_count elevated by 1. This is what used to happen for xfs through
the buffer_heads code before the switch to iomap in commit 82cb14175e
("xfs: add support for sub-pagesize writeback without buffer_heads").
Not having the count elevated causes move_pages() to fail on memory
mapped files coming from xfs.
Make iomap compatible with the migrate_page_move_mapping() assumption by
elevating the page count as part of iomap_page_create() and lowering it
in iomap_page_release().
It causes the move_pages() syscall to misbehave on memory mapped files
from xfs. It does not not move any pages, which I suppose is "just" a
perf issue, but it also ends up returning a positive number which is out
of spec for the syscall. Talking to Michal Hocko, it sounds like
returning positive numbers might be a necessary update to move_pages()
anyway though
(https://lkml.kernel.org/r/20181116114955.GJ14706@dhcp22.suse.cz).
I only hit this in tests that verify that move_pages() actually moved
the pages. The test also got confused by the positive return from
move_pages() (it got treated as a success as positive numbers were not
expected and not handled) making it a bit harder to track down what's
going on.
Link: http://lkml.kernel.org/r/20181115184140.1388751-1-pjaroszynski@nvidia.com
Fixes: 82cb14175e ("xfs: add support for sub-pagesize writeback without buffer_heads")
Signed-off-by: Piotr Jaroszynski <pjaroszynski@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: William Kucharski <william.kucharski@oracle.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Brian Foster <bfoster@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAlwUCecQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpsAvD/9Uwn/Mthw/w3BBHL1eVjFnCGHJ8/zVDSSu
/XaP3nXugHicL7Je2EyHs0l4teUC42nIU7SQ+9B/UtyXA+usNvV7YEYqTD0U4ZYy
gLuBL70f6L17uDq5821jR4e64Q3sDy/OgcTpKu3VwnJPYUhJpQuU7fKqN4zi7Asn
5sI4KD99txmFMRNu33uusy98kJ4w5aeg6W1ACP7AxXYjyCY6Chi/7go8fEEK1Szb
9jx0BhbscuZplvvzHw3WAHAaVGGUIjLlmUdqJOisj/GVnCck2IqKqhXND9H/WNWk
FmzJeNTIq9Ag6kKN2WJ6xweNF6U+0Kb1Nn3F4jhoaVBhLxhpsjZ6BdHGWiT5wu4T
xd+hIC4pykA7Y7CqPfpbNZzriX4SJd508XerVjfBmgWIumJLG2D2IfquHmT0udRw
Qlb8pa3na5/dWXYbCOSjQdeBC5vCKCUuGnw15F5wA293khBNGiXVlztwCNgrkclQ
4IpAEUYpRZiK138hDx/BL++yanALYdPHHUS4HtwVMhpi89Bel3FioNzGaaBlElZc
mdSCyHc0RC7uMPMEbWl4bR+/cE3yjf43Gg3ftLaKLo5IQmZtu1pCEVXMatplZ1U/
jbbwwPtx44RYV3gjPirr+YgF0smS9ZgMkkk3oLwysWSjAV5v+wW2MxjTBae3xaSJ
VNoLhtefkw==
=wHit
-----END PGP SIGNATURE-----
Merge tag 'for-linus-20181214' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
"Three small fixes for this week. contains:
- spectre indexing fix for aio (Jeff)
- fix for the previous zeroing bio fix, we don't need it for user
mapped pages, and in fact it breaks some applications if we do
(Keith)
- allocation failure fix for null_blk with zoned (Shin'ichiro)"
* tag 'for-linus-20181214' of git://git.kernel.dk/linux-block:
block: Fix null_blk_zoned creation failure with small number of zones
aio: fix spectre gadget in lookup_ioctx
block/bio: Do not zero user pages
Commit 9d5b86ac13 ("fs/locks: Remove fl_nspid and use fs-specific l_pid
for remote locks") specified that the l_pid returned for F_GETLK on a local
file that has a remote lock should be the pid of the lock manager process.
That commit, while updating other filesystems, failed to update lockd, such
that locks created by lockd had their fl_pid set to that of the remote
process holding the lock. Fix that here to be the pid of lockd.
Also, fix the client case so that the returned lock pid is negative, which
indicates a remote lock on a remote file.
Fixes: 9d5b86ac13 ("fs/locks: Remove fl_nspid and use fs-specific...")
Cc: stable@vger.kernel.org
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
side. Disable it for now.
-----BEGIN PGP SIGNATURE-----
iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAlwTzaUTHGlkcnlvbW92
QGdtYWlsLmNvbQAKCRBKf944AhHzi6eyB/97jq3Wzipr9I+Y3LwJeNKUu/8+LZ1d
EDSbL1mgskzv0iX9fCds+wTzMksvm4V+n7I/DvoadbSEgTz2dUXYqoKBf3+4PEd1
f7K69XHe8X/+ME3wGoOc4oLgNws0yGE5kcmiP5wB9Tlo8EkfqFuu6753wWi8xaUf
so3TX9C+prqYmmhqRI8sesU84rMKjTkkK5dvIgVW4WLBJKsVAhmooyuRSwdVGZOv
m7+8m7PWUJm8stMRDqsJ/0j4COiMeteQGQx6T67gkCUF6pBdJjO2IRNx2VAhybsx
k/sMM/VfrQth/V4zsHZQD75BDQ+wu3Ncz8axIT5QYsAtuh9x0zr2av2e
=kQS2
-----END PGP SIGNATURE-----
Merge tag 'ceph-for-4.20-rc7' of https://github.com/ceph/ceph-client
Pull ceph fix from Ilya Dryomov:
"Luis discovered a problem with the new copyfrom offload on the server
side. Disable it for now"
* tag 'ceph-for-4.20-rc7' of https://github.com/ceph/ceph-client:
ceph: make 'nocopyfrom' a default mount option
This patch reorders flow from
- update page
- set_page_dirty
- wait_on_page_writeback
to
- wait_on_page_writeback
- update page
- set_page_dirty
The reason is:
- set_page_dirty will increase reference of dirty page, the reference
should be cleared before wait_on_page_writeback to keep its consistency.
- some devices need stable page during page writebacking, so we
should not change page's data.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If IPU failed, nothing is commited, we should end page writeback.
Signed-off-by: Sheng Yong <shengyong1@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This adds an option in ioctl(F2FS_IOC_SHUTDOWN) in order to trigger fsck by
setting a NEED_FSCK flag.
Generally, shutdown is used for the test to validate filesystem consistency, and
setting NEED_FSCK flag can be used for Android to trigger fsck.f2fs at boot time
explicitly so that we could measure the elapsed time as well as force filesystem
check.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
proc_sys_lookup can fail with ENOMEM instead of ENOENT when the
corresponding sysctl table is being unregistered. In our case we see
this upon opening /proc/sys/net/*/conf files while network interfaces
are being deleted, which confuses our configuration daemon.
The problem was successfully reproduced and this fix tested on v4.9.122
and v4.20-rc6.
v2: return ERR_PTRs in all cases when proc_sys_make_inode fails instead
of mixing them with NULL. Thanks Al Viro for the feedback.
Fixes: ace0c791e6 ("proc/sysctl: Don't grab i_lock under sysctl_lock.")
Cc: stable@vger.kernel.org
Signed-off-by: Ivan Delalande <colona@arista.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
UBIFS's recovery code strictly assumes that a deleted inode will never
come back, therefore it removes all data which belongs to that inode
as soon it faces an inode with link count 0 in the replay list.
Before O_TMPFILE this assumption was perfectly fine. With O_TMPFILE
it can lead to data loss upon a power-cut.
Consider a journal with entries like:
0: inode X (nlink = 0) /* O_TMPFILE was created */
1: data for inode X /* Someone writes to the temp file */
2: inode X (nlink = 0) /* inode was changed, xattr, chmod, … */
3: inode X (nlink = 1) /* inode was re-linked via linkat() */
Upon replay of entry #2 UBIFS will drop all data that belongs to inode X,
this will lead to an empty file after mounting.
As solution for this problem, scan the replay list for a re-link entry
before dropping data.
Fixes: 474b93704f ("ubifs: Implement O_TMPFILE")
Cc: stable@vger.kernel.org
Cc: Russell Senior <russell@personaltelco.net>
Cc: Rafał Miłecki <zajec5@gmail.com>
Reported-by: Russell Senior <russell@personaltelco.net>
Reported-by: Rafał Miłecki <zajec5@gmail.com>
Tested-by: Rafał Miłecki <rafal@milecki.pl>
Signed-off-by: Richard Weinberger <richard@nod.at>
When ubifs is build without the LZO compressor and no compressor is
given the creation of the default file system will fail. before
selection the LZO compressor check if it is present and if not fall back
to the zlib or none.
Signed-off-by: Gabor Juhos <juhosg@openwrt.org>
Signed-off-by: Hauke Mehrtens <hauke@hauke-m.de>
Signed-off-by: Richard Weinberger <richard@nod.at>
If the call to ubifs_read_nnode() fails in ubifs_lpt_calc_hash() an
error is returned without freeing the memory allocated to 'buf'.
Read and check the root node before allocating the buffer.
Detected by CoverityScan, CID 1441025 ("Resource leak")
Signed-off-by: Garry McNulty <garrmcnu@gmail.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
The new authentication support causes a build failure
when CONFIG_KEYS is disabled, so add a dependency.
fs/ubifs/auth.c: In function 'ubifs_init_authentication':
fs/ubifs/auth.c:249:16: error: implicit declaration of function 'request_key'; did you mean 'request_irq'? [-Werror=implicit-function-declaration]
keyring_key = request_key(&key_type_logon, c->auth_key_name, NULL);
Fixes: d8a22773a1 ("ubifs: Enable authentication support")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Richard Weinberger <richard@nod.at>
Instead of adding yet another dependency on UBIFS_FS, wrap the whole
block of ubifs config options in a single "if UBIFS_FS".
Fixes: d8a22773a1 ("ubifs: Enable authentication support")
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Sascha Hauer <s.hauer@pengutronix.de>
Signed-off-by: Richard Weinberger <richard@nod.at>
Having two shash descriptors on the stack cause a very significant kernel
stack usage that can cross the warning threshold:
fs/ubifs/replay.c: In function 'authenticate_sleb':
fs/ubifs/replay.c:633:1: error: the frame size of 1144 bytes is larger than 1024 bytes [-Werror=frame-larger-than=]
Normally, gcc optimizes the out, but with CONFIG_CC_OPTIMIZE_FOR_DEBUGGING,
it does not. Splitting the two stack allocations into separate functions
means that they will use the same memory again. In normal configurations
(optimizing for size or performance), those should get inlined and we get
the same behavior as before.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Richard Weinberger <richard@nod.at>
Since mkfs always formats the filesystem with the realtime bitmap and
summary inodes immediately after the root directory, we should expect
that both of them are present and loadable, even if there isn't a
realtime volume attached. There's no reason to skip this if rbmino ==
NULLFSINO; in fact, this causes an immediate crash if the there /is/ a
realtime volume and someone writes to it.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCXBDMLgAKCRDh3BK/laaZ
PPuRAP0X4zYWFh3mcGlcjjfzaP2W/3F8nVsXjo+YADi9nJ+wAwD+LIeL7zGr8Mw8
EixiC+OJyL31O5ZOyHGoPEhhDz4O+Ao=
=hWRh
-----END PGP SIGNATURE-----
Merge tag 'ovl-fixes-4.20-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs
Pull overlayfs fixes from Miklos Szeredi:
"Needed to revert a patch, because it possibly introduces a security
hole. Since the patch is basically a conceptual cleanup, not a bug
fix, it's safe to revert. I'm not giving up on this, and discussions
seemed to have reached an agreement over how to move forward, but that
can wait 'till the next release.
The other two patches are fixes for bugs introduced in recent
releases"
* tag 'ovl-fixes-4.20-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
Revert "ovl: relax permission checking on underlying layers"
ovl: fix decode of dir file handle with multi lower layers
ovl: fix missing override creds in link of a metacopy upper
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCXBDKkAAKCRDh3BK/laaZ
PCXdAPwOWqLXpkBL76YaIbgFVzS+S5btlhHwVSZ0w/r7HGA3uQD+IgsHbky1MdSv
rYyKcg+lVzA7GI7tcoQUhC2D9aZ8tAQ=
=I0eL
-----END PGP SIGNATURE-----
Merge tag 'fuse-fixes-4.20-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
Pull fuse fixes from Miklos Szeredi:
"There's one patch fixing a minor but long lived bug, the others are
fixing regressions introduced in this cycle"
* tag 'fuse-fixes-4.20-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
fuse: continue to send FUSE_RELEASEDIR when FUSE_OPEN returns ENOSYS
fuse: Fix memory leak in fuse_dev_free()
fuse: fix revalidation of attributes for permission check
fuse: fix fsync on directory
fuse: Add bad inode check in fuse_destroy_inode()
The realtime summary is a two-dimensional array on disk, effectively:
u32 rsum[log2(number of realtime extents) + 1][number of blocks in the bitmap]
rsum[log][bbno] is the number of extents of size 2**log which start in
bitmap block bbno.
xfs_rtallocate_extent_near() uses xfs_rtany_summary() to check whether
rsum[log][bbno] != 0 for any log level. However, the summary array is
stored in row-major order (i.e., like an array in C), so all of these
entries are not adjacent, but rather spread across the entire summary
file. In the worst case (a full bitmap block), xfs_rtany_summary() has
to check every level.
This means that on a moderately-used realtime device, an allocation will
waste a lot of time finding, reading, and releasing buffers for the
realtime summary. In particular, one of our storage services (which runs
on servers with 8 very slow CPUs and 15 8 TB XFS realtime filesystems)
spends almost 5% of its CPU cycles in xfs_rtbuf_get() and
xfs_trans_brelse() called from xfs_rtany_summary().
One solution would be to also store the summary with the dimensions
swapped. However, this would require a disk format change to a very old
component of XFS.
Instead, we can cache the minimum size which contains any extents. We do
so lazily; rather than guaranteeing that the cache contains the precise
minimum, it always contains a loose lower bound which we tighten when we
read or update a summary block. This only uses a few kilobytes of memory
and is already serialized via the realtime bitmap and summary inode
locks, so the cost is minimal. With this change, the same workload only
spends 0.2% of its CPU cycles in the realtime allocator.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
A big block filesystem might require more than one inobt record to cover
all the inodes in the block. In these cases it is not correct to round
the irec count up to the nearest block because this causes us to
overestimate the number of inode blocks we expect to find.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Store the inode cluster alignment information in units of inodes and
blocks in the mount data so that we don't have to keep recalculating
them.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Store the number of inodes and blocks per inode cluster in the mount
data so that we don't have to keep recalculating them.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Add new helpers to convert units of fs blocks into inodes, and AG blocks
into AG inodes, respectively. Convert all the open-coded conversions
and XFS_OFFBNO_TO_AGINO(, , 0) calls to use them, as appropriate. The
OFFBNO_TO_AGINO macro is retained for xfs_repair.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Owner information for static fs metadata can be defined readonly at
build time because it never changes across filesystems. This enables us
to reduce stack usage (particularly in scrub) because we can use the
statically defined oinfo structures.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Only certain functions actually change the contents of an
xfs_owner_info; the rest can accept a const struct pointer. This will
enable us to save stack space by hoisting static owner info types to
be const global variables.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
There's no need to bundle a pointer to the defer op type into the defer
op control structure. Instead, store the defer op type enum, which
enables us to shorten some of the lines.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Recently, we forgot to port a new defer op type to xfsprogs, which
caused us some userspace pain. Reorganize the way we make libxfs
clients supply defer op type information so that all type information
has to be provided at build time instead of risky runtime dynamic
configuration.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
A log recovery failure has been reproduced where a symlink inode has
a zero length in extent form. It was caused by a shutdown during a
combined fstress+fsmark workload.
The underlying problem is the issue in xfs_inactive_symlink(): the
inode is unlocked between the symlink inactivation/truncation and
the inode being freed. This opens a window for the inode to be
written to disk before it xfs_ifree() removes it from the unlinked
list, marks it free in the inobt and zeros the mode.
For shortform inodes, the fix is simple. xfs_ifree() clears the data
fork state, so there's no need to do it in xfs_inactive_symlink().
This means the shortform fork verifier will not see a zero length
data fork as it mirrors the inode size through to xfs_ifree()), and
hence if the inode gets written back and the fork verifiers are run
they will still see a fork that matches the on-disk inode size.
For extent form (remote) symlinks, it is a little more tricky. Here
we explicitly set the inode size to zero, so the above race can lead
to zero length symlinks on disk. Because the inode is unlinked at
this point (i.e. on the unlinked list) and unreferenced, it can
never be seen again by a user. Hence when we set the inode size to
zeor, also change the type to S_IFREG. xfs_ifree() expects S_IFREG
inodes to be of zero length, and so this avoids all the problems of
zero length symlinks ever hitting the disk. It also avoids the
problem of needing to handle zero length symlink inodes in log
recovery to replay the extent free intents and the remaining
deferops to free the extents the symlink used.
Also add a couple of asserts to warn us if zero length symlinks end
up in either the symlink create or inactivation paths.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
There is a statement that has an unwanted space in the indentation.
Remove it.
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The function xfs_alloc_get_freelist calls xfs_perag_put to drop the
reference. However, pag->pagf_btreeblks is read and written after the
put operation. This patch moves the put operation later.
Signed-off-by: Pan Bian <bianpan2016@163.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
[darrick: minor changelog edits]
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
In xfs_reflink_end_cow, we allocate a single transaction for the entire
end_cow operation and then loop the CoW fork mappings to move them to
the data fork. This design fails on a heavily fragmented filesystem
where an inode's data fork has exactly one more extent than would fit in
an extents-format fork, because the unmap can collapse the data fork
into extents format (freeing the bmbt block) but the remap can expand
the data fork back into a (newly allocated) bmbt block. If the number
of extents we end up remapping is large, we can overflow the block
reservation because we reserved blocks assuming that we were adding
mappings into an already-cleared area of the data fork.
Let's say we have 8 extents in the data fork, 8 extents in the CoW fork,
and the data fork can hold at most 7 extents before needing to convert
to btree format; and that blocks A-P are discontiguous single-block
extents:
0......7
D: ABCDEFGH
C: IJKLMNOP
When a write to file blocks 0-7 completes, we must remap I-P into the
data fork. We start by removing H from the btree-format data fork. Now
we have 7 extents, so we convert the fork to extents format, freeing the
bmbt block. We then move P into the data fork and it now has 8 extents
again. We must convert the data fork back to btree format, requiring a
block allocation. If we repeat this sequence for blocks 6-5-4-3-2-1-0,
we'll need a total of 8 block allocations to remap all 8 blocks. We
reserved only enough blocks to handle one btree split (5 blocks on a 4k
block filesystem), which means we overflow the block reservation.
To fix this issue, create a separate helper function to remap a single
extent, and change _reflink_end_cow to call it in a tight loop over the
entire range we're completing. As a side effect this also removes the
size restrictions on how many extents we can end_cow at a time, though
nobody ever hit that. It is not reasonable to reserve N blocks to remap
N blocks.
Note that this can be reproduced after ~320 million fsx ops while
running generic/938 (long soak directio fsx exerciser):
XFS: Assertion failed: tp->t_blk_res >= tp->t_blk_res_used, file: fs/xfs/xfs_trans.c, line: 116
<machine registers snipped>
Call Trace:
xfs_trans_dup+0x211/0x250 [xfs]
xfs_trans_roll+0x6d/0x180 [xfs]
xfs_defer_trans_roll+0x10c/0x3b0 [xfs]
xfs_defer_finish_noroll+0xdf/0x740 [xfs]
xfs_defer_finish+0x13/0x70 [xfs]
xfs_reflink_end_cow+0x2c6/0x680 [xfs]
xfs_dio_write_end_io+0x115/0x220 [xfs]
iomap_dio_complete+0x3f/0x130
iomap_dio_rw+0x3c3/0x420
xfs_file_dio_aio_write+0x132/0x3c0 [xfs]
xfs_file_write_iter+0x8b/0xc0 [xfs]
__vfs_write+0x193/0x1f0
vfs_write+0xba/0x1c0
ksys_write+0x52/0xc0
do_syscall_64+0x50/0x160
entry_SYSCALL_64_after_hwframe+0x49/0xbe
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
When inode is corrupted so that extent type is invalid, some functions
(such as udf_truncate_extents()) will just BUG. Check that extent type
is valid when loading the inode to memory.
Reported-by: Anatoly Trosinenko <anatoly.trosinenko@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Make all the required change to start use the ib_device_ops structure.
Signed-off-by: Kamal Heib <kamalheib1@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
This patch is based on an idea from Steve Whitehouse. The idea is
to dump the number of pages for inodes in the glock dumps.
The additional locking required me to drop const from quite a few
places.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Fix the resource group wrap-around logic in gfs2_rbm_find that commit
e579ed4f44 broke. The bug can lead to unnecessary repeated scanning of the
same bitmaps; there is a risk that future changes will turn this into an
endless loop.
Fixes: e579ed4f44 ("GFS2: Introduce rbm field bii")
Cc: stable@vger.kernel.org # v3.13+
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
When FUSE_OPEN returns ENOSYS, the no_open bit is set on the connection.
Because the FUSE_RELEASE and FUSE_RELEASEDIR paths share code, this
incorrectly caused the FUSE_RELEASEDIR request to be dropped and never sent
to userspace.
Pass an isdir bool to distinguish between FUSE_RELEASE and FUSE_RELEASEDIR
inside of fuse_file_put.
Fixes: 7678ac5061 ("fuse: support clients that don't implement 'open'")
Cc: <stable@vger.kernel.org> # v3.14
Signed-off-by: Chad Austin <chadaustin@fb.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
In gfs2_create_inode, after setting and releasing the acl / default_acl, the
acl / default_acl pointers are not set to NULL as they should be. In that
state, when the function reaches label fail_free_acls, gfs2_create_inode will
try to release the same acls again.
Fix that by setting the pointers to NULL after releasing the acls. Slightly
simplify the logic. Also, posix_acl_release checks for NULL already, so
there is no need to duplicate those checks here.
Fixes: e01580bf9e ("gfs2: use generic posix ACL infrastructure")
Reported-by: Pan Bian <bianpan2016@163.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: stable@vger.kernel.org # v4.9+
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Field bd_ops was set but never used, so I removed it, and all
code supporting it.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Matthew pointed out that the ioctx_table is susceptible to spectre v1,
because the index can be controlled by an attacker. The below patch
should mitigate the attack for all of the aio system calls.
Reported-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Matthew pointed out that the ioctx_table is susceptible to spectre v1,
because the index can be controlled by an attacker. The below patch
should mitigate the attack for all of the aio system calls.
Cc: stable@vger.kernel.org
Reported-by: Matthew Wilcox <willy@infradead.org>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Since we found a problem with the 'copy-from' operation after objects have
been truncated, offloading object copies to OSDs should be discouraged
until the issue is fixed.
Thus, this patch adds the 'nocopyfrom' mount option to the default mount
options which effectily means that remote copies won't be done in
copy_file_range unless they are explicitly enabled at mount time.
[ Adjust ceph_show_options() accordingly. ]
Link: https://tracker.ceph.com/issues/37378
Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Use bio(s) to read in the journal sequentially in large chunks and
locate the head of the journal.
This version addresses the issues Christoph pointed out w.r.t error handling
and using deprecated API.
Signed-off-by: Abhi Das <adas@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>
Move and re-order the error checks and hash/crc computations into another
function __get_log_header() so it can be used in scenarios where buffer_heads
are not being used for the log header.
Signed-off-by: Abhi Das <adas@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Change gfs2_log_XXX_bio family of functions so they can be used
with different bios, not just sdp->sd_log_bio.
This patch also contains some clean up suggested by Andreas.
Signed-off-by: Abhi Das <adas@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Tells you how many milliseconds map_journal_extents and find_jhead
take.
Signed-off-by: Abhi Das <adas@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
The gfs2_is_ordered and gfs2_is_writeback checks are weird in that they
implicitly check for !gfs2_is_jdata. This makes understanding how to
use those functions correctly a challenge. Clean this up by making
gfs2_is_ordered and gfs2_is_writeback take a super block instead of an
inode and by removing the implicit !gfs2_is_jdata checks. Update the
callers accordingly.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Use the aptly named function rather than opencoding i_writecount check.
No functional changes.
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Jan Kara <jack@suse.cz>
prepare_bprm_creds is not used outside exec.c, so there's no reason for it
to have external linkage.
Signed-off-by: Chanho Min <chanho.min@lge.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
If new mode is the same as old mode we don't have to reset
inode mode in the rest of the code, so compare old and new
mode before setting update_mode flag.
Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
-----BEGIN PGP SIGNATURE-----
iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAlwNpb0eHHRvcnZhbGRz
QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGwGwH/00UHnXfxww3ixxz
zwTVDzptA6SPm6s84yJOWatM5fXhPiAltZaHSYF9lzRzNU71NCq7Frhq3fQUIXKM
OxqDn9nfSTWcjWTk2q5n2keyRV/KIn67YX7UgqFc1bO/mqtVjEgNWaMyblhI+e9E
giu1ZXayHr43jK1cDOmGExZubXUq7Vsc9TOlrd+d2SwIqeEP7TCMrPhnHDwCNvX2
UU5dtANpVzGtHaBcr37wJj+L8kODCc0f+PQ3g2ar5jTHst5SLlHp2u0AMRnUmgdi
VkGx+mu/uk8mtwUqMIMqhplklVoqK6LTeLqsY5Xt32SKruw9UqyJGdphLjW2QP/g
MkmA1lI=
=7kaD
-----END PGP SIGNATURE-----
Merge tag 'v4.20-rc6' into for-4.21/block
Pull in v4.20-rc6 to resolve the conflict in NVMe, but also to get the
two corruption fixes. We're going to be overhauling the direct dispatch
path, and we need to do that on top of the changes we made for that
in mainline.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAlwMfwoACgkQiiy9cAdy
T1FkqAv/Tty/Od1u3r/il26vogLunG9JvmC62upgfpb684isEOnYn0oaT9FElunB
C43FXF8Zw73zKv/9IqmLExNGZJ0Jm+5eCjnGX+e05QWN8ais91V2IW+4JMLwCga3
KWBD0V1SBa355cObpvWe1WzrIzm1lwa3aCO/lpENo7Ph+kZXcFVIuOPRRxwLOUU9
dkZCtvs5Pb8LVu2k1/jUXh1hqibUGR3V46txDMH6oomsLNV97xTC9URBCbpeWb3f
x4mpUeWP3DnUyK8TTr3W1MWh1jlUHg3/8yDHfCuxuluAgRFQpI8Eru18XE8abnkz
loFu9eLVT6iwJAlzmkXFF4cGi8LXqz3x4Z+1DopH2S2DBwedw9Tc1dptwOuUbpfg
SO4jMOaNXOUSnN+0pCdEiO1z/Nc0YaNg++5AvGRM0BRohpNk6X2COPocf9HvcFI8
xWajm+nkSsEaQNtXRAj/ku7cLovNgHTJt0KXQyCgxBVol19Gf5davHTe5+pPnsfQ
e+jdUek4
=wWgw
-----END PGP SIGNATURE-----
Merge tag '4.20-rc5-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fixes from Steve French:
"Three small fixes: a fix for smb3 direct i/o, a fix for CIFS DFS for
stable and a minor cifs Kconfig fix"
* tag '4.20-rc5-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
CIFS: Avoid returning EBUSY to upper layer VFS
cifs: Fix separator when building path from dentry
cifs: In Kconfig CONFIG_CIFS_POSIX needs depends on legacy (insecure cifs)
* Fix the Xarray conversion of fsdax to properly handle
dax_lock_mapping_entry() in the presense of pmd entries.
* Fix inode destruction racing a new lock request.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJcDLC5AAoJEB7SkWpmfYgC0yAP/2NULGPapIoR2aw7GKPOStHR
tOMGE8+ZDtpd19oVoEiB3PuhMkiCj5Pkyt2ka9eUXSk3VR/N6EtGwBANFUO+0tEh
yU8nca2T2G1f+MsZbraWA9zj9aPnfpLt46r7p/AjIB2j1/vyXkYMkXmkXtCvMEUi
3Q9dVsT53fWncJBlTe/WW84wWQp77VsVafN4/UP75dv8cN9F+u91P1xvUXu2AKus
RBqjlp6G6xMZ9HcWCQmStdPCi+qbO4k3oTZcohd/DYLb70ZL1kLoEMOjK2/3GnaV
YMHZGWRAK5jroDZbdXXnJHE+4/opZpSWhW/J1WFEJUSSr7YmZsWhwegmrBlQ5cp3
V0fZ7LjBWjgtUKsjlmfmu4ccLY87LzVlfIEtJH2+QNiLy/Pl/O0P/Jzg9kHVLEgo
zph2yIQlx3yshej8EEBPylDJVtYVzOyjLNLx7r3bL6AzQv+6CKqjxtYOiIDwYQsA
Db1InfeDzAAoH6NYSbqN+yG01Bz7zeCYT7Ch4rYSKcA1i8vUMR6iObI32fC3xJQm
+TxYmqngrJaU8XOMikIem/qGn4fBpYFULrIdK7lj/WmIxQq8Hp0mqz1ikMIoErzN
ZXV5RcbbvSSZi5EOrrVKbr2IfIcx+HYo/bIAkRCR69o8bcIZnLPHDky/7gl7GS/5
sc9A/kysseeF45L2+6QG
=kUil
-----END PGP SIGNATURE-----
Merge tag 'dax-fixes-4.20-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull dax fixes from Dan Williams:
"The last of the known regression fixes and fallout from the Xarray
conversion of the filesystem-dax implementation.
On the path to debugging why the dax memory-failure injection test
started failing after the Xarray conversion a couple more fixes for
the dax_lock_mapping_entry(), now called dax_lock_page(), surfaced.
Those plus the bug that started the hunt are now addressed. These
patches have appeared in a -next release with no issues reported.
Note the touches to mm/memory-failure.c are just the conversion to the
new function signature for dax_lock_page().
Summary:
- Fix the Xarray conversion of fsdax to properly handle
dax_lock_mapping_entry() in the presense of pmd entries
- Fix inode destruction racing a new lock request"
* tag 'dax-fixes-4.20-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
dax: Fix unlock mismatch with updated API
dax: Don't access a freed inode
dax: Check page->mapping isn't NULL
- Fix broken project quota inode counts
- Fix incorrect PAGE_MASK/PAGE_SIZE usage
- Fix incorrect return value in btree verifier
- Fix WARN_ON remap flags false positive
- Fix splice read overflows
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAlwGu/sACgkQ+H93GTRK
tOum2g/5AWN2OMMqbCYru3BSrp4my26/CgGgsLdfR5go+fKvSI8txAm5Wx8r2doP
aBQ0LuJaKm16bufpFx+LQDKvE/AV8yd3tL58ozPSbpPjdQFpl4DR3N95pURQJVhP
Yt3tDuPk6ODJ/pKYSbM0GTTyJ5jr+tUlKU6kmAqNzg1W644TWZVV9zFCVU202sj9
xZ6Auan3yb5wXk2vhgxPgHrVh5ngRm2G+/V8KxCVoWyP/lHUjgEcWJEu+/Jh8xOQ
6dAYxdfngQqdlXI3dpJDOkFHbqcSrhl8+fQEnp+g0RyWcKPDS+L6wk4Xlkhaorjy
nDB4GcnjCPHdJ8x/9pUlNmBrlXpAz4oNYqlQcJKssd5q9p+eTBH2drRDc5VDM5KU
xSHsdIIxQ1YewJ3uHcIbaYBsK+XcrWCtypUTC73GeGkLqS1qKCuKCEnMQOR3czeE
/Hyq6/eTSt2uG62MAOZGdIW4uaJsLTnXnfDElZ1YsdwtMEzKbYg1Ll2s86vudrRu
Otyl3EVdXQjtWRxrelA5eexspoJNoZS69D6Haqt8MGc2HJoJGTuparV6YlsEeGoW
bZFO8JV/Q0WFCa2cWRXVe+D+kx8fyFoDsKY1mR4Vwp5s0jQXp9/rQ81+zVjQ4wB2
TU53a4+hMLKLW5aNH3ge3yB9ZlHZhEzex6hrlZH3kuqpAM3Yj00=
=18EP
-----END PGP SIGNATURE-----
Merge tag 'xfs-4.20-fixes-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs fixes from Darrick Wong:
"Here are hopefully the last set of fixes for 4.20.
There's a fix for a longstanding statfs reporting problem with project
quotas, a correction for page cache invalidation behaviors when
fallocating near EOF, and a fix for a broken metadata verifier return
code.
Finally, the most important fix is to the pipe splicing code (aka the
generic copy_file_range fallback) to avoid pointless short directio
reads by only asking the filesystem for as much data as there are
available pages in the pipe buffer. Our previous fix (simulated short
directio reads because the number of pages didn't match the length of
the read requested) caused subtle problems on overlayfs, so that part
is reverted.
Anyhow, this series passes fstests -g all on xfs and overlay+xfs, and
has passed 17 billion fsx operations problem-free since I started
testing
Summary:
- Fix broken project quota inode counts
- Fix incorrect PAGE_MASK/PAGE_SIZE usage
- Fix incorrect return value in btree verifier
- Fix WARN_ON remap flags false positive
- Fix splice read overflows"
* tag 'xfs-4.20-fixes-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
iomap: partially revert 4721a60109 (simulated directio short read on EFAULT)
splice: don't read more than available pipe space
vfs: allow some remap flags to be passed to vfs_clone_file_range
xfs: fix inverted return from xfs_btree_sblock_verify_crc
xfs: fix PAGE_MASK usage in xfs_free_file_space
fs/xfs: fix f_ffree value for statfs when project quota is set
One of the goals of this series is to remove a separate reference to
the css of the bio. This can and should be accessed via bio_blkcg(). In
this patch, wbc_init_bio() now requires a bio to have a device
associated with it.
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
- spaces before tabs,
- spaces at the end of lines,
- multiple blank lines,
- blank lines before EXPORT_SYMBOL,
can all go.
Signed-off-by: NeilBrown <neilb@suse.com>
Reviewed-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
posix_unblock_lock() is not specific to posix locks, and behaves
nearly identically to locks_delete_block() - the former returning a
status while the later doesn't.
So discard posix_unblock_lock() and use locks_delete_block() instead,
after giving that function an appropriate return value.
Signed-off-by: NeilBrown <neilb@suse.com>
Reviewed-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
When we find an existing lock which conflicts with a request,
and the request wants to wait, we currently add the request
to a list. When the lock is removed, the whole list is woken.
This can cause the thundering-herd problem.
To reduce the problem, we make use of the (new) fact that
a pending request can itself have a list of blocked requests.
When we find a conflict, we look through the existing blocked requests.
If any one of them blocks the new request, the new request is attached
below that request, otherwise it is added to the list of blocked
requests, which are now known to be mutually non-conflicting.
This way, when the lock is released, only a set of non-conflicting
locks will be woken, the rest can stay asleep.
If the lock request cannot be granted and the request needs to be
requeued, all the other requests it blocks will then be woken
To make this more concrete:
If you have a many-core machine, and have many threads all wanting to
briefly lock a give file (udev is known to do this), you can get quite
poor performance.
When one thread releases a lock, it wakes up all other threads that
are waiting (classic thundering-herd) - one will get the lock and the
others go to sleep.
When you have few cores, this is not very noticeable: by the time the
4th or 5th thread gets enough CPU time to try to claim the lock, the
earlier threads have claimed it, done what was needed, and released.
So with few cores, many of the threads don't end up contending.
With 50+ cores, lost of threads can get the CPU at the same time,
and the contention can easily be measured.
This patchset creates a tree of pending lock requests in which siblings
don't conflict and each lock request does conflict with its parent.
When a lock is released, only requests which don't conflict with each
other a woken.
Testing shows that lock-acquisitions-per-second is now fairly stable
even as the number of contending process goes to 1000. Without this
patch, locks-per-second drops off steeply after a few 10s of
processes.
There is a small cost to this extra complexity.
At 20 processes running a particular test on 72 cores, the lock
acquisitions per second drops from 1.8 million to 1.4 million with
this patch. For 100 processes, this patch still provides 1.4 million
while without this patch there are about 700,000.
Reported-and-tested-by: Martin Wilck <mwilck@suse.de>
Signed-off-by: NeilBrown <neilb@suse.com>
Reviewed-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
posix_locks_conflict() and flock_locks_conflict() both return int.
leases_conflict() returns bool.
This inconsistency will cause problems for the next patch if not
fixed.
So change posix_locks_conflict() and flock_locks_conflict() to return
bool.
Also change the locks_conflict() helper.
And convert some
return (foo);
to
return foo;
Signed-off-by: NeilBrown <neilb@suse.com>
Reviewed-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Now that requests can block other requests, we
need to be careful to always clean up those blocked
requests.
Any time that we wait for a request, we might have
other requests attached, and when we stop waiting,
we must clean them up.
If the lock was granted, the requests might have been
moved to the new lock, though when merged with a
pre-exiting lock, this might not happen.
In all cases we don't want blocked locks to remain
attached, so we remove them to be safe.
Signed-off-by: NeilBrown <neilb@suse.com>
Reviewed-by: J. Bruce Fields <bfields@redhat.com>
Tested-by: syzbot+a4a3d526b4157113ec6a@syzkaller.appspotmail.com
Tested-by: kernel test robot <rong.a.chen@intel.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
EBUSY is not handled by VFS, and will be passed to user-mode. This is not
correct as we need to wait for more credits.
This patch also fixes a bug where rsize or wsize is used uninitialized when
the call to server->ops->wait_mtu_credits() fails.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Long Li <longli@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Highlights include:
Stable fixes:
- Fix a page leak when using RPCSEC_GSS/krb5p to encrypt data.
Bugfixes:
- Fix a regression that causes the RPC receive code to hang
- Fix call_connect_status() so that it handles tasks that got transmitted
while queued waiting for the socket lock.
- Fix a memory leak in call_encode()
- Fix several other connect races.
- Fix receive code error handling.
- Use the discard iterator rather than MSG_TRUNC for compatibility with
AF_UNIX/AF_LOCAL sockets.
- nfs: don't dirty kernel pages read by direct-io
- pnfs/Flexfiles fix to enforce per-mirror stateid only for NFSv4 data
servers
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJcCWIOAAoJEA4mA3inWBJc3BsP/i/VXd0ZSxxL8i/++qCR1KGT
/p0+t2HbrhPzb3jKmuaBe/6T6bLMbpmkwbesA6cHENkaPiOqxPhxLsJlh4o2BHwg
NcjAbbov/hkakFAHlp69KqiL7DZe8YEqQE8GlUnn+3C3RM3i2TSRQ3AGXUH22P2a
MY5fqiub2PmEwe2UZR8BzIEQd5w60AzTNXzQb181/+SCTOPdJTKneh0Tw54lD4d6
vWKhi64cyQxQxshCvrX6IpcNWu9qwm7qDGQ3rDAg0whunve4YGtTz1suRUk888M4
VfNxA8skFZuaQS/UU6oek2xaeMlSzEoJQXimKLYTEJKoqf7sWxfNLAfqHwnfyo4T
Yab3cfVRs5KgEltVZyodb9oVQd6KI13hYeT+vXubz2kq1Ode4NJCnzgEefOP0hNV
ENDal0hqBrfjfVIkpg/wfgRJln/W4Y/U0oPPm50eJJxa0ZKTfftBWo4me5DwCFF9
0/XhPdFWTvZsYjmSGRC1RsaSrzUvO+wFo3tKQ2lQqf8QP3ix9ZtGQHN+h8RN9SxK
ti5OxTMsfM3jYg7+yu4yOAQkcCcoaDA37+JztpuUSlMRfNss8uM7cQKsQ4WQf6Nr
24At5Wr/ib7hVkAQ5oB98UWh5q1ZLzmmHhzsf8KacTSNcfjgu0H0DmKtm3CfThFK
xfTHotzM3IqbUXRZQ7++
=M/mt
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-4.20-5' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client bugfixes from Trond Myklebust:
"This is mainly fallout from the updates to the SUNRPC code that is
being triggered from less common combinations of NFS mount options.
Highlights include:
Stable fixes:
- Fix a page leak when using RPCSEC_GSS/krb5p to encrypt data.
Bugfixes:
- Fix a regression that causes the RPC receive code to hang
- Fix call_connect_status() so that it handles tasks that got
transmitted while queued waiting for the socket lock.
- Fix a memory leak in call_encode()
- Fix several other connect races.
- Fix receive code error handling.
- Use the discard iterator rather than MSG_TRUNC for compatibility
with AF_UNIX/AF_LOCAL sockets.
- nfs: don't dirty kernel pages read by direct-io
- pnfs/Flexfiles fix to enforce per-mirror stateid only for NFSv4
data servers"
* tag 'nfs-for-4.20-5' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
SUNRPC: Don't force a redundant disconnection in xs_read_stream()
SUNRPC: Fix up socket polling
SUNRPC: Use the discard iterator rather than MSG_TRUNC
SUNRPC: Treat EFAULT as a truncated message in xs_read_stream_request()
SUNRPC: Fix up handling of the XDRBUF_SPARSE_PAGES flag
SUNRPC: Fix RPC receive hangs
SUNRPC: Fix a potential race in xprt_connect()
SUNRPC: Fix a memory leak in call_encode()
SUNRPC: Fix leak of krb5p encode pages
SUNRPC: call_connect_status() must handle tasks that got transmitted
nfs: don't dirty kernel pages read by direct-io
flexfiles: enforce per-mirror stateid only for v4 DSes
struct timespec is not y2038 safe.
struct __kernel_timespec is the new y2038 safe structure for all
syscalls that are using struct timespec.
Update io_pgetevents interfaces to use struct __kernel_timespec.
sigset_t also has different representations on 32 bit and 64 bit
architectures. Hence, we need to support the following different
syscalls:
New y2038 safe syscalls:
(Controlled by CONFIG_64BIT_TIME for 32 bit ABIs)
Native 64 bit(unchanged) and native 32 bit : sys_io_pgetevents
Compat : compat_sys_io_pgetevents_time64
Older y2038 unsafe syscalls:
(Controlled by CONFIG_32BIT_COMPAT_TIME for 32 bit ABIs)
Native 32 bit : sys_io_pgetevents_time32
Compat : compat_sys_io_pgetevents
Note that io_getevents syscalls do not have a y2038 safe solution.
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
struct timespec is not y2038 safe.
struct __kernel_timespec is the new y2038 safe structure for all
syscalls that are using struct timespec.
Update pselect interfaces to use struct __kernel_timespec.
sigset_t also has different representations on 32 bit and 64 bit
architectures. Hence, we need to support the following different
syscalls:
New y2038 safe syscalls:
(Controlled by CONFIG_64BIT_TIME for 32 bit ABIs)
Native 64 bit(unchanged) and native 32 bit : sys_pselect6
Compat : compat_sys_pselect6_time64
Older y2038 unsafe syscalls:
(Controlled by CONFIG_32BIT_COMPAT_TIME for 32 bit ABIs)
Native 32 bit : pselect6_time32
Compat : compat_sys_pselect6
Note that all other versions of select syscalls will not have
y2038 safe versions.
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
struct timespec is not y2038 safe.
struct __kernel_timespec is the new y2038 safe structure for all
syscalls that are using struct timespec.
Update ppoll interfaces to use struct __kernel_timespec.
sigset_t also has different representations on 32 bit and 64 bit
architectures. Hence, we need to support the following different
syscalls:
New y2038 safe syscalls:
(Controlled by CONFIG_64BIT_TIME for 32 bit ABIs)
Native 64 bit(unchanged) and native 32 bit : sys_ppoll
Compat : compat_sys_ppoll_time64
Older y2038 unsafe syscalls:
(Controlled by CONFIG_32BIT_COMPAT_TIME for 32 bit ABIs)
Native 32 bit : ppoll_time32
Compat : compat_sys_ppoll
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Refactor the logic to restore the sigmask before the syscall
returns into an api.
This is useful for versions of syscalls that pass in the
sigmask and expect the current->sigmask to be changed during
the execution and restored after the execution of the syscall.
With the advent of new y2038 syscalls in the subsequent patches,
we add two more new versions of the syscalls (for pselect, ppoll
and io_pgetevents) in addition to the existing native and compat
versions. Adding such an api reduces the logic that would need to
be replicated otherwise.
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Refactor reading sigset from userspace and updating sigmask
into an api.
This is useful for versions of syscalls that pass in the
sigmask and expect the current->sigmask to be changed during,
and restored after, the execution of the syscall.
With the advent of new y2038 syscalls in the subsequent patches,
we add two more new versions of the syscalls (for pselect, ppoll,
and io_pgetevents) in addition to the existing native and compat
versions. Adding such an api reduces the logic that would need to
be replicated otherwise.
Note that the calls to sigprocmask() ignored the return value
from the api as the function only returns an error on an invalid
first argument that is hardcoded at these call sites.
The updated logic uses set_current_blocked() instead.
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Make sure to use the CIFS_DIR_SEP(cifs_sb) as path separator for
prefixpath too. Fixes a bug with smb1 UNIX extensions.
Fixes: a6b5058faf ("fs/cifs: make share unaccessible at root level mountable")
Signed-off-by: Paulo Alcantara <palcantara@suse.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
CC: Stable <stable@vger.kernel.org>
Missing a dependency. Shouldn't show cifs posix extensions
in Kconfig if CONFIG_CIFS_ALLOW_INSECURE_DIALECTS (ie SMB1
protocol) is disabled.
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAlwH0k8ACgkQxWXV+ddt
WDtmVg/+Kgvk7laQI9bLEr1/30eG1JfBUMHcVE1F8+g99l28m1Yihjd21j9norVd
YexBz53jgKou+zV+37CKWBYT1uDPq7CIoxctkdE2j9U0s+RmsqDrhech0dsBsfMR
jo9VnHJFuJSxGMhjfGnFV+wMtAr4q5aQptNGBl+hR1MvMneroktFv+0WiLmp0Vhj
+6Iq9WAClJYpgk//cI7nhKkscdzWwRyN3V9RUtdNeYklk1D7l1WprlaPzw22WA9u
VjQVMICjEaJeIixIwT/D8lz05QgjKlqy1z6faYG5JuJxoYQikuNv/xe2dhZVm35A
aNsBR0byf3zzuXKQZAlvXJ6/gYPvep+KI7epPyBOdycaqoZza7rQ+/MkSAgQ77Vk
yBnQuhqiw9Srjh6LDWFkNclVln2wymRKd1SqpZmFPRZre/8L+DU+I8RRaeS2/WcE
M2k+awRD0oVofbB+hxkFIoR+I1Ggkp2rxQlTT/41tGx0geWC3AGX+TlKSW6ZM5HD
lRmRXIsVocfighKEnI3Zy7ecZuwCI4/4D6+PQtyhCJb3tDigZ/a4UEYdSVucG8CG
SuQ5YMn+MyyKT0wH8xkGKDGT15YZ+u9Q/BmPHZRL6sSouFpiCQHA5miD1YA+t1d9
qMjH6Ycz46Y3j2M0BDfDcm714zoD5/bgeSy5SPC3Zh5lQCGpeIk=
=VW/F
-----END PGP SIGNATURE-----
Merge tag 'for-4.20-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fix from David Sterba:
"A patch in 4.19 introduced a sanity check that was too strict and a
filesystem cannot be mounted.
This happens for filesystems with more than 10 devices and has been
reported by a few users so we need the fix to propagate to stable"
* tag 'for-4.20-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: tree-checker: Don't check max block group size as current max chunk size limit is unreliable
As a precaution, make sure we check event_len when copying to userspace.
Based on old feedback: https://lkml.kernel.org/r/542D9FE5.3010009@gmx.de
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Internal to dax_unlock_mapping_entry(), dax_unlock_entry() is used to
store a replacement entry in the Xarray at the given xas-index with the
DAX_LOCKED bit clear. When called, dax_unlock_entry() expects the unlocked
value of the entry relative to the current Xarray state to be specified.
In most contexts dax_unlock_entry() is operating in the same scope as
the matched dax_lock_entry(). However, in the dax_unlock_mapping_entry()
case the implementation needs to recall the original entry. In the case
where the original entry is a 'pmd' entry it is possible that the pfn
performed to do the lookup is misaligned to the value retrieved in the
Xarray.
Change the api to return the unlock cookie from dax_lock_page() and pass
it to dax_unlock_page(). This fixes a bug where dax_unlock_page() was
assuming that the page was PMD-aligned if the entry was a PMD entry with
signatures like:
WARNING: CPU: 38 PID: 1396 at fs/dax.c:340 dax_insert_entry+0x2b2/0x2d0
RIP: 0010:dax_insert_entry+0x2b2/0x2d0
[..]
Call Trace:
dax_iomap_pte_fault.isra.41+0x791/0xde0
ext4_dax_huge_fault+0x16f/0x1f0
? up_read+0x1c/0xa0
__do_fault+0x1f/0x160
__handle_mm_fault+0x1033/0x1490
handle_mm_fault+0x18b/0x3d0
Link: https://lkml.kernel.org/r/20181130154902.GL10377@bombadil.infradead.org
Fixes: 9f32d22130 ("dax: Convert dax_lock_mapping_entry to XArray")
Reported-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Tested-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
As the man(2) page for utime/utimes states, EPERM is returned when the
second parameter of utime or utimes is not NULL, the caller's effective UID
does not match the owner of the file, and the caller is not privileged.
However, in a NFS directory mounted from knfsd, it will return EACCES
(from nfsd_setattr-> fh_verify->nfsd_permission). This patch fixes
that.
Signed-off-by: zhengbin <zhengbin13@huawei.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
In commit 4721a60109, we tried to fix a problem wherein directio reads
into a splice pipe will bounce EFAULT/EAGAIN all the way out to
userspace by simulating a zero-byte short read. This happens because
some directio read implementations (xfs) will call
bio_iov_iter_get_pages to grab pipe buffer pages and issue asynchronous
reads, but as soon as we run out of pipe buffers that _get_pages call
returns EFAULT, which the splice code translates to EAGAIN and bounces
out to userspace.
In that commit, the iomap code catches the EFAULT and simulates a
zero-byte read, but that causes assertion errors on regular splice reads
because xfs doesn't allow short directio reads. This causes infinite
splice() loops and assertion failures on generic/095 on overlayfs
because xfs only permit total success or total failure of a directio
operation. The underlying issue in the pipe splice code has now been
fixed by changing the pipe splice loop to avoid avoid reading more data
than there is space in the pipe.
Therefore, it's no longer necessary to simulate the short directio, so
remove the hack from iomap.
Fixes: 4721a60109 ("iomap: dio data corruption and spurious errors when pipes fill")
Reported-by: Murphy Zhou <jencce.kernel@gmail.com>
Ranted-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
In commit 4721a60109, we tried to fix a problem wherein directio reads
into a splice pipe will bounce EFAULT/EAGAIN all the way out to
userspace by simulating a zero-byte short read. This happens because
some directio read implementations (xfs) will call
bio_iov_iter_get_pages to grab pipe buffer pages and issue asynchronous
reads, but as soon as we run out of pipe buffers that _get_pages call
returns EFAULT, which the splice code translates to EAGAIN and bounces
out to userspace.
In that commit, the iomap code catches the EFAULT and simulates a
zero-byte read, but that causes assertion errors on regular splice reads
because xfs doesn't allow short directio reads.
The brokenness is compounded by splice_direct_to_actor immediately
bailing on do_splice_to returning <= 0 without ever calling ->actor
(which empties out the pipe), so if userspace calls back we'll EFAULT
again on the full pipe, and nothing ever gets copied.
Therefore, teach splice_direct_to_actor to clamp its requests to the
amount of free space in the pipe and remove the simulated short read
from the iomap directio code.
Fixes: 4721a60109 ("iomap: dio data corruption and spurious errors when pipes fill")
Reported-by: Murphy Zhou <jencce.kernel@gmail.com>
Ranted-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
In overlayfs, ovl_remap_file_range calls vfs_clone_file_range on the
lower filesystem's inode, passing through whatever remap flags it got
from its caller. Since vfs_copy_file_range first tries a filesystem's
remap function with REMAP_FILE_CAN_SHORTEN, this can get passed through
to the second vfs_copy_file_range call, and this isn't an issue.
Change the WARN_ON to look only for the DEDUP flag.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
xfs_btree_sblock_verify_crc is a bool so should not be returning
a failaddr_t; worse, if xfs_log_check_lsn fails it returns
__this_address which looks like a boolean true (i.e. success)
to the caller.
(interestingly xfs_btree_lblock_verify_crc doesn't have the issue)
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
In commit e53c4b598, I *tried* to teach xfs to force writeback when we
fzero/fpunch right up to EOF so that if EOF is in the middle of a page,
the post-EOF part of the page gets zeroed before we return to userspace.
Unfortunately, I missed the part where PAGE_MASK is ~(PAGE_SIZE - 1),
which means that we totally fail to zero if we're fpunching and EOF is
within the first page. Worse yet, the same PAGE_MASK thinko plagues the
filemap_write_and_wait_range call, so we'd initiate writeback of the
entire file, which (mostly) masked the thinko.
Drop the tricky PAGE_MASK and replace it with correct usage of PAGE_SIZE
and the proper rounding macros.
Fixes: e53c4b598 ("xfs: ensure post-EOF zeroing happens after zeroing part of a file")
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
No one is going to poll for aio (yet), so we must clear the HIPRI
flag, as we would otherwise send it down the poll queues, where no
one will be polling for completions.
Signed-off-by: Christoph Hellwig <hch@lst.de>
IOCB_HIPRI, not RWF_HIPRI.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAlwEZdIeHHRvcnZhbGRz
QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGAlQH/19oax2Za3IPqF4X
DM3lal5M6zlUVkoYstqzpbR3MqUwgEnMfvoeMDC6mI9N4/+r2LkV7cRR8HzqQCCS
jDfD69IzRGb52VSeJmbOrkxBWsR1Nn0t4Z3rEeLPxwaOoNpRc8H973MbAQ2FKMpY
S4Y3jIK1dNiRRxdh52NupVkQF+djAUwkBuVk/rrvRJmTDij4la03cuCDAO+Di9lt
GHlVvygKw2SJhDR+z3ArwZNmE0ceCcE6+W7zPHzj2KeWuKrZg22kfUD454f2YEIw
FG0hu9qecgtpYCkLSm2vr4jQzmpsDoyq3ZfwhjGrP4qtvPC3Db3vL3dbQnkzUcJu
JtwhVCE=
=O1q1
-----END PGP SIGNATURE-----
Merge tag 'v4.20-rc5' into for-4.21/block
Pull in v4.20-rc5, solving a conflict we'll otherwise get in aio.c and
also getting the merge fix that went into mainline that users are
hitting testing for-4.21/block and/or for-next.
* tag 'v4.20-rc5': (664 commits)
Linux 4.20-rc5
PCI: Fix incorrect value returned from pcie_get_speed_cap()
MAINTAINERS: Update linux-mips mailing list address
ocfs2: fix potential use after free
mm/khugepaged: fix the xas_create_range() error path
mm/khugepaged: collapse_shmem() do not crash on Compound
mm/khugepaged: collapse_shmem() without freezing new_page
mm/khugepaged: minor reorderings in collapse_shmem()
mm/khugepaged: collapse_shmem() remember to clear holes
mm/khugepaged: fix crashes due to misaccounted holes
mm/khugepaged: collapse_shmem() stop if punched or truncated
mm/huge_memory: fix lockdep complaint on 32-bit i_size_read()
mm/huge_memory: splitting set mapping+index before unfreeze
mm/huge_memory: rename freeze_page() to unmap_page()
initramfs: clean old path before creating a hardlink
kernel/kcov.c: mark funcs in __sanitizer_cov_trace_pc() as notrace
psi: make disabling/enabling easier for vendor kernels
proc: fixup map_files test on arm
debugobjects: avoid recursive calls with kmemleak
userfaultfd: shmem: UFFDIO_COPY: set the page dirty if VM_WRITE is not set
...
Revert commit c22397888f "exec: make de_thread() freezable" as
requested by Ingo Molnar:
"So there's a new regression in v4.20-rc4, my desktop produces this
lockdep splat:
[ 1772.588771] WARNING: pkexec/4633 still has locks held!
[ 1772.588773] 4.20.0-rc4-custom-00213-g93a49841322b #1 Not tainted
[ 1772.588775] ------------------------------------
[ 1772.588776] 1 lock held by pkexec/4633:
[ 1772.588778] #0: 00000000ed85fbf8 (&sig->cred_guard_mutex){+.+.}, at: prepare_bprm_creds+0x2a/0x70
[ 1772.588786] stack backtrace:
[ 1772.588789] CPU: 7 PID: 4633 Comm: pkexec Not tainted 4.20.0-rc4-custom-00213-g93a49841322b #1
[ 1772.588792] Call Trace:
[ 1772.588800] dump_stack+0x85/0xcb
[ 1772.588803] flush_old_exec+0x116/0x890
[ 1772.588807] ? load_elf_phdrs+0x72/0xb0
[ 1772.588809] load_elf_binary+0x291/0x1620
[ 1772.588815] ? sched_clock+0x5/0x10
[ 1772.588817] ? search_binary_handler+0x6d/0x240
[ 1772.588820] search_binary_handler+0x80/0x240
[ 1772.588823] load_script+0x201/0x220
[ 1772.588825] search_binary_handler+0x80/0x240
[ 1772.588828] __do_execve_file.isra.32+0x7d2/0xa60
[ 1772.588832] ? strncpy_from_user+0x40/0x180
[ 1772.588835] __x64_sys_execve+0x34/0x40
[ 1772.588838] do_syscall_64+0x60/0x1c0
The warning gets triggered by an ancient lockdep check in the freezer:
(gdb) list *0xffffffff812ece06
0xffffffff812ece06 is in flush_old_exec (./include/linux/freezer.h:57).
52 * DO NOT ADD ANY NEW CALLERS OF THIS FUNCTION
53 * If try_to_freeze causes a lockdep warning it means the caller may deadlock
54 */
55 static inline bool try_to_freeze_unsafe(void)
56 {
57 might_sleep();
58 if (likely(!freezing(current)))
59 return false;
60 return __refrigerator(false);
61 }
I reviewed the ->cred_guard_mutex code, and the mutex is held across all
of exec() - and we always did this.
But there's this recent -rc4 commit:
> Chanho Min (1):
> exec: make de_thread() freezable
c22397888f: exec: make de_thread() freezable
I believe this commit is bogus, you cannot call try_to_freeze() from
de_thread(), because it's holding the ->cred_guard_mutex."
Reported-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
[BUG]
A completely valid btrfs will refuse to mount, with error message like:
BTRFS critical (device sdb2): corrupt leaf: root=2 block=239681536 slot=172 \
bg_start=12018974720 bg_len=10888413184, invalid block group size, \
have 10888413184 expect (0, 10737418240]
This has been reported several times as the 4.19 kernel is now being
used. The filesystem refuses to mount, but is otherwise ok and booting
4.18 is a workaround.
Btrfs check returns no error, and all kernels used on this fs is later
than 2011, which should all have the 10G size limit commit.
[CAUSE]
For a 12 devices btrfs, we could allocate a chunk larger than 10G due to
stripe stripe bump up.
__btrfs_alloc_chunk()
|- max_stripe_size = 1G
|- max_chunk_size = 10G
|- data_stripe = 11
|- if (1G * 11 > 10G) {
stripe_size = 976128930;
stripe_size = round_up(976128930, SZ_16M) = 989855744
However the final stripe_size (989855744) * 11 = 10888413184, which is
still larger than 10G.
[FIX]
For the comprehensive check, we need to do the full check at chunk read
time, and rely on bg <-> chunk mapping to do the check.
We could just skip the length check for now.
Fixes: fce466eab7 ("btrfs: tree-checker: Verify block_group_item")
Cc: stable@vger.kernel.org # v4.19+
Reported-by: Wang Yugui <wangyugui@e16-tech.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This reverts commit 007ea44892.
The commit broke some selinux-testsuite cases, and it looks like there's no
straightforward fix keeping the direction of this patch, so revert for now.
The original patch was trying to fix the consistency of permission checks, and
not an observed bug. So reverting should be safe.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Pull RCU changes from Paul E. McKenney:
- Convert RCU's BUG_ON() and similar calls to WARN_ON() and similar.
- Replace calls of RCU-bh and RCU-sched update-side functions
to their vanilla RCU counterparts. This series is a step
towards complete removal of the RCU-bh and RCU-sched update-side
functions.
( Note that some of these conversions are going upstream via their
respective maintainers. )
- Documentation updates, including a number of flavor-consolidation
updates from Joel Fernandes.
- Miscellaneous fixes.
- Automate generation of the initrd filesystem used for
rcutorture testing.
- Convert spin_is_locked() assertions to instead use lockdep.
( Note that some of these conversions are going upstream via their
respective maintainers. )
- SRCU updates, especially including a fix from Dennis Krein
for a bag-on-head-class bug.
- RCU torture-test updates.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Commit e2b911c535 ("ext4: clean up feature test macros with
predicate functions") broke the EXT4_IOC_GROUP_ADD ioctl. This was
not noticed since only very old versions of resize2fs (before
e2fsprogs 1.42) use this ioctl. However, using a new kernel with an
enterprise Linux userspace will cause attempts to use online resize to
fail with "No reserved GDT blocks".
Fixes: e2b911c535 ("ext4: clean up feature test macros with predicate...")
Cc: stable@kernel.org # v4.4
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: ruippan (潘睿) <ruippan@tencent.com>
As dax inches closer to production use, an administrator should not
be surprised by silently disabling the feature they asked for.
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
ext4_xattr_destroy_cache() can handle NULL pointer correctly,
so there is no need to check NULL pointer before calling
ext4_xattr_destroy_cache().
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
There is a statement that is indented with spaces, replace it with
a tab.
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
There are several lines that are indented too far, clean these
up by removing the tabs.
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
In case of error, ext4_try_to_write_inline_data() should unlock
and release the page it holds.
Fixes: f19d5870cb ("ext4: add normal write support for inline data")
Cc: stable@kernel.org # 3.8
Signed-off-by: Maurizio Lombardi <mlombard@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
The function frees qf_inode via iput but then pass qf_inode to
lockdep_set_quota_inode on the failure path. This may result in a
use-after-free bug. The patch frees df_inode only when it is never used.
Fixes: daf647d2dd ("ext4: add lockdep annotations for i_data_sem")
Cc: stable@kernel.org # 4.6
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Pan Bian <bianpan2016@163.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
We can hold j_state_lock for writing at the beginning of
jbd2_journal_commit_transaction() for a rather long time (reportedly for
30 ms) due cleaning revoke bits of all revoked buffers under it. The
handling of revoke tables as well as cleaning of t_reserved_list, and
checkpoint lists does not need j_state_lock for anything. It is only
needed to prevent new handles from joining the transaction. Generally
T_LOCKED transaction state prevents new handles from joining the
transaction - except for reserved handles which have to allowed to join
while we wait for other handles to complete.
To prevent reserved handles from joining the transaction while cleaning
up lists, add new transaction state T_SWITCH and watch for it when
starting reserved handles. With this we can just drop the lock for
operations that don't need it.
Reported-and-tested-by: Adrian Hunter <adrian.hunter@intel.com>
Suggested-by: "Theodore Y. Ts'o" <tytso@mit.edu>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Given corruption in the ftrace records, it might be possible to allocate
tmp_prz without assigning prz to it, but still marking it as needing to
be freed, which would cause at least a NULL dereference.
smatch warnings:
fs/pstore/ram.c:340 ramoops_pstore_read() error: we previously assumed 'prz' could be null (see line 255)
https://lists.01.org/pipermail/kbuild-all/2018-December/055528.html
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Fixes: 2fbea82bbb ("pstore: Merge per-CPU ftrace records into one")
Cc: "Joel Fernandes (Google)" <joel@joelfernandes.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Instead of running with interrupts disabled, use a semaphore. This should
make it easier for backends that may need to sleep (e.g. EFI) when
performing a write:
|BUG: sleeping function called from invalid context at kernel/sched/completion.c:99
|in_atomic(): 1, irqs_disabled(): 1, pid: 2236, name: sig-xstate-bum
|Preemption disabled at:
|[<ffffffff99d60512>] pstore_dump+0x72/0x330
|CPU: 26 PID: 2236 Comm: sig-xstate-bum Tainted: G D 4.20.0-rc3 #45
|Call Trace:
| dump_stack+0x4f/0x6a
| ___might_sleep.cold.91+0xd3/0xe4
| __might_sleep+0x50/0x90
| wait_for_completion+0x32/0x130
| virt_efi_query_variable_info+0x14e/0x160
| efi_query_variable_store+0x51/0x1a0
| efivar_entry_set_safe+0xa3/0x1b0
| efi_pstore_write+0x109/0x140
| pstore_dump+0x11c/0x330
| kmsg_dump+0xa4/0xd0
| oops_exit+0x22/0x30
...
Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Fixes: 21b3ddd39f ("efi: Don't use spinlocks for efi vars")
Signed-off-by: Kees Cook <keescook@chromium.org>
Bool initializations should use true and false. Bool tests don't need
comparisons.
Signed-off-by: Thomas Meyer <thomas@m3y3r.de>
Signed-off-by: Kees Cook <keescook@chromium.org>
The ramoops backend currently calls persistent_ram_save_old() even
if a buffer is empty. While this appears to work, it is does not seem
like the right thing to do and could lead to future bugs so lets avoid
that. It also prevents misleading prints in the logs which claim the
buffer is valid.
I got something like:
found existing buffer, size 0, start 0
When I was expecting:
no valid data in buffer (sig = ...)
This bails out early (and reports with pr_debug()), since it's an
acceptable state.
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Co-developed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
(1) remove type argument from ramoops_get_next_prz()
Since we store the type of the prz when we initialize it, we no longer
need to pass it again in ramoops_get_next_prz() since we can just use
that to setup the pstore record. So lets remove it from the argument list.
(2) remove max argument from ramoops_get_next_prz()
Looking at the code flow, the 'max' checks are already being done on
the prz passed to ramoops_get_next_prz(). Lets remove it to simplify
this function and reduce its arguments.
(3) further reduce ramoops_get_next_prz() arguments by passing record
Both the id and type fields of a pstore_record are set by
ramoops_get_next_prz(). So we can just pass a pointer to the pstore_record
instead of passing individual elements. This results in cleaner more
readable code and fewer lines.
In addition lets also remove the 'update' argument since we can detect
that. Changes are squashed into a single patch to reduce fixup conflicts.
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
In later patches we will need to map types to names, so create a
constant table for that which can also be used in different parts of
old and new code. This saves the type in the PRZ which will be useful
in later patches.
Instead of having an explicit PSTORE_TYPE_UNKNOWN, just use ..._MAX.
This includes removing the now redundant filename templates which can use
a single format string. Also, there's no reason to limit the "is it still
compressed?" test to only PSTORE_TYPE_DMESG when building the pstorefs
filename. Records are zero-initialized, so a backend would need to have
explicitly set compressed=1.
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Co-developed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
This improves and updates some comments:
- dump handler comment out of sync from calling convention
- fix kern-doc typo
and improves status output:
- reminder that only kernel crash dumps are compressed
- do not be silent about ECC infrastructure failures
Signed-off-by: Kees Cook <keescook@chromium.org>
In order to more easily perform automated regression testing, this
adds pr_debug() calls to report each prz allocation which can then be
verified against persistent storage. Specifically, seeing the dividing
line between header, data, any ECC bytes. (And the general assignment
output is updated to remove the bogus ECC blocksize which isn't actually
recorded outside the prz instance.)
Signed-off-by: Kees Cook <keescook@chromium.org>
With both ram.c and ram_core.c built into ramoops.ko, it doesn't make
sense to have differing pr_fmt prefixes. This fixes ram_core.c to use
the module name (as ram.c already does). Additionally improves region
reservation error to include the region name.
Signed-off-by: Kees Cook <keescook@chromium.org>
When initialing a prz, if invalid data is found (no PERSISTENT_RAM_SIG),
the function call path looks like this:
ramoops_init_prz ->
persistent_ram_new -> persistent_ram_post_init -> persistent_ram_zap
persistent_ram_zap
As we can see, persistent_ram_zap() is called twice.
We can avoid this by adding an option to persistent_ram_new(), and
only call persistent_ram_zap() when it is needed.
Signed-off-by: Peng Wang <wangpeng15@xiaomi.com>
[kees: minor tweak to exit path and commit log]
Signed-off-by: Kees Cook <keescook@chromium.org>
Since the console writer does not use the preallocated crash dump buffer
any more, there is no reason to perform locking around it.
Fixes: 70ad35db33 ("pstore: Convert console write to use ->write_buf")
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
The pre-allocated compression buffer used for crash dumping was also
being used for decompression. This isn't technically safe, since it's
possible the kernel may attempt a crashdump while pstore is populating the
pstore filesystem (and performing decompression). Instead, just allocate
a separate buffer for decompression. Correctness is preferred over
performance here.
Signed-off-by: Kees Cook <keescook@chromium.org>
The warning added in commit 3b0e761ba8
"dlm: print log message when cluster name is not set"
did not account for the fact that lockspaces created
from userland do not supply a cluster name, so bogus
warnings are printed every time a userland lockspace
is created.
Signed-off-by: David Teigland <teigland@redhat.com>
Let the passed in array be const (and thus placed in rodata) instead of
a mutable array of const pointers.
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20181004143750.30880-1-jani.nikula@intel.com
NULL check before some freeing functions is not needed.
Signed-off-by: Thomas Meyer <thomas@m3y3r.de>
Signed-off-by: David Teigland <teigland@redhat.com>
fuse_invalidate_attr() now sets fi->inval_mask instead of fi->i_time, hence
we need to check the inval mask in fuse_permission() as well.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: 2f1e81965f ("fuse: allow fine grained attr cache invaldation")
Commit ab2257e994 ("fuse: reduce size of struct fuse_inode") moved parts
of fields related to writeback on regular file and to directory caching
into a union. However fuse_fsync_common() called from fuse_dir_fsync()
touches some writeback related fields, resulting in a crash.
Move writeback related parts from fuse_fsync_common() to fuse_fysnc().
Reported-by: Brett Girton <btgirton@gmail.com>
Tested-by: Brett Girton <btgirton@gmail.com>
Fixes: ab2257e994 ("fuse: reduce size of struct fuse_inode")
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Since commit bb21ce0ad2 we always enforce per-mirror stateid.
However, this makes sense only for v4+ servers.
Signed-off-by: Tigran Mkrtchyan <tigran.mkrtchyan@desy.de>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
jffs2_sync_fs makes the assumption that if CONFIG_JFFS2_FS_WRITEBUFFER
is defined then a write buffer is available and has been initialized.
However, this does is not the case when the mtd device has no
out-of-band buffer:
int jffs2_nand_flash_setup(struct jffs2_sb_info *c)
{
if (!c->mtd->oobsize)
return 0;
...
The resulting call to cancel_delayed_work_sync passing a uninitialized
(but zeroed) delayed_work struct forces lockdep to become disabled.
[ 90.050639] overlayfs: upper fs does not support tmpfile.
[ 90.652264] INFO: trying to register non-static key.
[ 90.662171] the code is fine but needs lockdep annotation.
[ 90.673090] turning off the locking correctness validator.
[ 90.684021] CPU: 0 PID: 1762 Comm: mount_root Not tainted 4.14.63 #0
[ 90.696672] Stack : 00000000 00000000 80d8f6a2 00000038 805f0000 80444600 8fe364f4 805dfbe7
[ 90.713349] 80563a30 000006e2 8068370c 00000001 00000000 00000001 8e2fdc48 ffffffff
[ 90.730020] 00000000 00000000 80d90000 00000000 00000106 00000000 6465746e 312e3420
[ 90.746690] 6b636f6c 03bf0000 f8000000 20676e69 00000000 80000000 00000000 8e2c2a90
[ 90.763362] 80d90000 00000001 00000000 8e2c2a90 00000003 80260dc0 08052098 80680000
[ 90.780033] ...
[ 90.784902] Call Trace:
[ 90.789793] [<8000f0d8>] show_stack+0xb8/0x148
[ 90.798659] [<8005a000>] register_lock_class+0x270/0x55c
[ 90.809247] [<8005cb64>] __lock_acquire+0x13c/0xf7c
[ 90.818964] [<8005e314>] lock_acquire+0x194/0x1dc
[ 90.828345] [<8003f27c>] flush_work+0x200/0x24c
[ 90.837374] [<80041dfc>] __cancel_work_timer+0x158/0x210
[ 90.847958] [<801a8770>] jffs2_sync_fs+0x20/0x54
[ 90.857173] [<80125cf4>] iterate_supers+0xf4/0x120
[ 90.866729] [<80158fc4>] sys_sync+0x44/0x9c
[ 90.875067] [<80014424>] syscall_common+0x34/0x58
Signed-off-by: Daniel Santos <daniel.santos@pobox.com>
Reviewed-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Boris Brezillon <boris.brezillon@bootlin.com>
bug.2018.11.12a: Get rid of BUG_ON() and friends
consolidate.2018.12.01a: Continued RCU flavor-consolidation cleanup
doc.2018.11.12a: Documentation updates
fixes.2018.11.12a: Miscellaneous fixes
initrd.2018.11.08b: Automate creation of rcutorture initrd
sil.2018.11.12a: Remove more spin_unlock_wait() calls
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAlwC1c4QHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgppxmD/4pqn8REEh/QUXWhCJbOXLLLxfQju7Uxs/v
j2Bc6W/e7Z9jvKAs06IIhaV6SxBrM0oUebf/hJY0E/kTSHiNPJqx/X3W9hFYOo+p
EJau3vavOrxVzgq5zt8S/i//HeanT+H37nE9WDqSRKXTta8JFDw+DoysepILTUvN
WGDjuplPcurwmf2W1qES+5vNy/Jpln9ErNuqPBSjc6shozQ8WAzvuupVs+uZEpeK
+gqrx0pJYrtoU+pSUK+Bt6bSzzp8Z0qHGIVMAabNULbz43qblK0ILRE+qLFbFwsB
62EMMtX9b2Lsvqpoe2cQ+deQlUalsGVmpyE+7GP/evZbVmtD/NoH6cJQ/dA/tFtw
cluL3rWBJKB5OZ1yatDE2/rUYsGo5FzqMUz/tIWSf2FdZcLfhRNLka7DueSA6NQe
wtLJU9GrME67+t+PqncjDxoyQYma4oynAcc5dfqlBQv5OP7HDf4TP28g8FdkHjcy
fEXAp58516YZiCpoWZf6dPR9fUQ0A1eF+qxHnUacy5tHN4AKPrccU3+k+0WStFNf
qaOPkj4kWtv17d2DO4UoqAtBqFO16QCYSsa5+drpDeTOq9QgGqA6O+sGngN0LsxS
F7x3msgBIkgEFYFtpuMBXnamdooiZMKrzI0Ctn7PK8b5Qx1OgRNCZcTQD4uql1Fj
L6R/6Ynibg==
=lMlT
-----END PGP SIGNATURE-----
Merge tag 'for-linus-20181201' of git://git.kernel.dk/linux-block
Pull block layer fixes from Jens Axboe:
- Single range elevator discard merge fix, that caused crashes (Ming)
- Fix for a regression in O_DIRECT, where we could potentially lose the
error value (Maximilian Heyne)
- NVMe pull request from Christoph, with little fixes all over the map
for NVMe.
* tag 'for-linus-20181201' of git://git.kernel.dk/linux-block:
block: fix single range discard merge
nvme-rdma: fix double freeing of async event data
nvme: flush namespace scanning work just before removing namespaces
nvme: warn when finding multi-port subsystems without multipathing enabled
fs: fix lost error code in dio_complete
nvme-pci: fix surprise removal
nvme-fc: initialize nvme_req(rq)->ctrl after calling __nvme_fc_init_request()
nvme: Free ctrl device name on init failure
Merge misc fixes from Andrew Morton:
"31 fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (31 commits)
ocfs2: fix potential use after free
mm/khugepaged: fix the xas_create_range() error path
mm/khugepaged: collapse_shmem() do not crash on Compound
mm/khugepaged: collapse_shmem() without freezing new_page
mm/khugepaged: minor reorderings in collapse_shmem()
mm/khugepaged: collapse_shmem() remember to clear holes
mm/khugepaged: fix crashes due to misaccounted holes
mm/khugepaged: collapse_shmem() stop if punched or truncated
mm/huge_memory: fix lockdep complaint on 32-bit i_size_read()
mm/huge_memory: splitting set mapping+index before unfreeze
mm/huge_memory: rename freeze_page() to unmap_page()
initramfs: clean old path before creating a hardlink
kernel/kcov.c: mark funcs in __sanitizer_cov_trace_pc() as notrace
psi: make disabling/enabling easier for vendor kernels
proc: fixup map_files test on arm
debugobjects: avoid recursive calls with kmemleak
userfaultfd: shmem: UFFDIO_COPY: set the page dirty if VM_WRITE is not set
userfaultfd: shmem: add i_size checks
userfaultfd: shmem/hugetlbfs: only allow to register VM_MAYWRITE vmas
userfaultfd: shmem: allocate anonymous memory for MAP_PRIVATE shmem
...
-----BEGIN PGP SIGNATURE-----
iQIVAwUAXAFfSPu3V2unywtrAQKPQRAAiHDs2d35Kc2qkTFLwGiP+wr+3+7Cyz7A
hrWAvR7Oe7nBFVPmp6pwEnpBhf3mPsWlQpw3ZKZPo4fDQyRX+mDFC+2C7hkU1Q/J
BkjTG4vYn1jiQGlL3SD1PfUxcWfwzoK4cz+V3hnFY5y0dsKiBZBR1Lw5G+UkaCnD
4VaC3VAG56Vh14o5qSF3TWLZFyZ+JN6YA/M/DnwRPl8y4jnj1tJLs1DjdpEcWv6r
15FKb2FRYaC7MRehpXd22JX6fv5ii2xazU3IfLucBrb4Vj+wAJrBY4wA3x/CFkAa
as1VmxLkgoJEWa3M71tQOJBC8+QqkRb++PRUI3aadt2H4hXHfx3AmBuKkVroeS8o
0BDhWGiTW4AqXUajkQcTc/mKV2x6h83V3DLyBRL1iC3+7qaBVhPNtxW+v6ln0Ce1
FRG2I9LZp+RtWrVVyIPsa03V2V5OD7PTIBXK6TYtuqL+3uu7TNNc+UySvqDHWLL+
Zo2ogpq//kZbjMdntNVhDEj12LW3zG05dtNuFEeJeuPM28yiXXtoWDmI49RAUQ4v
RN6SwEXnKWehwG+YITYavV6gfHWlXdZ7grgCMHyViF/s9khBp7AGxbRzR0JXgXqL
ko1Ojpbq2mdvjwGFQfde4MAqAxM3FPxdxGVLrgi+lgGTsEKv6IzrTo28teyAM81O
D6cH0ldY90w=
=6y+F
-----END PGP SIGNATURE-----
Merge tag 'fscache-fixes-20181130' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
Pull fscache and cachefiles fixes from David Howells:
"Misc fixes:
- Fix an assertion failure at fs/cachefiles/xattr.c:138 caused by a
race between a cache object lookup failing and someone attempting
to reenable that object, thereby triggering an update of the
object's attributes.
- Fix an assertion failure at fs/fscache/operation.c:449 caused by a
split atomic subtract and atomic read that allows a race to happen.
- Fix a leak of backing pages when simultaneously reading the same
page from the same object from two or more threads.
- Fix a hang due to a race between a cache object being discarded and
the corresponding cookie being reenabled.
There are also some minor cleanups:
- Cast an enum value to a different enum type to prevent clang from
generating a warning. This shouldn't cause any sort of change in
the emitted code.
- Use ktime_get_real_seconds() instead of get_seconds(). This is just
used to uniquify a filename for an object to be placed in the
graveyard. Objects placed there are deleted by cachfilesd in
userspace immediately thereafter.
- Remove an initialised, but otherwise unused variable. This should
have been entirely optimised away anyway"
* tag 'fscache-fixes-20181130' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
fscache, cachefiles: remove redundant variable 'cache'
cachefiles: avoid deprecated get_seconds()
cachefiles: Explicitly cast enumerated type in put_object
fscache: fix race between enablement and dropping of object
cachefiles: Fix page leak in cachefiles_read_backing_file while vmscan is active
fscache: Fix race in fscache_op_complete() due to split atomic_sub & read
cachefiles: Fix an assertion failure when trying to update a failed object
ocfs2_get_dentry() calls iput(inode) to drop the reference count of
inode, and if the reference count hits 0, inode is freed. However, in
this function, it then reads inode->i_generation, which may result in a
use after free bug. Move the put operation later.
Link: http://lkml.kernel.org/r/1543109237-110227-1-git-send-email-bianpan2016@163.com
Fixes: 781f200cb7a("ocfs2: Remove masklog ML_EXPORT.")
Signed-off-by: Pan Bian <bianpan2016@163.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Changwei Ge <ge.changwei@h3c.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
After the VMA to register the uffd onto is found, check that it has
VM_MAYWRITE set before allowing registration. This way we inherit all
common code checks before allowing to fill file holes in shmem and
hugetlbfs with UFFDIO_COPY.
The userfaultfd memory model is not applicable for readonly files unless
it's a MAP_PRIVATE.
Link: http://lkml.kernel.org/r/20181126173452.26955-4-aarcange@redhat.com
Fixes: ff62a34210 ("hugetlb: implement memfd sealing")
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Hugh Dickins <hughd@google.com>
Reported-by: Jann Horn <jannh@google.com>
Fixes: 4c27fe4c4c ("userfaultfd: shmem: add shmem_mcopy_atomic_pte for userfaultfd support")
Cc: <stable@vger.kernel.org>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
hfs_bmap_free() frees node via hfs_bnode_put(node). However it then
reads node->this when dumping error message on an error path, which may
result in a use-after-free bug. This patch frees node only when it is
never used.
Link: http://lkml.kernel.org/r/1543053441-66942-1-git-send-email-bianpan2016@163.com
Signed-off-by: Pan Bian <bianpan2016@163.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Ernesto A. Fernandez <ernesto.mnd.fernandez@gmail.com>
Cc: Joe Perches <joe@perches.com>
Cc: Viacheslav Dubeyko <slava@dubeyko.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
hfs_bmap_free() frees the node via hfs_bnode_put(node). However, it
then reads node->this when dumping error message on an error path, which
may result in a use-after-free bug. This patch frees the node only when
it is never again used.
Link: http://lkml.kernel.org/r/1542963889-128825-1-git-send-email-bianpan2016@163.com
Fixes: a1185ffa2fc ("HFS rewrite")
Signed-off-by: Pan Bian <bianpan2016@163.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Joe Perches <joe@perches.com>
Cc: Ernesto A. Fernandez <ernesto.mnd.fernandez@gmail.com>
Cc: Viacheslav Dubeyko <slava@dubeyko.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
ocfs2_defrag_extent may fall into deadlock.
ocfs2_ioctl_move_extents
ocfs2_ioctl_move_extents
ocfs2_move_extents
ocfs2_defrag_extent
ocfs2_lock_allocators_move_extents
ocfs2_reserve_clusters
inode_lock GLOBAL_BITMAP_SYSTEM_INODE
__ocfs2_flush_truncate_log
inode_lock GLOBAL_BITMAP_SYSTEM_INODE
As backtrace shows above, ocfs2_reserve_clusters() will call inode_lock
against the global bitmap if local allocator has not sufficient cluters.
Once global bitmap could meet the demand, ocfs2_reserve_cluster will
return success with global bitmap locked.
After ocfs2_reserve_cluster(), if truncate log is full,
__ocfs2_flush_truncate_log() will definitely fall into deadlock because
it needs to inode_lock global bitmap, which has already been locked.
To fix this bug, we could remove from
ocfs2_lock_allocators_move_extents() the code which intends to lock
global allocator, and put the removed code after
__ocfs2_flush_truncate_log().
ocfs2_lock_allocators_move_extents() is referred by 2 places, one is
here, the other does not need the data allocator context, which means
this patch does not affect the caller so far.
Link: http://lkml.kernel.org/r/20181101071422.14470-1-lchen@suse.com
Signed-off-by: Larry Chen <lchen@suse.com>
Reviewed-by: Changwei Ge <ge.changwei@h3c.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull vfs fixes from Al Viro:
"Assorted fixes all over the place.
The iov_iter one is this cycle regression (splice from UDP triggering
WARN_ON()), the rest is older"
* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
afs: Use d_instantiate() rather than d_add() and don't d_drop()
afs: Fix missing net error handling
afs: Fix validation/callback interaction
iov_iter: teach csum_and_copy_to_iter() to handle pipe-backed ones
exportfs: do not read dentry after free
exportfs: fix 'passing zero to ERR_PTR()' warning
aio: fix failure to put the file pointer
sysv: return 'err' instead of 0 in __sysv_write_inode
Currently, a lock can block pending requests, but all pending
requests are equal. If lots of pending requests are
mutually exclusive, this means they will all be woken up
and all but one will fail. This can hurt performance.
So we will allow pending requests to block other requests.
Only the first request will be woken, and it will wake the others.
This patch doesn't implement this fully, but prepares the way.
- It acknowledges that a request might be blocking other requests,
and when the request is converted to a lock, those blocked
requests are moved across.
- When a request is requeued or discarded, all blocked requests are
woken.
- When deadlock-detection looks for the lock which blocks a
given request, we follow the chain of ->fl_blocker all
the way to the top.
Tested-by: kernel test robot <rong.a.chen@intel.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Reviewed-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Both locks_remove_posix() and locks_remove_flock() use a
struct file_lock without calling locks_init_lock() on it.
This means the various list_heads are not initialized, which
will become a problem with a later patch.
So change them both to initialize properly. For flock locks,
this involves using flock_make_lock(), and changing it to
allow a file_lock to be passed in, so memory allocation isn't
always needed.
Signed-off-by: NeilBrown <neilb@suse.com>
Reviewed-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Rather than assuming all-zeros is sufficient, use the available API to
initialize the file_lock structure use for unlock. VFS-level changes
will soon make it important that the list_heads in file_lock are
always properly initialized.
Signed-off-by: NeilBrown <neilb@suse.com>
Reviewed-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Rather than assuming all-zeros is sufficient, use the available API to
initialize the file_lock structure use for unlock. VFS-level changes
will soon make it important that the list_heads in file_lock are
always properly initialized.
Signed-off-by: NeilBrown <neilb@suse.com>
Reviewed-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Using memcpy() to copy lock requests leaves the various
list_head in an inconsistent state.
As we will soon attach a list of conflicting request to
another pending request, we need these lists to be consistent.
So change NFS to use locks_init_lock/locks_copy_lock instead
of memcpy.
Signed-off-by: NeilBrown <neilb@suse.com>
Reviewed-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
This functionality will be useful in future patches, so
split it out from locks_wake_up_blocks().
Signed-off-by: NeilBrown <neilb@suse.com>
Reviewed-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
struct file lock contains an 'fl_next' pointer which
is used to point to the lock that this request is blocked
waiting for. So rename it to fl_blocker.
The fl_blocked list_head in an active lock is the head of a list of
blocked requests. In a request it is a node in that list.
These are two distinct uses, so replace with two list_heads
with different names.
fl_blocked_requests is the head of a list of blocked requests
fl_blocked_member is a node in a member of that list.
The two different list_heads are never used at the same time, but that
will change in a future patch.
Note that a tracepoint is changed to report fl_blocker instead
of fl_next.
Signed-off-by: NeilBrown <neilb@suse.com>
Reviewed-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Variable 'cache' is being assigned but is never used hence it is
redundant and can be removed.
Cleans up clang warning:
warning: variable 'cache' set but not used [-Wunused-but-set-variable]
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David Howells <dhowells@redhat.com>
get_seconds() returns an unsigned long can overflow on some architectures
and is deprecated because of that. In cachefs, we cast that number to
a a 32-bit integer, which will overflow in year 2106 on all architectures.
As confirmed by David Howells, the overflow probably isn't harmful
in the end, since the timestamps are only used to make the file names
unique, but they don't strictly have to be in monotonically increasing
order since the files only exist in order to be deleted as quickly
as possible.
Moving to ktime_get_real_seconds() avoids the deprecated interface.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David Howells <dhowells@redhat.com>
Clang warns when one enumerated type is implicitly converted to another.
fs/cachefiles/namei.c:247:50: warning: implicit conversion from
enumeration type 'enum cachefiles_obj_ref_trace' to different
enumeration type 'enum fscache_obj_ref_trace' [-Wenum-conversion]
cache->cache.ops->put_object(&xobject->fscache,
cachefiles_obj_put_wait_retry);
Silence this warning by explicitly casting to fscache_obj_ref_trace,
which is also done in put_object.
Reported-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
It was observed that a process blocked indefintely in
__fscache_read_or_alloc_page(), waiting for FSCACHE_COOKIE_LOOKING_UP
to be cleared via fscache_wait_for_deferred_lookup().
At this time, ->backing_objects was empty, which would normaly prevent
__fscache_read_or_alloc_page() from getting to the point of waiting.
This implies that ->backing_objects was cleared *after*
__fscache_read_or_alloc_page was was entered.
When an object is "killed" and then "dropped",
FSCACHE_COOKIE_LOOKING_UP is cleared in fscache_lookup_failure(), then
KILL_OBJECT and DROP_OBJECT are "called" and only in DROP_OBJECT is
->backing_objects cleared. This leaves a window where
something else can set FSCACHE_COOKIE_LOOKING_UP and
__fscache_read_or_alloc_page() can start waiting, before
->backing_objects is cleared
There is some uncertainty in this analysis, but it seems to be fit the
observations. Adding the wake in this patch will be handled correctly
by __fscache_read_or_alloc_page(), as it checks if ->backing_objects
is empty again, after waiting.
Customer which reported the hang, also report that the hang cannot be
reproduced with this fix.
The backtrace for the blocked process looked like:
PID: 29360 TASK: ffff881ff2ac0f80 CPU: 3 COMMAND: "zsh"
#0 [ffff881ff43efbf8] schedule at ffffffff815e56f1
#1 [ffff881ff43efc58] bit_wait at ffffffff815e64ed
#2 [ffff881ff43efc68] __wait_on_bit at ffffffff815e61b8
#3 [ffff881ff43efca0] out_of_line_wait_on_bit at ffffffff815e625e
#4 [ffff881ff43efd08] fscache_wait_for_deferred_lookup at ffffffffa04f2e8f [fscache]
#5 [ffff881ff43efd18] __fscache_read_or_alloc_page at ffffffffa04f2ffe [fscache]
#6 [ffff881ff43efd58] __nfs_readpage_from_fscache at ffffffffa0679668 [nfs]
#7 [ffff881ff43efd78] nfs_readpage at ffffffffa067092b [nfs]
#8 [ffff881ff43efda0] generic_file_read_iter at ffffffff81187a73
#9 [ffff881ff43efe50] nfs_file_read at ffffffffa066544b [nfs]
#10 [ffff881ff43efe70] __vfs_read at ffffffff811fc756
#11 [ffff881ff43efee8] vfs_read at ffffffff811fccfa
#12 [ffff881ff43eff18] sys_read at ffffffff811fda62
#13 [ffff881ff43eff50] entry_SYSCALL_64_fastpath at ffffffff815e986e
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: David Howells <dhowells@redhat.com>
commit e259221763 ("fs: simplify the
generic_write_sync prototype") reworked callers of generic_write_sync(),
and ended up dropping the error return for the directio path. Prior to
that commit, in dio_complete(), an error would be bubbled up the stack,
but after that commit, errors passed on to dio_complete were eaten up.
This was reported on the list earlier, and a fix was proposed in
https://lore.kernel.org/lkml/20160921141539.GA17898@infradead.org/, but
never followed up with. We recently hit this bug in our testing where
fencing io errors, which were previously erroring out with EIO, were
being returned as success operations after this commit.
The fix proposed on the list earlier was a little short -- it would have
still called generic_write_sync() in case `ret` already contained an
error. This fix ensures generic_write_sync() is only called when there's
no pending error in the write. Additionally, transferred is replaced
with ret to bring this code in line with other callers.
Fixes: e259221763 ("fs: simplify the generic_write_sync prototype")
Reported-by: Ravi Nankani <rnankani@amazon.com>
Signed-off-by: Maximilian Heyne <mheyne@amazon.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
CC: Torsten Mehlan <tomeh@amazon.de>
CC: Uwe Dannowski <uwed@amazon.de>
CC: Amit Shah <aams@amazon.de>
CC: David Woodhouse <dwmw@amazon.co.uk>
CC: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The bio referencing has a trick that doesn't do any actual atomic
inc/dec on the reference count until we have to elevator to > 1. For the
async IO O_DIRECT case, we can't use the simple DIO variants, so we use
__blkdev_direct_IO(). It always grabs an extra reference to the bio
after allocation, which means we then enter the slower path of actually
having to do atomic_inc/dec on the count.
We don't need to do that for the async case, unless we end up going
multi-bio, in which case we're already doing huge amounts of IO. For the
smaller IO case (< BIO_MAX_PAGES), we can do without the extra ref.
Based on an earlier patch (and commit log) from Jens Axboe.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use d_instantiate() rather than d_add() and don't d_drop() in
afs_vnode_new_inode(). The dentry shouldn't be removed as it's not
changing its name.
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
kAFS can be given certain network errors (EADDRNOTAVAIL, EHOSTDOWN and
ERFKILL) that it doesn't handle in its server/address rotation algorithms.
They cause the probing and rotation to abort immediately rather than
rotating.
Fix this by:
(1) Abstracting out the error prioritisation from the VL and FS rotation
algorithms into a common function and expand usage into the server
probing code.
When multiple errors are available, this code selects the one we'd
prefer to return.
(2) Add handling for EADDRNOTAVAIL, EHOSTDOWN and ERFKILL.
Fixes: 0fafdc9f88 ("afs: Fix file locking")
Fixes: 0338747d8454 ("afs: Probe multiple fileservers simultaneously")
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
When afs_validate() is called to validate a vnode (inode), there are two
unhandled cases in the fastpath at the top of the function:
(1) If the vnode is promised (AFS_VNODE_CB_PROMISED is set), the break
counters match and the data has expired, then there's an implicit case
in which the vnode needs revalidating.
This has no consequences since the default "valid = false" set at the
top of the function happens to do the right thing.
(2) If the vnode is not promised and it hasn't been deleted
(AFS_VNODE_DELETED is not set) then there's a default case we're not
handling in which the vnode is invalid. If the vnode is invalid, we
need to bring cb_s_break and cb_v_break up to date before we refetch
the status.
As a consequence, once the server loses track of the client
(ie. sufficient time has passed since we last sent it an operation),
it will send us a CB.InitCallBackState* operation when we next try to
talk to it. This calls afs_init_callback_state() which increments
afs_server::cb_s_break, but this then doesn't propagate to the
afs_vnode record.
The result being that every afs_validate() call thereafter sends a
status fetch operation to the server.
Clarify and fix this by:
(A) Setting valid in all the branches rather than initialising it at the
top so that the compiler catches where we've missed.
(B) Restructuring the logic in the 'promised' branch so that we set valid
to false if the callback is due to expire (or has expired) and so that
the final case is that the vnode is still valid.
(C) Adding an else-statement that ups cb_s_break and cb_v_break if the
promised and deleted cases don't match.
Fixes: c435ee3455 ("afs: Overhaul the callback handling")
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The synchronize_rcu() in namespace_unlock() is called every time
a filesystem is unmounted. If a great many filesystems are mounted,
this can cause a noticable slow-down in, for example, system shutdown.
The sequence:
mkdir -p /tmp/Mtest/{0..5000}
time for i in /tmp/Mtest/*; do mount -t tmpfs tmpfs $i ; done
time umount /tmp/Mtest/*
on a 4-cpu VM can report 8 seconds to mount the tmpfs filesystems, and
100 seconds to unmount them.
Boot the same VM with 1 CPU and it takes 18 seconds to mount the
tmpfs filesystems, but only 36 to unmount.
If we change the synchronize_rcu() to synchronize_rcu_expedited()
the umount time on a 4-cpu VM drop to 0.6 seconds
I think this 200-fold speed up is worth the slightly high system
impact of using synchronize_rcu_expedited().
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> (from general rcu perspective)
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The actual number of bytes stored in a PRZ is smaller than the
bytes requested by platform data, since there is a header on each
PRZ. Additionally, if ECC is enabled, there are trailing bytes used
as well. Normally this mismatch doesn't matter since PRZs are circular
buffers and the leading "overflow" bytes are just thrown away. However, in
the case of a compressed record, this rather badly corrupts the results.
This corruption was visible with "ramoops.mem_size=204800 ramoops.ecc=1".
Any stored crashes would not be uncompressable (producing a pstorefs
"dmesg-*.enc.z" file), and triggering errors at boot:
[ 2.790759] pstore: crypto_comp_decompress failed, ret = -22!
Backporting this depends on commit 70ad35db33 ("pstore: Convert console
write to use ->write_buf")
Reported-by: Joel Fernandes <joel@joelfernandes.org>
Fixes: b0aad7a99c ("pstore: Add compression support to pstore")
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAlv/mEEACgkQnJ2qBz9k
QNm0kggA8SckBccdUZqpjxohASx9crp+jJeor2TRYCwyVfqdbDjkCOfv9k7+ddD4
D1qN/AEudQXPzr/DrjI0W9fnFG957FLo60RQ0aGwsq/3wfmzfMJ0qXIw2tyoqjIH
VxFecL2f3TKMr5zU9N1QoxJVAMAo5LOGbXO+qNYgTaOJ0EBpw9kxK14ib9JOi+pQ
CItZkAv4eOWMsA/AaRsWN7AGmsziwcMdoAZYRT2wiWdTmubYn66Negnffq2jVHjP
/kYlFjNcgpsuTg1AiTM+JlBwctEeLm37PUyimip3cdYwu6HcfNmCHDjEIzN2YphJ
twzAauTEr0bjNFiNWYraUPVHEcspNw==
=kKDs
-----END PGP SIGNATURE-----
Merge tag 'fixes_for_v4.20-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull ext2 and udf fixes from Jan Kara:
"Three small ext2 and udf fixes"
* tag 'fixes_for_v4.20-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
ext2: fix potential use after free
ext2: initialize opts.s_mount_opt as zero before using it
udf: Allow mounting volumes with incorrect identification strings
Trivial fix to clean up indentation, add in missing tabs.
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
__cld_pipe_upcall() emits a "do not call blocking ops when
!TASK_RUNNING" warning due to the dput() call in rpc_queue_upcall().
Fix it by using a completion instead of hand coding the wait.
Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Anatoly Trosinenko reports that this:
1) Checkout fresh master Linux branch (tested with commit e195ca6cb)
2) Copy x84_64-config-4.14 to .config, then enable NFS server v4 and build
3) From `kvm-xfstests shell`:
results in NULL dereference in locks_end_grace.
Check that nfsd has been started before trying to end the grace period.
Reported-by: Anatoly Trosinenko <anatoly.trosinenko@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
After we drop the i_pages lock, the inode can be freed at any time.
The get_unlocked_entry() code has no choice but to reacquire the lock,
so it can't be used here. Create a new wait_entry_unlocked() which takes
care not to acquire the lock or dereference the address_space in any way.
Fixes: c2a7d2a115 ("filesystem-dax: Introduce dax_lock_mapping_entry()")
Cc: <stable@vger.kernel.org>
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
If we race with inode destroy, it's possible for page->mapping to be
NULL before we even enter this routine, as well as after having slept
waiting for the dax entry to become unlocked.
Fixes: c2a7d2a115 ("filesystem-dax: Introduce dax_lock_mapping_entry()")
Cc: <stable@vger.kernel.org>
Reported-by: Jan Kara <jack@suse.cz>
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAlv9qYgACgkQxWXV+ddt
WDupPQ/8DdeLZYQG1tlx2Q+X4/tqPVyAzUjguYzbIY7wvSs1zbEEedENsD8E97yC
So8ooGnP5B6/dqVidLFQBPwTXN59GybYbrDci8qh0DOJTl3+1r8byD9JC+iofrOF
tltJkZ+eCOQyyqHHzlzw15uNOg48Qzj1oXvTAcE0P6iN5UcvcfwRW/S39pjsn63C
63zc09XJ1hmJMJTWZo5h3GoD2UvzrwGXPKXNdv/NWkw9sqQbWdjvZFdqKbvY1VeM
Oa6FPAPErJqEEEePhpDYbyRcnzjJRMs0deLGpGGChGldQxgMO8ILzBwh/KalfzK7
h7LIuv1EclUqlyv0mXPqg2E/C3n2UMPqQYFsK9Lt+4Y/PkrWA2jx0lSg0fBl3k8c
7PyiTqPNPNF8LU48tPEnOzJuNPkquOycgdyQOUpHnS43OF5OLIb6tVyjK4eJHRWw
xtP65M72qM8T65+gsxYcdm0lvIDLidIwFS+2g4ibKU7EwlYkTC9AHFIAyFKTgxeP
MpkIH90mKhSxOpbq8RICgr2jWcJZYoFQ4soi1oE+bgyjv75PyhJ0eXOprCh/4KZp
nkXlPy2skkO9gGecyvr51x/opDEjEkObyOjQm2LhhWYvgcnHgW8Zp1jhQKxabHvz
iZdVIs/agOerpk1d9ZBHhIXOeS2UcE5klqVRAdf961Wobh+HNis=
=cCvI
-----END PGP SIGNATURE-----
Merge tag 'for-4.20-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"Some of these bugs are being hit during testing so we'd like to get
them merged, otherwise there are usual stability fixes for stable
trees"
* tag 'for-4.20-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: relocation: set trans to be NULL after ending transaction
Btrfs: fix race between enabling quotas and subvolume creation
Btrfs: send, fix infinite loop due to directory rename dependencies
Btrfs: ensure path name is null terminated at btrfs_control_ioctl
Btrfs: fix rare chances for data loss when doing a fast fsync
btrfs: Always try all copies when reading extent buffers
[Description]
In a heavily loaded system where the system pagecache is nearing memory
limits and fscache is enabled, pages can be leaked by fscache while trying
read pages from cachefiles backend. This can happen because two
applications can be reading same page from a single mount, two threads can
be trying to read the backing page at same time. This results in one of
the threads finding that a page for the backing file or netfs file is
already in the radix tree. During the error handling cachefiles does not
clean up the reference on backing page, leading to page leak.
[Fix]
The fix is straightforward, to decrement the reference when error is
encountered.
[dhowells: Note that I've removed the clearance and put of newpage as
they aren't attested in the commit message and don't appear to actually
achieve anything since a new page is only allocated is newpage!=NULL and
any residual new page is cleared before returning.]
[Testing]
I have tested the fix using following method for 12+ hrs.
1) mkdir -p /mnt/nfs ; mount -o vers=3,fsc <server_ip>:/export /mnt/nfs
2) create 10000 files of 2.8MB in a NFS mount.
3) start a thread to simulate heavy VM presssure
(while true ; do echo 3 > /proc/sys/vm/drop_caches ; sleep 1 ; done)&
4) start multiple parallel reader for data set at same time
find /mnt/nfs -type f | xargs -P 80 cat > /dev/null &
find /mnt/nfs -type f | xargs -P 80 cat > /dev/null &
find /mnt/nfs -type f | xargs -P 80 cat > /dev/null &
..
..
find /mnt/nfs -type f | xargs -P 80 cat > /dev/null &
find /mnt/nfs -type f | xargs -P 80 cat > /dev/null &
5) finally check using cat /proc/fs/fscache/stats | grep -i pages ;
free -h , cat /proc/meminfo and page-types -r -b lru
to ensure all pages are freed.
Reviewed-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: Shantanu Goel <sgoel01@yahoo.com>
Signed-off-by: Kiran Kumar Modukuri <kiran.modukuri@gmail.com>
[dja: forward ported to current upstream]
Signed-off-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: David Howells <dhowells@redhat.com>
kmem_cache_destroy(NULL) is safe, so removes NULL check before
freeing the mem. This patch also fix ifnullfree.cocci warnings.
Signed-off-by: Wen Yang <wen.yang99@zte.com.cn>
Signed-off-by: David Teigland <teigland@redhat.com>
If cachefiles gets an error other then ENOENT when trying to look up an
object in the cache (in this case, EACCES), the object state machine will
eventually transition to the DROP_OBJECT state.
This state invokes fscache_drop_object() which tries to sync the auxiliary
data with the cache (this is done lazily since commit 402cb8dda9) on an
incomplete cache object struct.
The problem comes when cachefiles_update_object_xattr() is called to
rewrite the xattr holding the data. There's an assertion there that the
cache object points to a dentry as we're going to update its xattr. The
assertion trips, however, as dentry didn't get set.
Fix the problem by skipping the update in cachefiles if the object doesn't
refer to a dentry. A better way to do it could be to skip the update from
the DROP_OBJECT state handler in fscache, but that might deny the cache the
opportunity to update intermediate state.
If this error occurs, the kernel log includes lines that look like the
following:
CacheFiles: Lookup failed error -13
CacheFiles:
CacheFiles: Assertion failed
------------[ cut here ]------------
kernel BUG at fs/cachefiles/xattr.c:138!
...
Workqueue: fscache_object fscache_object_work_func [fscache]
RIP: 0010:cachefiles_update_object_xattr.cold.4+0x18/0x1a [cachefiles]
...
Call Trace:
cachefiles_update_object+0xdd/0x1c0 [cachefiles]
fscache_update_aux_data+0x23/0x30 [fscache]
fscache_drop_object+0x18e/0x1c0 [fscache]
fscache_object_work_func+0x74/0x2b0 [fscache]
process_one_work+0x18d/0x340
worker_thread+0x2e/0x390
? pwq_unbound_release_workfn+0xd0/0xd0
kthread+0x112/0x130
? kthread_bind+0x30/0x30
ret_from_fork+0x35/0x40
Note that there are actually two issues here: (1) EACCES happened on a
cache object and (2) an oops occurred. I think that the second is a
consequence of the first (it certainly looks like it ought to be). This
patch only deals with the second.
Fixes: 402cb8dda9 ("fscache: Attach the index key and aux data to the cookie")
Reported-by: Zhibin Li <zhibli@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
If we want to re-enable nat_bits, we rely on fsck which requires full scan
of directory tree. Let's do that by regular fsck or unclean shutdown.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
We fail to advance the read pointer when reading the stat.oh field that
identifies the lock-holder in a TEST result.
This turns out not to matter if the server is knfsd, which always
returns a zero-length field. But other servers (Ganesha is an example)
may not do this. The result is bad values in fcntl F_GETLK results.
Fix this.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
The idea here was that renaming a file on a nosubtreecheck export would
make lookups of the old filehandle return STALE, making it impossible
for clients to reclaim opens.
But during the grace period I think we should also hold off on
operations that would break delegations.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Zero-length writes are legal; from 5661 section 18.32.3: "If the count
is zero, the WRITE will succeed and return a count of zero subject to
permissions checking".
This check is unnecessary and is causing zero-length reads to return
EINVAL.
Cc: stable@vger.kernel.org
Fixes: 3fd9557aec "NFSD: Refactor the generic write vector fill helper"
Cc: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Now that synchronize_rcu() waits for preempt-disable regions of code
as well as RCU read-side critical sections, synchronize_sched() can be
replaced by synchronize_rcu(). This commit therefore makes this change.
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: <linux-fsdevel@vger.kernel.org>
kernfs_notify() does two notifications: poll and fsnotify. Originally,
both notifications were done from scheduled work context and all that
kernfs_notify() did was schedule the work.
This patch simply moves the poll notification from the scheduled work
handler to kernfs_notify(). The fsnotify notification still needs to be
done from scheduled work context because it can sleep (it needs to lock
a mutex).
If the poll notification is time critical (the notified thread needs to
wake as quickly as possible), it's better to do it from kernfs_notify()
directly. One example is calling sysfs_notify_dirent() from a hardware
interrupt handler to wake up a thread and handle the interrupt in user
space.
Signed-off-by: Radu Rendec <radu.rendec@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The function ext2_xattr_set calls brelse(bh) to drop the reference count
of bh. After that, bh may be freed. However, following brelse(bh),
it reads bh->b_data via macro HDR(bh). This may result in a
use-after-free bug. This patch moves brelse(bh) after reading field.
CC: stable@vger.kernel.org
Signed-off-by: Pan Bian <bianpan2016@163.com>
Signed-off-by: Jan Kara <jack@suse.cz>
We need to initialize opts.s_mount_opt as zero before using it, else we
may get some unexpected mount options.
Fixes: 088519572c ("ext2: Parse mount options into a dedicated structure")
CC: stable@vger.kernel.org
Signed-off-by: xingaopeng <xingaopeng@huawei.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Previously, we added a parameter @map.m_may_create to trigger OPU
allocation and call f2fs_balance_fs() correctly.
But in get_more_blocks(), @create has been overwritten by below code.
So the function f2fs_map_blocks() will not allocate new block address
but directly go out. Meanwile,there are several functions calling
f2fs_map_blocks() directly and @map.m_may_create not initialized.
CODE:
create = dio->op == REQ_OP_WRITE;
if (dio->flags & DIO_SKIP_HOLES) {
if (fs_startblk <= ((i_size_read(dio->inode) - 1) >>
i_blkbits))
create = 0;
}
This patch fixes it.
Signed-off-by: Jia Zhu <zhujia13@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Previously, we allocated a new block address for OPU mode in direct_IO.
But the new address couldn't be assigned to @map->m_pblk correctly.
This patch fix it.
Cc: <stable@vger.kernel.org>
Fixes: 511f52d02f05 ("f2fs: allow out-place-update for direct IO in LFS mode")
Signed-off-by: Jia Zhu <zhujia13@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Adjust the trace print in f2fs_get_victim() to cover GC done by
F2FS_IOC_GARBAGE_COLLECT_RANGE.
Signed-off-by: Sahitya Tummala <stummala@codeaurora.org>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Allow node type segments also to be GC'd via f2fs ioctl
F2FS_IOC_GARBAGE_COLLECT_RANGE.
Signed-off-by: Sahitya Tummala <stummala@codeaurora.org>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The function truncate_node frees the page with f2fs_put_page. However,
the page index is read after that. So, the patch reads the index before
freeing the page.
Fixes: bf39c00a9a ("f2fs: drop obsolete node page when it is truncated")
Cc: <stable@vger.kernel.org>
Signed-off-by: Pan Bian <bianpan2016@163.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
When call f2fs_acl_create_masq() failed, the caller f2fs_acl_create()
should return -EIO instead of -ENOMEM, this patch makes it consistent
with posix_acl_create() which has been fixed in commit beaf226b86
("posix_acl: don't ignore return value of posix_acl_create_masq()").
Fixes: 83dfe53c18 ("f2fs: fix reference leaks in f2fs_acl_create")
Signed-off-by: Tiezhu Yang <kernelpatch@126.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
After merging the f2fs tree, today's linux-next build
(x86_64_allmodconfig) produced this warning:
In file included from fs/f2fs/dir.c:11:
fs/f2fs/f2fs.h: In function '__mark_inode_dirty_flag':
fs/f2fs/f2fs.h:2388:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
if (set)
^
fs/f2fs/f2fs.h:2390:2: note: here
case FI_DATA_EXIST:
^~~~
Exposed by my use of -Wimplicit-fallthrough
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The following race could lead to inconsistent SIT bitmap:
Task A Task B
====== ======
f2fs_write_checkpoint
block_operations
f2fs_lock_all
down_write(node_change)
down_write(node_write)
... sync ...
up_write(node_change)
f2fs_file_write_iter
set_inode_flag(FI_NO_PREALLOC)
......
f2fs_write_begin(index=0, has inline data)
prepare_write_begin
__do_map_lock(AIO) => down_read(node_change)
f2fs_convert_inline_page => update SIT
__do_map_lock(AIO) => up_read(node_change)
f2fs_flush_sit_entries <= inconsistent SIT
finish write checkpoint
sudden-power-off
If SPO occurs after checkpoint is finished, SIT bitmap will be set
incorrectly.
Signed-off-by: Sheng Yong <shengyong1@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If namelen is corrupted to have very long value, fill_dentries can copy
wrong memory area.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Previously, when f2fs finds which temp bio cache owns the target page,
it will flush all the three temp bio caches, but we only need to flush
one single bio cache indeed, which can help to keep bio merged.
Signed-off-by: Yunlong Song <yunlong.song@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In get_more_blocks(), we may override @create as below code:
create = dio->op == REQ_OP_WRITE;
if (dio->flags & DIO_SKIP_HOLES) {
if (fs_startblk <= ((i_size_read(dio->inode) - 1) >>
i_blkbits))
create = 0;
}
But in f2fs_map_blocks(), we only trigger f2fs_balance_fs() if @create
is 1, so in LFS mode, dio overwrite under LFS mode can easily run out
of free segments, result in below panic.
Call Trace:
allocate_segment_by_default+0xa8/0x270 [f2fs]
f2fs_allocate_data_block+0x1ea/0x5c0 [f2fs]
__allocate_data_block+0x306/0x480 [f2fs]
f2fs_map_blocks+0x6f6/0x920 [f2fs]
__get_data_block+0x4f/0xb0 [f2fs]
get_data_block_dio_write+0x50/0x60 [f2fs]
do_blockdev_direct_IO+0xcd5/0x21e0
__blockdev_direct_IO+0x3a/0x3c
f2fs_direct_IO+0x1ff/0x4a0 [f2fs]
generic_file_direct_write+0xd9/0x160
__generic_file_write_iter+0xbb/0x1e0
f2fs_file_write_iter+0xaf/0x220 [f2fs]
__vfs_write+0xd0/0x130
vfs_write+0xb2/0x1b0
SyS_pwrite64+0x69/0xa0
? vtime_user_exit+0x29/0x70
do_syscall_64+0x6e/0x160
entry_SYSCALL64_slow_path+0x25/0x25
RIP: new_curseg+0x36f/0x380 [f2fs] RSP: ffffac570393f7a8
So this patch introduces a parameter map.m_may_create to indicate that
f2fs_map_blocks() is called from write or read path, which can give the
right hint to let f2fs_map_blocks() trigger OPU allocation and call
f2fs_balanc_fs() correctly.
BTW, it disables physical address preallocation for direct IO in
f2fs_preallocate_blocks, which is redundant to OPU allocation of
f2fs_map_blocks.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch adds f2fs_dio_submit_bio() to hook submit_io/end_io functions
in direct IO path, in order to account DIO.
Later, we will add this count into is_idle() to let background GC/Discard
thread be aware of DIO.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch move dir data flush to write checkpoint process, by
doing this, it may reduce some time for dir fsync.
pre:
-f2fs_do_sync_file enter
-file_write_and_wait_range <- flush & wait
-write_checkpoint
-do_checkpoint <- wait all
-f2fs_do_sync_file exit
now:
-f2fs_do_sync_file enter
-write_checkpoint
-block_operations <- flush dir & no wait
-do_checkpoint <- wait all
-f2fs_do_sync_file exit
Signed-off-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Use DEFINE_SHOW_ATTRIBUTE macro to simplify the code.
Signed-off-by: Yangtao Li <tiny.windzz@gmail.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
f2fs_ioc_gc_range skips blocks_per_seg each time, however, f2fs_gc moves
blocks of section each time, so fix it from segment to section.
Signed-off-by: Yunlong Song <yunlong.song@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Add one sysfs entry to control migration granularity of GC in large
section f2fs, it can be tuned to mitigate heavy overhead of migrating
huge number of blocks in large section.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In F2FS_HAS_FEATURE(), we will use F2FS_SB(sb) to get sbi pointer to
access .raw_super field, to avoid unneeded pointer conversion, this
patch changes to F2FS_HAS_FEATURE() accept sbi parameter directly.
Just do cleanup, no logic change.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Section is minimal garbage collection unit of f2fs, in zoned block
device, or ancient block mapping flash device, in order to improve
GC efficiency, we can align GC unit to lower device erase unit,
normally, it consists of multiple of segments.
Once background or foreground GC triggers, it brings a large number
of IOs, which will impact user IO, and also occupy cpu/memory resource
intensively.
So, to reduce impact of GC on large size section, this patch supports
subsectional GC, in one cycle of GC, it only migrate partial segment{s}
in victim section. Currently, by default, we use sbi->segs_per_sec as
migration granularity.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Introduce a wrapper __is_large_section() to clean up codes.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
When sbi->segs_per_sec > 1, and if some segno has 0 valid blocks before
gc starts, do_garbage_collect will skip counting seg_freed++, and this
will cause seg_freed < sbi->segs_per_sec and finally skip sec_freed++.
Signed-off-by: Yunlong Song <yunlong.song@huawei.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Previously, we only account preflush command for flush_merge mode,
so for noflush_merge mode, we can not know in-flight preflush
command count, fix it.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The encrypted file may be corrupted by GC in following case:
Time 1: | segment 1 blkaddr = A | GC -> | segment 2 blkaddr = B |
Encrypted block 1 is moved from blkaddr A of segment 1 to blkaddr B of
segment 2,
Time 2: | segment 1 blkaddr = B | GC -> | segment 3 blkaddr = C |
Before page 1 is written back and if segment 2 become a victim, then
page 1 is moved from blkaddr B of segment 2 to blkaddr Cof segment 3,
during the GC process of Time 2, f2fs should wait for page 1 written back
before reading it, or move_data_block will read a garbage block from
blkaddr B since page is not written back to blkaddr B yet.
Commit 6aa58d8a ("f2fs: readahead encrypted block during GC") introduce
ra_data_block to read encrypted block, but it forgets to add
f2fs_wait_on_page_writeback to avoid racing between GC and flush.
Signed-off-by: Yunlong Song <yunlong.song@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
When project is set, we should use inode limit minus the used count
Signed-off-by: Ye Yin <dbyin@tencent.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
blk_poll() has always kept spinning until it found an IO. This is
fine for SYNC polling, since we need to find one request we have
pending, but in preparation for ASYNC polling it can be beneficial
to just check if we have any entries available or not.
Existing callers are converted to pass in 'spin == true', to retain
the old behavior.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Today, when sb_bread() returns NULL, this can either be because of an
I/O error or because the system failed to allocate the buffer. Since
it's an old interface, changing would require changing many call
sites.
So instead we create our own ext4_sb_bread(), which also allows us to
set the REQ_META flag.
Also fixed a problem in the xattr code where a NULL return in a
function could also mean that the xattr was not found, which could
lead to the wrong error getting returned to userspace.
Fixes: ac27a0ec11 ("ext4: initial copy of files from ext3")
Cc: stable@kernel.org # 2.6.19
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Highlights include:
Bugfixes:
- Fix a NFSv4 state manager deadlock when returning a delegation
- NFSv4.2 copy do not allocate memory under the lock
- flexfiles: Use the correct stateid for IO in the tightly coupled case
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJb+hCNAAoJEA4mA3inWBJc8ZQP/jR+uemJycwgyWINvnn6PEtE
hyiSwL+c3jhBHwX2IroF1KvaHIa8GXMbIWj+DfW1iB2htYnIJYz4IFJOGpfN1S7n
bKCgonV0V06+dFF4DqcL3HcM1L6bo26n16voi3otgY0R5U5HGwB1tocZPCbR6DpK
meiRbrmB6O962zluUlTuu9zFSvsALyZR0h4tYMGYA0MlgWQJVLH6+dufyG2Zgp2Z
OH9tUzRFknD/Q4KrJv7zrMY198mHa+RQovsO2Jc/iE4bbrSMyVNtrPuVJphsP1BD
lZ5SvvWLXjNepUMsDCK+Es7i6dUmtHsGPS6gNDwUwY9/UlwOPYlp44VJzmEYmQcz
/VrrHn3LSoKDSAVNrksghto9O4T1NPnuVja1Q+SHf5hVX5OlsxyDkvX23ZUdgdkW
BeXeNWZuAJdDTI1KU+ahm2ilfUnuFpRGRHUrH2sYczV2okC38cO5YCIRI3Tckz6e
jrhmJcw+zCWv3Yl3h2Rbf8fVRcWJHA+qLWT3Str5nCyZiqPCag7Z7br9r5316zDv
Yma7nITZO7HH1cZUv+byA0PVHU96kDsMhhpxYISrSr4lf2BcZNnjQC/0IHb7qdWz
FgpYzv/BsIi+KxyZKshiR5E60kOmVxv2wIhre8uLOuuabcGsh/wit6URVnQJ+GDp
7klRY1t1P24XaIbgBR9U
=hqbe
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-4.20-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client bugfixes from Trond Myklebust:
- Fix a NFSv4 state manager deadlock when returning a delegation
- NFSv4.2 copy do not allocate memory under the lock
- flexfiles: Use the correct stateid for IO in the tightly coupled case
* tag 'nfs-for-4.20-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
flexfiles: use per-mirror specified stateid for IO
NFSv4.2 copy do not allocate memory under the lock
NFSv4: Fix a NFSv4 state manager deadlock
We found some bugs in the DAX conversion to XArray (and one bug which
predated the XArray conversion). There were a couple of bugs in some of
the higher-level functions, which aren't actually being called in today's
kernel, but surfaced as a result of converting existing radix tree &
IDR users over to the XArray. Some of the other changes to how the
higher-level APIs work were also motivated by converting various users;
again, they're not in use in today's kernel, so changing them has a low
probability of introducing a bug.
Dan can still trigger a bug in the DAX code with hot-offline/online,
and we're working on tracking that down.
-----BEGIN PGP SIGNATURE-----
iQFIBAABCgAyFiEEejHryeLBw/spnjHrDpNsjXcpgj4FAlv542AUHHdpbGx5QGlu
ZnJhZGVhZC5vcmcACgkQDpNsjXcpgj5BoAf/QZzbBcYuYMLMDYofvHKGlmk2yx/a
ObUlxlQtXGHvPp3oC3rdwAvcN/KAMDpU0u+PXab2MnrNw5okhpS6ZwGODlkarNA4
XbVQNGbtEbACr1V3CWc0NzLbYm6JtGpMum0Wx9MVR/VdTnGArBLBYQMYa/c1YhKA
vEBPf+w0j0QoCTAgPiIvq0aksuBQERUvjhlUvoaMY7F4sAhnaW558lvaEcc1xGxq
70+3cRPT6Uh12tEvi0LKP1NNEXebvQSftMvFEUPF2xo5z2v//KEobzv/anbojxQ8
BtxouIGSr4tME9g3xSpd9rTbUcW3bwDAhuWZvpP/ViRwW2UkEQonpApdaw==
=0Ert
-----END PGP SIGNATURE-----
Merge tag 'xarray-4.20-rc4' of git://git.infradead.org/users/willy/linux-dax
Pull XArray updates from Matthew Wilcox:
"We found some bugs in the DAX conversion to XArray (and one bug which
predated the XArray conversion). There were a couple of bugs in some
of the higher-level functions, which aren't actually being called in
today's kernel, but surfaced as a result of converting existing radix
tree & IDR users over to the XArray.
Some of the other changes to how the higher-level APIs work were also
motivated by converting various users; again, they're not in use in
today's kernel, so changing them has a low probability of introducing
a bug.
Dan can still trigger a bug in the DAX code with hot-offline/online,
and we're working on tracking that down"
* tag 'xarray-4.20-rc4' of git://git.infradead.org/users/willy/linux-dax:
XArray tests: Add missing locking
dax: Avoid losing wakeup in dax_lock_mapping_entry
dax: Fix huge page faults
dax: Fix dax_unlock_mapping_entry for PMD pages
dax: Reinstate RCU protection of inode
dax: Make sure the unlocking entry isn't locked
dax: Remove optimisation from dax_lock_mapping_entry
XArray tests: Correct some 64-bit assumptions
XArray: Correct xa_store_range
XArray: Fix Documentation
XArray: Handle NULL pointers differently for allocation
XArray: Unify xa_store and __xa_store
XArray: Add xa_store_bh() and xa_store_irq()
XArray: Turn xa_erase into an exported function
XArray: Unify xa_cmpxchg and __xa_cmpxchg
XArray: Regularise xa_reserve
nilfs2: Use xa_erase_irq
XArray: Export __xa_foo to non-GPL modules
XArray: Fix xa_for_each with a single element at 0
- Numerous corruption fixes for copy on write
- Numerous corruption fixes for blocksize < pagesize writes
- Don't miscalculate AG reservations for small final AGs
- Fix page cache truncation to work properly for reflink and extent
shifting
- Fix use-after-free when retrying failed inode/dquot buffer logging
- Fix corruptions seen when using copy_file_range in directio mode
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAlv1oroACgkQ+H93GTRK
tOu+1RAAnteFwaq3WLDYmSrMia/4m52sxvatxlCCSjCGdvfvuZozTwbTdB6FFfuc
Ql6Z6F2Lx1sHDNJvwBCsO8qPB0qOhjSnBI/wPe2kz/NETGYNp88vHX7OZvkPVONl
jDaCWTcu0BWNiOGi17uTY8sBa8u1izbo5F+pEQIyUjoCgUc9JB2di9dVnUJ0byrh
wZjrmPD95ojqOozqppXfFQ0QIbozpTXR3kyU9S0EhHmbnWJZ9t08Iuhd2LjOoDB4
cUFG/1qDXuFvALyM8m1mA7xSBZpA/glFgNeAtmz53aIOZ9Tl8w8cLJJBRx5AqUDU
bpBU1y08Bm3OIw+uiTMkiPkCQRMDgtiHKlPxuiKqlsNY0KqYgwWlWcbU/OTvHly8
In+CnbEBqLejKyEIz3nEQ4YimfvHbAlC/3V+nx2qO45hvTXA5lEIGAbBmiLW0ni8
6eBXGeIjKAw0zYOoXC4OuKIiHlQh7AHJB25i9xJTzknRI9jqwZFGkxgdl33Vrq8W
gTnfgOhMX2dGmcPrgMgtu+aiBwKf+GJv94/2EJwligExnWXQSsQmGCwKl7ysoE1g
iU/MhJT5IYYP/TDqldkahUPSwD2FN4UFtzNfpeDX3H6kxM1R41l+aerdu64UPNji
G98U+cWyyUmbu9ziLyREM/XyWz4UhNAz7lRId3ryeu8GPUm2AoY=
=TiLQ
-----END PGP SIGNATURE-----
Merge tag 'xfs-4.20-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs fixes from Darrick Wong:
"Dave and I have continued our work fixing corruption problems that can
be found when running long-term burn-in exercisers on xfs. Here are
some patches fixing most of the problems, but there will likely be
more. :/
- Numerous corruption fixes for copy on write
- Numerous corruption fixes for blocksize < pagesize writes
- Don't miscalculate AG reservations for small final AGs
- Fix page cache truncation to work properly for reflink and extent
shifting
- Fix use-after-free when retrying failed inode/dquot buffer logging
- Fix corruptions seen when using copy_file_range in directio mode"
* tag 'xfs-4.20-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
iomap: readpages doesn't zero page tail beyond EOF
vfs: vfs_dedupe_file_range() doesn't return EOPNOTSUPP
iomap: dio data corruption and spurious errors when pipes fill
iomap: sub-block dio needs to zeroout beyond EOF
iomap: FUA is wrong for DIO O_DSYNC writes into unwritten extents
xfs: delalloc -> unwritten COW fork allocation can go wrong
xfs: flush removing page cache in xfs_reflink_remap_prep
xfs: extent shifting doesn't fully invalidate page cache
xfs: finobt AG reserves don't consider last AG can be a runt
xfs: fix transient reference count error in xfs_buf_resubmit_failed_buffers
xfs: uncached buffer tracing needs to print bno
xfs: make xfs_file_remap_range() static
xfs: fix shared extent data corruption due to missing cow reservation
- Fix tasks freezer deadlock in de_thread() that occurs if one
of its sub-threads has been frozen already (Chanho Min).
- Avoid registering a platform device by the ti-cpufreq driver
on platforms that cannot use it (Dave Gerlach).
- Fix a mistake in the ti-opp-supply operating performance points
(OPP) driver that caused an incorrect reference voltage to be
used and make it adjust the minimum voltage dynamically to avoid
hangs or crashes in some cases (Keerthy).
- Fix issues related to compiler flags in the cpupower utility
and correct a linking problem in it by renaming a file with
a duplicate name (Jiri Olsa, Konstantin Khlebnikov).
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJb99ANAAoJEILEb/54YlRxUk4QAIkpXmLdEm0zvQXJZKJxuxu/
8eiglPcoCAAVgR3uRjezTyQeFCDjqGIqmpiipFOkxgAaeb+aiFta2B88B4oZ3OAf
G5yJG4CMs5+Tp/oH2+McX4PMo2WfoEIDoOGVOdtlk4iGAIh9tT5KvNfCUyxHLP7c
cFjdeUQuc8PRPzutESxYHB23FmECCEprVmEFVckQ7vSCnWX6dm1mFteCiNZADH8/
YNgRWRYtyWXlPRNCPJuCvYmVHLgm0Tw6CVhG4ttXT9wIkYyiLK9Flx9X6gsBWdoS
ej1jbJM7R/EAUzHvCjUCSfMKDNWvQySR3gC2m0vG5Goext2UXWieSHaI6BfqjPP2
wTBX9lAKu3iIGISoorjJZAMe4v26Tpbs0v5ApsOGr50WNZKZtRkBwaJ0EFT2hEa8
MsDtUjr+d7CIQ+ZDHmAbOzghpDlSuhGypfkD7IPtA/AY1dy1wJ3BUYya8ucqSdr3
iKkQcndN3ASR1+Fyb3c1cflh9fWe84aeSmYPihcX2m+/D43keYXbj2ANdCH1dF3E
cwAXw9+VteUQ1tihqgJapFogK3VphMbRErPSGXryuyiMxr38g7tgHAd2rId0JG4g
QftqpZa19aEed8tJNaQWWjBaBFgYeAT9BQqcP6CjFOO9klPH//NSiqzQUFMdzsEc
Qql1V/aoOIsflb7gJH3/
=JQxr
-----END PGP SIGNATURE-----
Merge tag 'pm-4.20-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fixes from Rafael Wysocki:
"These fix two issues in the Operating Performance Points (OPP)
framework, one cpufreq driver issue, one problem related to the tasks
freezer and a few build-related issues in the cpupower utility.
Specifics:
- Fix tasks freezer deadlock in de_thread() that occurs if one of its
sub-threads has been frozen already (Chanho Min).
- Avoid registering a platform device by the ti-cpufreq driver on
platforms that cannot use it (Dave Gerlach).
- Fix a mistake in the ti-opp-supply operating performance points
(OPP) driver that caused an incorrect reference voltage to be used
and make it adjust the minimum voltage dynamically to avoid hangs
or crashes in some cases (Keerthy).
- Fix issues related to compiler flags in the cpupower utility and
correct a linking problem in it by renaming a file with a duplicate
name (Jiri Olsa, Konstantin Khlebnikov)"
* tag 'pm-4.20-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
exec: make de_thread() freezable
cpufreq: ti-cpufreq: Only register platform_device when supported
opp: ti-opp-supply: Correct the supply in _get_optimal_vdd_voltage call
opp: ti-opp-supply: Dynamically update u_volt_min
tools cpupower: Override CFLAGS assignments
tools cpupower debug: Allow to use outside build flags
tools/power/cpupower: fix compilation with STATIC=true
The function dentry_connected calls dput(dentry) to drop the previously
acquired reference to dentry. In this case, dentry can be released.
After that, IS_ROOT(dentry) checks the condition
(dentry == dentry->d_parent), which may result in a use-after-free bug.
This patch directly compares dentry with its parent obtained before
dropping the reference.
Fixes: a056cc8934c("exportfs: stop retrying once we race with
rename/remove")
Signed-off-by: Pan Bian <bianpan2016@163.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The function relocate_block_group calls btrfs_end_transaction to release
trans when update_backref_cache returns 1, and then continues the loop
body. If btrfs_block_rsv_refill fails this time, it will jump out the
loop and the freed trans will be accessed. This may result in a
use-after-free bug. The patch assigns NULL to trans after trans is
released so that it will not be accessed.
Fixes: 0647bf564f ("Btrfs: improve forever loop when doing balance relocation")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Pan Bian <bianpan2016@163.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
rfc8435 says:
For tight coupling, ffds_stateid provides the stateid to be used by
the client to access the file.
However current implementation replaces per-mirror provided stateid with
by open or lock stateid.
Ensure that per-mirror stateid is used by ff_layout_write_prepare_v4 and
nfs4_ff_layout_prepare_ds.
Signed-off-by: Tigran Mkrtchyan <tigran.mkrtchyan@desy.de>
Signed-off-by: Rick Macklem <rmacklem@uoguelph.ca>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Bruce pointed out that we shouldn't allocate memory while holding
a lock in the nfs4_callback_offload() and handle_async_copy()
that deal with a racing CB_OFFLOAD and reply to COPY case.
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
We have a race between enabling quotas end subvolume creation that cause
subvolume creation to fail with -EINVAL, and the following diagram shows
how it happens:
CPU 0 CPU 1
btrfs_ioctl()
btrfs_ioctl_quota_ctl()
btrfs_quota_enable()
mutex_lock(fs_info->qgroup_ioctl_lock)
btrfs_ioctl()
create_subvol()
btrfs_qgroup_inherit()
-> save fs_info->quota_root
into quota_root
-> stores a NULL value
-> tries to lock the mutex
qgroup_ioctl_lock
-> blocks waiting for
the task at CPU0
-> sets BTRFS_FS_QUOTA_ENABLED in fs_info
-> sets quota_root in fs_info->quota_root
(non-NULL value)
mutex_unlock(fs_info->qgroup_ioctl_lock)
-> checks quota enabled
flag is set
-> returns -EINVAL because
fs_info->quota_root was
NULL before it acquired
the mutex
qgroup_ioctl_lock
-> ioctl returns -EINVAL
Returning -EINVAL to user space will be confusing if all the arguments
passed to the subvolume creation ioctl were valid.
Fix it by grabbing the value from fs_info->quota_root after acquiring
the mutex.
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
make_bad_inode() sets inode->i_mode to S_IFREG if I/O error is detected
in fuse_do_getattr()/fuse_do_setattr(). If the inode is not a regular
file, write_files and queued_writes in fuse_inode are not initialized
and have NULL or invalid pointers written by other members in a union.
So, list_empty() returns false in fuse_destroy_inode(). Add
is_bad_inode() to check if make_bad_inode() was called.
Reported-by: syzbot+b9c89b84423073226299@syzkaller.appspotmail.com
Fixes: ab2257e994 ("fuse: reduce size of struct fuse_inode")
Signed-off-by: Myungho Jung <mhjungk@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
When we read the EOF page of the file via readpages, we need
to zero the region beyond EOF that we either do not read or
should not contain data so that mmap does not expose stale data to
user applications.
However, iomap_adjust_read_range() fails to detect EOF correctly,
and so fsx on 1k block size filesystems fails very quickly with
mapreads exposing data beyond EOF. There are two problems here.
Firstly, when calculating the end block of the EOF byte, we have
to round the size by one to avoid a block aligned EOF from reporting
a block too large. i.e. a size of 1024 bytes is 1 block, which in
index terms is block 0. Therefore we have to calculate the end block
from (isize - 1), not isize.
The second bug is determining if the current page spans EOF, and so
whether we need split it into two half, one for the IO, and the
other for zeroing. Unfortunately, the code that checks whether
we should split the block doesn't actually check if we span EOF, it
just checks if the read spans the /offset in the page/ that EOF
sits on. So it splits every read into two if EOF is not page
aligned, regardless of whether we are reading the EOF block or not.
Hence we need to restrict the "does the read span EOF" check to
just the page that spans EOF, not every page we read.
This patch results in correct EOF detection through readpages:
xfs_vm_readpages: dev 259:0 ino 0x43 nr_pages 24
xfs_iomap_found: dev 259:0 ino 0x43 size 0x66c00 offset 0x4f000 count 98304 type hole startoff 0x13c startblock 1368 blockcount 0x4
iomap_readpage_actor: orig pos 323584 pos 323584, length 4096, poff 0 plen 4096, isize 420864
xfs_iomap_found: dev 259:0 ino 0x43 size 0x66c00 offset 0x50000 count 94208 type hole startoff 0x140 startblock 1497 blockcount 0x5c
iomap_readpage_actor: orig pos 327680 pos 327680, length 94208, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 331776 pos 331776, length 90112, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 335872 pos 335872, length 86016, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 339968 pos 339968, length 81920, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 344064 pos 344064, length 77824, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 348160 pos 348160, length 73728, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 352256 pos 352256, length 69632, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 356352 pos 356352, length 65536, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 360448 pos 360448, length 61440, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 364544 pos 364544, length 57344, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 368640 pos 368640, length 53248, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 372736 pos 372736, length 49152, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 376832 pos 376832, length 45056, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 380928 pos 380928, length 40960, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 385024 pos 385024, length 36864, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 389120 pos 389120, length 32768, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 393216 pos 393216, length 28672, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 397312 pos 397312, length 24576, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 401408 pos 401408, length 20480, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 405504 pos 405504, length 16384, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 409600 pos 409600, length 12288, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 413696 pos 413696, length 8192, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 417792 pos 417792, length 4096, poff 0 plen 3072, isize 420864
iomap_readpage_actor: orig pos 420864 pos 420864, length 1024, poff 3072 plen 1024, isize 420864
As you can see, it now does full page reads until the last one which
is split correctly at the block aligned EOF, reading 3072 bytes and
zeroing the last 1024 bytes. The original version of the patch got
this right, but it got another case wrong.
The EOF detection crossing really needs to the the original length
as plen, while it starts at the end of the block, will be shortened
as up-to-date blocks are found on the page. This means "orig_pos +
plen" no longer points to the end of the page, and so will not
correctly detect EOF crossing. Hence we have to use the length
passed in to detect this partial page case:
xfs_filemap_fault: dev 259:1 ino 0x43 write_fault 0
xfs_vm_readpage: dev 259:1 ino 0x43 nr_pages 1
xfs_iomap_found: dev 259:1 ino 0x43 size 0x2cc00 offset 0x2c000 count 4096 type hole startoff 0xb0 startblock 282 blockcount 0x4
iomap_readpage_actor: orig pos 180224 pos 181248, length 4096, poff 1024 plen 2048, isize 183296
xfs_iomap_found: dev 259:1 ino 0x43 size 0x2cc00 offset 0x2cc00 count 1024 type hole startoff 0xb3 startblock 285 blockcount 0x1
iomap_readpage_actor: orig pos 183296 pos 183296, length 1024, poff 3072 plen 1024, isize 183296
Heere we see a trace where the first block on the EOF page is up to
date, hence poff = 1024 bytes. The offset into the page of EOF is
3072, so the range we want to read is 1024 - 3071, and the range we
want to zero is 3072 - 4095. You can see this is split correctly
now.
This fixes the stale data beyond EOF problem that fsx quickly
uncovers on 1k block size filesystems.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
It returns EINVAL when the operation is not supported by the
filesystem. Fix it to return EOPNOTSUPP to be consistent with
the man page and clone_file_range().
Clean up the inconsistent error return handling while I'm there.
(I know, lipstick on a pig, but every little bit helps...)
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
When doing direct IO to a pipe for do_splice_direct(), then pipe is
trivial to fill up and overflow as it can only hold 16 pages. At
this point bio_iov_iter_get_pages() then returns -EFAULT, and we
abort the IO submission process. Unfortunately, iomap_dio_rw()
propagates the error back up the stack.
The error is converted from the EFAULT to EAGAIN in
generic_file_splice_read() to tell the splice layers that the pipe
is full. do_splice_direct() completely fails to handle EAGAIN errors
(it aborts on error) and returns EAGAIN to the caller.
copy_file_write() then completely fails to handle EAGAIN as well,
and so returns EAGAIN to userspace, having failed to copy the data
it was asked to.
Avoid this whole steaming pile of fail by having iomap_dio_rw()
silently swallow EFAULT errors and so do short reads.
To make matters worse, iomap_dio_actor() has a stale data exposure
bug bio_iov_iter_get_pages() fails - it does not zero the tail block
that it may have been left uncovered by partial IO. Fix the error
handling case to drop to the sub-block zeroing rather than
immmediately returning the -EFAULT error.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
If we are doing sub-block dio that extends EOF, we need to zero
the unused tail of the block to initialise the data in it it. If we
do not zero the tail of the block, then an immediate mmap read of
the EOF block will expose stale data beyond EOF to userspace. Found
with fsx running sub-block DIO sizes vs MAPREAD/MAPWRITE operations.
Fix this by detecting if the end of the DIO write is beyond EOF
and zeroing the tail if necessary.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
When we write into an unwritten extent via direct IO, we dirty
metadata on IO completion to convert the unwritten extent to
written. However, when we do the FUA optimisation checks, the inode
may be clean and so we issue a FUA write into the unwritten extent.
This means we then bypass the generic_write_sync() call after
unwritten extent conversion has ben done and we don't force the
modified metadata to stable storage.
This violates O_DSYNC semantics. The window of exposure is a single
IO, as the next DIO write will see the inode has dirty metadata and
hence will not use the FUA optimisation. Calling
generic_write_sync() after completion of the second IO will also
sync the first write and it's metadata.
Fix this by avoiding the FUA optimisation when writing to unwritten
extents.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Long saga. There have been days spent following this through dead end
after dead end in multi-GB event traces. This morning, after writing
a trace-cmd wrapper that enabled me to be more selective about XFS
trace points, I discovered that I could get just enough essential
tracepoints enabled that there was a 50:50 chance the fsx config
would fail at ~115k ops. If it didn't fail at op 115547, I stopped
fsx at op 115548 anyway.
That gave me two traces - one where the problem manifested, and one
where it didn't. After refining the traces to have the necessary
information, I found that in the failing case there was a real
extent in the COW fork compared to an unwritten extent in the
working case.
Walking back through the two traces to the point where the CWO fork
extents actually diverged, I found that the bad case had an extra
unwritten extent in it. This is likely because the bug it led me to
had triggered multiple times in those 115k ops, leaving stray
COW extents around. What I saw was a COW delalloc conversion to an
unwritten extent (as they should always be through
xfs_iomap_write_allocate()) resulted in a /written extent/:
xfs_writepage: dev 259:0 ino 0x83 pgoff 0x17000 size 0x79a00 offset 0 length 0
xfs_iext_remove: dev 259:0 ino 0x83 state RC|LF|RF|COW cur 0xffff888247b899c0/2 offset 32 block 152 count 20 flag 1 caller xfs_bmap_add_extent_delay_real
xfs_bmap_pre_update: dev 259:0 ino 0x83 state RC|LF|RF|COW cur 0xffff888247b899c0/1 offset 1 block 4503599627239429 count 31 flag 0 caller xfs_bmap_add_extent_delay_real
xfs_bmap_post_update: dev 259:0 ino 0x83 state RC|LF|RF|COW cur 0xffff888247b899c0/1 offset 1 block 121 count 51 flag 0 caller xfs_bmap_add_ex
Basically, Cow fork before:
0 1 32 52
+H+DDDDDDDDDDDD+UUUUUUUUUUU+
PREV RIGHT
COW delalloc conversion allocates:
1 32
+uuuuuuuuuuuu+
NEW
And the result according to the xfs_bmap_post_update trace was:
0 1 32 52
+H+wwwwwwwwwwwwwwwwwwwwwwww+
PREV
Which is clearly wrong - it should be a merged unwritten extent,
not an unwritten extent.
That lead me to look at the LEFT_FILLING|RIGHT_FILLING|RIGHT_CONTIG
case in xfs_bmap_add_extent_delay_real(), and sure enough, there's
the bug.
It takes the old delalloc extent (PREV) and adds the length of the
RIGHT extent to it, takes the start block from NEW, removes the
RIGHT extent and then updates PREV with the new extent.
What it fails to do is update PREV.br_state. For delalloc, this is
always XFS_EXT_NORM, while in this case we are converting the
delayed allocation to unwritten, so it needs to be updated to
XFS_EXT_UNWRITTEN. This LF|RF|RC case does not do this, and so
the resultant extent is always written.
And that's the bug I've been chasing for a week - a bmap btree bug,
not a reflink/dedupe/copy_file_range bug, but a BMBT bug introduced
with the recent in core extent tree scalability enhancements.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
On a sub-page block size filesystem, fsx is failing with a data
corruption after a series of operations involving copying a file
with the destination offset beyond EOF of the destination of the file:
8093(157 mod 256): TRUNCATE DOWN from 0x7a120 to 0x50000 ******WWWW
8094(158 mod 256): INSERT 0x25000 thru 0x25fff (0x1000 bytes)
8095(159 mod 256): COPY 0x18000 thru 0x1afff (0x3000 bytes) to 0x2f400
8096(160 mod 256): WRITE 0x5da00 thru 0x651ff (0x7800 bytes) HOLE
8097(161 mod 256): COPY 0x2000 thru 0x5fff (0x4000 bytes) to 0x6fc00
The second copy here is beyond EOF, and it is to sub-page (4k) but
block aligned (1k) offset. The clone runs the EOF zeroing, landing
in a pre-existing post-eof delalloc extent. This zeroes the post-eof
extents in the page cache just fine, dirtying the pages correctly.
The problem is that xfs_reflink_remap_prep() now truncates the page
cache over the range that it is copying it to, and rounds that down
to cover the entire start page. This removes the dirty page over the
delalloc extent from the page cache without having written it back.
Hence later, when the page cache is flushed, the page at offset
0x6f000 has not been written back and hence exposes stale data,
which fsx trips over less than 10 operations later.
Fix this by changing xfs_reflink_remap_prep() to use
xfs_flush_unmap_range().
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
When doing an incremental send, due to the need of delaying directory move
(rename) operations we can end up in infinite loop at
apply_children_dir_moves().
An example scenario that triggers this problem is described below, where
directory names correspond to the numbers of their respective inodes.
Parent snapshot:
.
|--- 261/
|--- 271/
|--- 266/
|--- 259/
|--- 260/
| |--- 267
|
|--- 264/
| |--- 258/
| |--- 257/
|
|--- 265/
|--- 268/
|--- 269/
| |--- 262/
|
|--- 270/
|--- 272/
| |--- 263/
| |--- 275/
|
|--- 274/
|--- 273/
Send snapshot:
.
|-- 275/
|-- 274/
|-- 273/
|-- 262/
|-- 269/
|-- 258/
|-- 271/
|-- 268/
|-- 267/
|-- 270/
|-- 259/
| |-- 265/
|
|-- 272/
|-- 257/
|-- 260/
|-- 264/
|-- 263/
|-- 261/
|-- 266/
When processing inode 257 we delay its move (rename) operation because its
new parent in the send snapshot, inode 272, was not yet processed. Then
when processing inode 272, we delay the move operation for that inode
because inode 274 is its ancestor in the send snapshot. Finally we delay
the move operation for inode 274 when processing it because inode 275 is
its new parent in the send snapshot and was not yet moved.
When finishing processing inode 275, we start to do the move operations
that were previously delayed (at apply_children_dir_moves()), resulting in
the following iterations:
1) We issue the move operation for inode 274;
2) Because inode 262 depended on the move operation of inode 274 (it was
delayed because 274 is its ancestor in the send snapshot), we issue the
move operation for inode 262;
3) We issue the move operation for inode 272, because it was delayed by
inode 274 too (ancestor of 272 in the send snapshot);
4) We issue the move operation for inode 269 (it was delayed by 262);
5) We issue the move operation for inode 257 (it was delayed by 272);
6) We issue the move operation for inode 260 (it was delayed by 272);
7) We issue the move operation for inode 258 (it was delayed by 269);
8) We issue the move operation for inode 264 (it was delayed by 257);
9) We issue the move operation for inode 271 (it was delayed by 258);
10) We issue the move operation for inode 263 (it was delayed by 264);
11) We issue the move operation for inode 268 (it was delayed by 271);
12) We verify if we can issue the move operation for inode 270 (it was
delayed by 271). We detect a path loop in the current state, because
inode 267 needs to be moved first before we can issue the move
operation for inode 270. So we delay again the move operation for
inode 270, this time we will attempt to do it after inode 267 is
moved;
13) We issue the move operation for inode 261 (it was delayed by 263);
14) We verify if we can issue the move operation for inode 266 (it was
delayed by 263). We detect a path loop in the current state, because
inode 270 needs to be moved first before we can issue the move
operation for inode 266. So we delay again the move operation for
inode 266, this time we will attempt to do it after inode 270 is
moved (its move operation was delayed in step 12);
15) We issue the move operation for inode 267 (it was delayed by 268);
16) We verify if we can issue the move operation for inode 266 (it was
delayed by 270). We detect a path loop in the current state, because
inode 270 needs to be moved first before we can issue the move
operation for inode 266. So we delay again the move operation for
inode 266, this time we will attempt to do it after inode 270 is
moved (its move operation was delayed in step 12). So here we added
again the same delayed move operation that we added in step 14;
17) We attempt again to see if we can issue the move operation for inode
266, and as in step 16, we realize we can not due to a path loop in
the current state due to a dependency on inode 270. Again we delay
inode's 266 rename to happen after inode's 270 move operation, adding
the same dependency to the empty stack that we did in steps 14 and 16.
The next iteration will pick the same move dependency on the stack
(the only entry) and realize again there is still a path loop and then
again the same dependency to the stack, over and over, resulting in
an infinite loop.
So fix this by preventing adding the same move dependency entries to the
stack by removing each pending move record from the red black tree of
pending moves. This way the next call to get_pending_dir_moves() will
not return anything for the current parent inode.
A test case for fstests, with this reproducer, follows soon.
Signed-off-by: Robbie Ko <robbieko@synology.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
[Wrote changelog with example and more clear explanation]
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When decoding a lower file handle, we first call ovl_check_origin_fh()
with connected=false to get any real lower dentry for overlay inode
cache lookup.
If the real dentry is a disconnected dir dentry, ovl_check_origin_fh()
is called again with connected=true to get a connected real dentry
and find the lower layer the real dentry belongs to.
If the first call returned a connected real dentry, we use it to
lookup an overlay connected dentry, but the first ovl_check_origin_fh()
call with connected=false did not check that the found dentry is under
the root of the layer (see ovl_acceptable()), it only checked that
the found dentry super block matches the uuid of the lower file handle.
In case there are multiple lower layers on the same fs and the found
dentry is not from the top most lower layer, using the layer index
returned from the first ovl_check_origin_fh() is wrong and we end
up failing to decode the file handle.
Fix this by always calling ovl_check_origin_fh() with connected=true
if we got a directory dentry in the first call.
Fixes: 8b58924ad5 ("ovl: lookup in inode cache first when decoding...")
Cc: <stable@vger.kernel.org> # v4.17
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
The extent shifting code uses a flush and invalidate mechainsm prior
to shifting extents around. This is similar to what
xfs_free_file_space() does, but it doesn't take into account things
like page cache vs block size differences, and it will fail if there
is a page that it currently busy.
xfs_flush_unmap_range() handles all of these cases, so just convert
xfs_prepare_shift() to us that mechanism rather than having it's own
special sauce.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The last AG may be very small comapred to all other AGs, and hence
AG reservations based on the superblock AG size may actually consume
more space than the AG actually has. This results on assert failures
like:
XFS: Assertion failed: xfs_perag_resv(pag, XFS_AG_RESV_METADATA)->ar_reserved + xfs_perag_resv(pag, XFS_AG_RESV_RMAPBT)->ar_reserved <= pag->pagf_freeblks + pag->pagf_flcount, file: fs/xfs/libxfs/xfs_ag_resv.c, line: 319
[ 48.932891] xfs_ag_resv_init+0x1bd/0x1d0
[ 48.933853] xfs_fs_reserve_ag_blocks+0x37/0xb0
[ 48.934939] xfs_mountfs+0x5b3/0x920
[ 48.935804] xfs_fs_fill_super+0x462/0x640
[ 48.936784] ? xfs_test_remount_options+0x60/0x60
[ 48.937908] mount_bdev+0x178/0x1b0
[ 48.938751] mount_fs+0x36/0x170
[ 48.939533] vfs_kern_mount.part.43+0x54/0x130
[ 48.940596] do_mount+0x20e/0xcb0
[ 48.941396] ? memdup_user+0x3e/0x70
[ 48.942249] ksys_mount+0xba/0xd0
[ 48.943046] __x64_sys_mount+0x21/0x30
[ 48.943953] do_syscall_64+0x54/0x170
[ 48.944835] entry_SYSCALL_64_after_hwframe+0x49/0xbe
Hence we need to ensure the finobt per-ag space reservations take
into account the size of the last AG rather than treat it like all
the other full size AGs.
Note that both refcountbt and rmapbt already take the size of the AG
into account via reading the AGF length directly.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
When retrying a failed inode or dquot buffer,
xfs_buf_resubmit_failed_buffers() clears all the failed flags from
the inde/dquot log items. In doing so, it also drops all the
reference counts on the buffer that the failed log items hold. This
means it can drop all the active references on the buffer and hence
free the buffer before it queues it for write again.
Putting the buffer on the delwri queue takes a reference to the
buffer (so that it hangs around until it has been written and
completed), but this goes bang if the buffer has already been freed.
Hence we need to add the buffer to the delwri queue before we remove
the failed flags from the log items attached to the buffer to ensure
it always remains referenced during the resubmit process.
Reported-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
'shash' algorithms are always synchronous, so passing CRYPTO_ALG_ASYNC
in the mask to crypto_alloc_shash() has no effect. Many users therefore
already don't pass it, but some still do. This inconsistency can cause
confusion, especially since the way the 'mask' argument works is
somewhat counterintuitive.
Thus, just remove the unneeded CRYPTO_ALG_ASYNC flags.
This patch shouldn't change any actual behavior.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
For cases when the application does not specify aio_reqprio for an aio,
fallback to use get_current_ioprio() to obtain the task I/O priority
last set using ioprio_set() rather than the hardcoded IOPRIO_CLASS_NONE
value.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Adam Manzanares <adam.manzanares@wdc.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Fix a deadlock whereby the NFSv4 state manager can get stuck in the
delegation return code, waiting for a layout return to complete in
another thread. If the server reboots before that other thread
completes, then we need to be able to start a second state
manager thread in order to perform recovery.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
xfs_file_remap_range() is only used in fs/xfs/xfs_file.c, so make it
static.
This addresses a gcc warning when -Wmissing-prototypes is enabled.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Page writeback indirectly handles shared extents via the existence
of overlapping COW fork blocks. If COW fork blocks exist, writeback
always performs the associated copy-on-write regardless if the
underlying blocks are actually shared. If the blocks are shared,
then overlapping COW fork blocks must always exist.
fstests shared/010 reproduces a case where a buffered write occurs
over a shared block without performing the requisite COW fork
reservation. This ultimately causes writeback to the shared extent
and data corruption that is detected across md5 checks of the
filesystem across a mount cycle.
The problem occurs when a buffered write lands over a shared extent
that crosses an extent size hint boundary and that also happens to
have a partial COW reservation that doesn't cover the start and end
blocks of the data fork extent.
For example, a buffered write occurs across the file offset (in FSB
units) range of [29, 57]. A shared extent exists at blocks [29, 35]
and COW reservation already exists at blocks [32, 34]. After
accommodating a COW extent size hint of 32 blocks and the existing
reservation at offset 32, xfs_reflink_reserve_cow() allocates 32
blocks of reservation at offset 0 and returns with COW reservation
across the range of [0, 34]. The associated data fork extent is
still [29, 35], however, which isn't fully covered by the COW
reservation.
This leads to a buffered write at file offset 35 over a shared
extent without associated COW reservation. Writeback eventually
kicks in, performs an overwrite of the underlying shared block and
causes the associated data corruption.
Update xfs_reflink_reserve_cow() to accommodate the fact that a
delalloc allocation request may not fully cover the extent in the
data fork. Trim the data fork extent appropriately, just as is done
for shared extent boundaries and/or existing COW reservations that
happen to overlap the start of the data fork extent. This prevents
shared/010 failures due to data corruption on reflink enabled
filesystems.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Pull networking fixes from David Miller:
1) Fix some potentially uninitialized variables and use-after-free in
kvaser_usb can drier, from Jimmy Assarsson.
2) Fix leaks in qed driver, from Denis Bolotin.
3) Socket leak in l2tp, from Xin Long.
4) RSS context allocation fix in bnxt_en from Michael Chan.
5) Fix cxgb4 build errors, from Ganesh Goudar.
6) Route leaks in ipv6 when removing exceptions, from Xin Long.
7) Memory leak in IDR allocation handling of act_pedit, from Davide
Caratti.
8) Use-after-free of bridge vlan stats, from Nikolay Aleksandrov.
9) When MTU is locked, do not force DF bit on ipv4 tunnels. From
Sabrina Dubroca.
10) When NAPI cached skb is reused, we must set it to the proper initial
state which includes skb->pkt_type. From Eric Dumazet.
11) Lockdep and non-linear SKB handling fix in tipc from Jon Maloy.
12) Set RX queue properly in various tuntap receive paths, from Matthew
Cover.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (61 commits)
tuntap: fix multiqueue rx
ipv6: Fix PMTU updates for UDP/raw sockets in presence of VRF
tipc: don't assume linear buffer when reading ancillary data
tipc: fix lockdep warning when reinitilaizing sockets
net-gro: reset skb->pkt_type in napi_reuse_skb()
tc-testing: tdc.py: Guard against lack of returncode in executed command
tc-testing: tdc.py: ignore errors when decoding stdout/stderr
ip_tunnel: don't force DF when MTU is locked
MAINTAINERS: Add entry for CAKE qdisc
net: bridge: fix vlan stats use-after-free on destruction
socket: do a generic_file_splice_read when proto_ops has no splice_read
net: phy: mdio-gpio: Fix working over slow can_sleep GPIOs
Revert "net: phy: mdio-gpio: Fix working over slow can_sleep GPIOs"
net: phy: mdio-gpio: Fix working over slow can_sleep GPIOs
net/sched: act_pedit: fix memory leak when IDR allocation fails
net: lantiq: Fix returned value in case of error in 'xrx200_probe()'
ipv6: fix a dst leak when removing its exception
net: mvneta: Don't advertise 2.5G modes
drivers/net/ethernet/qlogic/qed/qed_rdma.h: fix typo
net/mlx4: Fix UBSAN warning of signed integer overflow
...
For the core poll helper, the task state setting don't need to imply any
atomics, as it's the current task itself that is being modified and
we're not going to sleep.
For IRQ driven, the wakeup path have the necessary barriers to not need
us using the heavy handed version of the task state setting.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Theodore Ts'o reported a v4.19 regression with docker-dropbox:
https://marc.info/?l=linux-fsdevel&m=154070089431116&w=2
"I was rebuilding my dropbox Docker container, and it failed in 4.19
with the following error:
...
dpkg: error: error creating new backup file \
'/var/lib/dpkg/status-old': Invalid cross-device link"
The problem did not reproduce with metacopy feature disabled.
The error was caused by insufficient credentials to set
"trusted.overlay.redirect" xattr on link of a metacopy file.
Reproducer:
echo Y > /sys/module/overlay/parameters/redirect_dir
echo Y > /sys/module/overlay/parameters/metacopy
cd /tmp
mkdir l u w m
chmod 777 l u
touch l/foo
ln l/foo l/link
chmod 666 l/foo
mount -t overlay none -olowerdir=l,upperdir=u,workdir=w m
su fsgqa
ln m/foo m/bar
[ 21.455823] overlayfs: failed to set redirect (-1)
ln: failed to create hard link 'm/bar' => 'm/foo':\
Invalid cross-device link
Reported-by: Theodore Y. Ts'o <tytso@mit.edu>
Reported-by: Maciej Zięba <maciekz82@gmail.com>
Fixes: 4120fe64dc ("ovl: Set redirect on upper inode when it is linked")
Cc: <stable@vger.kernel.org> # v4.19
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
After calling get_unlocked_entry(), you have to call
put_unlocked_entry() to avoid subsequent waiters losing wakeups.
Fixes: c2a7d2a115 ("filesystem-dax: Introduce dax_lock_mapping_entry()")
Cc: stable@vger.kernel.org
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Suspend fails due to the exec family of functions blocking the freezer.
The casue is that de_thread() sleeps in TASK_UNINTERRUPTIBLE waiting for
all sub-threads to die, and we have the deadlock if one of them is frozen.
This also can occur with the schedule() waiting for the group thread leader
to exit if it is frozen.
In our machine, it causes freeze timeout as bellows.
Freezing of tasks failed after 20.010 seconds (1 tasks refusing to freeze, wq_busy=0):
setcpushares-ls D ffffffc00008ed70 0 5817 1483 0x0040000d
Call trace:
[<ffffffc00008ed70>] __switch_to+0x88/0xa0
[<ffffffc000d1c30c>] __schedule+0x1bc/0x720
[<ffffffc000d1ca90>] schedule+0x40/0xa8
[<ffffffc0001cd784>] flush_old_exec+0xdc/0x640
[<ffffffc000220360>] load_elf_binary+0x2a8/0x1090
[<ffffffc0001ccff4>] search_binary_handler+0x9c/0x240
[<ffffffc00021c584>] load_script+0x20c/0x228
[<ffffffc0001ccff4>] search_binary_handler+0x9c/0x240
[<ffffffc0001ce8e0>] do_execveat_common.isra.14+0x4f8/0x6e8
[<ffffffc0001cedd0>] compat_SyS_execve+0x38/0x48
[<ffffffc00008de30>] el0_svc_naked+0x24/0x28
To fix this, make de_thread() freezable. It looks safe and works fine.
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Chanho Min <chanho.min@lge.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Pavel Machek <pavel@ucw.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Commit c26f6c6157 ("udf: Fix conversion of 'dstring' fields to UTF8")
started to be more strict when checking whether converted strings are
properly formatted. Sudip reports that there are DVDs where the volume
identification string is actually too long - UDF reports:
[ 632.309320] UDF-fs: incorrect dstring lengths (32/32)
during mount and fails the mount. This is mostly harmless failure as we
don't need volume identification (and even less volume set
identification) for anything. So just truncate the volume identification
string if it is too long and replace it with 'Invalid' if we just cannot
convert it for other reasons. This keeps slightly incorrect media still
mountable.
CC: stable@vger.kernel.org
Fixes: c26f6c6157 ("udf: Fix conversion of 'dstring' fields to UTF8")
Reported-and-tested-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Fix a static code checker warning:
fs/exportfs/expfs.c:171 reconnect_one() warn: passing zero to 'ERR_PTR'
The error path for lookup_one_len_unlocked failure
should set err to PTR_ERR.
Fixes: bbf7a8a356 ("exportfs: move most of reconnect_path to helper function")
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-----BEGIN PGP SIGNATURE-----
iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAlvx2sAeHHRvcnZhbGRz
QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGycgIAIuxobwt0RRKa0zO
ROS+34JGoC2yU2P9VdEGWdtxS6ANMVQgKPBhWL6s+xR89Kd+V4xSdJLD1pNTxxqP
0DCva0np1/Q4juH+JbU50v/lykoLgteZ0P0LBRGf1y8p3WiLPv45IbnNsMDNYhB2
7a8rOmZYakRY9CPznRDw3X8cJt3sddKgFJHIOGz1OQJVWtCD0KPGcJmQNsbDSagY
Zx6Z5BKSIdjRqaAdN5gDa1Pft3WQo7TpaQGl80lSsgr5LcjmscXA3sClOCy+25Mo
FZLx0PcwP+Efq8RTGzNK51WSOMa6d37hvjDqUAdQBOR0KbyjRyXQwyQVw/MGbPJs
7J3Pzm0=
=56Mt
-----END PGP SIGNATURE-----
Merge tag 'v4.20-rc3' into for-4.21/block
Merge in -rc3 to resolve a few conflicts, but also to get a few
important fixes that have gone into mainline since the block
4.21 branch was forked off (most notably the SCSI queue issue,
which is both a conflict AND needed fix).
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Convert string compares of DT node names to use of_node_name_eq helper
instead. This removes direct access to the node name pointer.
Signed-off-by: Rob Herring <robh@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
In preparation to remove struct device_node.path_component_name, use
full_name instead. kbasename is used so full_name can be used whether it
is the full path or just the node's name and unit-address.
Cc: "David S. Miller" <davem@davemloft.net>
Cc: sparclinux@vger.kernel.org
Signed-off-by: Rob Herring <robh@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The write context should also be freed even when direct IO failed.
Otherwise a memory leak is introduced and entries remain in
oi->ip_unwritten_list causing the following BUG later in unlink path:
ERROR: bug expression: !list_empty(&oi->ip_unwritten_list)
ERROR: Clear inode of 215043, inode has unwritten extents
...
Call Trace:
? __set_current_blocked+0x42/0x68
ocfs2_evict_inode+0x91/0x6a0 [ocfs2]
? bit_waitqueue+0x40/0x33
evict+0xdb/0x1af
iput+0x1a2/0x1f7
do_unlinkat+0x194/0x28f
SyS_unlinkat+0x1b/0x2f
do_syscall_64+0x79/0x1ae
entry_SYSCALL_64_after_hwframe+0x151/0x0
This patch also logs, with frequency limit, direct IO failures.
Link: http://lkml.kernel.org/r/20181102170632.25921-1-wen.gang.wang@oracle.com
Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Changwei Ge <ge.changwei@h3c.com>
Reviewed-by: Joseph Qi <jiangqi903@gmail.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Spock reported that commit 172b06c32b ("mm: slowly shrink slabs with a
relatively small number of objects") leads to a regression on his setup:
periodically the majority of the pagecache is evicted without an obvious
reason, while before the change the amount of free memory was balancing
around the watermark.
The reason behind is that the mentioned above change created some
minimal background pressure on the inode cache. The problem is that if
an inode is considered to be reclaimed, all belonging pagecache page are
stripped, no matter how many of them are there. So, if a huge
multi-gigabyte file is cached in the memory, and the goal is to reclaim
only few slab objects (unused inodes), we still can eventually evict all
gigabytes of the pagecache at once.
The workload described by Spock has few large non-mapped files in the
pagecache, so it's especially noticeable.
To solve the problem let's postpone the reclaim of inodes, which have
more than 1 attached page. Let's wait until the pagecache pages will be
evicted naturally by scanning the corresponding LRU lists, and only then
reclaim the inode structure.
Link: http://lkml.kernel.org/r/20181023164302.20436-1-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Reported-by: Spock <dairinin@gmail.com>
Tested-by: Spock <dairinin@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: <stable@vger.kernel.org> [4.19.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Using xas_load() with a PMD-sized xa_state would work if either a
PMD-sized entry was present or a PTE sized entry was present in the
first 64 entries (of the 512 PTEs in a PMD on x86). If there was no
PTE in the first 64 entries, grab_mapping_entry() would believe there
were no entries present, allocate a PMD-sized entry and overwrite the
PTE in the page cache.
Use xas_find_conflict() instead which turns out to simplify
both get_unlocked_entry() and grab_mapping_entry(). Also remove a
WARN_ON_ONCE from grab_mapping_entry() as it will have already triggered
in get_unlocked_entry().
Fixes: cfc93c6c6c ("dax: Convert dax_insert_pfn_mkwrite to XArray")
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Device DAX PMD pages do not set the PageHead bit for compound pages.
Fix for now by retrieving the PMD bit from the entry, but eventually we
will be passed the page size by the caller.
Reported-by: Dan Williams <dan.j.williams@intel.com>
Fixes: 9f32d22130 ("dax: Convert dax_lock_mapping_entry to XArray")
Signed-off-by: Matthew Wilcox <willy@infradead.org>
If the ioprio capability check fails, we return without putting
the file pointer.
Fixes: d9a08a9e61 ("fs: Add aio iopriority support")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
For the device-dax case, it is possible that the inode can go away
underneath us. The rcu_read_lock() was there to prevent it from
being freed, and not (as I thought) to protect the tree. Bring back
the rcu_read_lock() protection. Also add a little kernel-doc; while
this function is not exported to modules, it is used from outside dax.c
Reported-by: Dan Williams <dan.j.williams@intel.com>
Fixes: 9f32d22130 ("dax: Convert dax_lock_mapping_entry to XArray")
Signed-off-by: Matthew Wilcox <willy@infradead.org>
I wrote the semantics in the commit message, but didn't document it in
the source code. Use a BUG_ON instead (if any code does do this, it's
really buggy; we can't recover and it's worth taking the machine down).
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Skipping some of the revalidation after we sleep can lead to returning
a mapping which has already been freed. Just drop this optimisation.
Reported-by: Dan Williams <dan.j.williams@intel.com>
Fixes: 9f32d22130 ("dax: Convert dax_lock_mapping_entry to XArray")
Signed-off-by: Matthew Wilcox <willy@infradead.org>
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAlvukbUACgkQnJ2qBz9k
QNkJnggAriAyO72fH7nfh2AzvwnkqLi9uE4JQk+ZxQFW9W6a32Y+qIRBsaiau++w
/lOVSGLCCVewzAa4b/lYPdyOS/yG4w6UuEsg5HQPux4JsOccalYudB5ONdeZGd/P
5FPxcsixDQQHP9mhz1gIn2wV7bbytH2DduIHxuKbJfX//xnO3jS/gnJq82rnxb4R
IzqewVjbFiqYQKtyuzIqKqI3yOVqESJhyQ96j6chj5YKNUkMuwU9akhQGR4Zcqen
/cxh+mK+ZzCybgu2Sfgx3noZb5RFYePpcSSpDG5OKJc04IY4m76fyHIi2flBE9AX
HCN60y+VyKFOiNxzsSLXgsE9aj40OA==
=UcJO
-----END PGP SIGNATURE-----
Merge tag 'fsnotify_for_v4.20-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull fsnotify fix from Jan Kara:
"One small fsnotify fix for duplicate events"
* tag 'fsnotify_for_v4.20-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
fanotify: fix handling of events on child sub-directory
Pull bfs2 fixes from Andreas Gruenbacher:
"Fix two bugs leading to leaked buffer head references:
- gfs2: Put bitmap buffers in put_super
- gfs2: Fix iomap buffer head reference counting bug
And one bug leading to significant slow-downs when deleting large
files:
- gfs2: Fix metadata read-ahead during truncate (2)"
* tag 'gfs2-4.20.fixes3' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
gfs2: Fix iomap buffer head reference counting bug
gfs2: Fix metadata read-ahead during truncate (2)
gfs2: Put bitmap buffers in put_super
GFS2 passes the inode buffer head (dibh) from gfs2_iomap_begin to
gfs2_iomap_end in iomap->private. It sets that private pointer in
gfs2_iomap_get. Users of gfs2_iomap_get other than gfs2_iomap_begin
would have to release iomap->private, but this isn't done correctly,
leading to a leak of buffer head references.
To fix this, move the code for setting iomap->private from
gfs2_iomap_get to gfs2_iomap_begin.
Fixes: 64bc06bb32 ("gfs2: iomap buffered write support")
Cc: stable@vger.kernel.org # v4.19+
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Those will go straight to issue inside blk-mq, so don't bother
setting up a block plug for them.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Inherit the iocb IOCB_HIPRI flag, and pass on REQ_HIPRI for
those kinds of requests.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If we're polling for IO on a device that doesn't use interrupts, then
IO completion loop (and wake of task) is done by submitting task itself.
If that is the case, then we don't need to enter the wake_up_process()
function, we can simply mark ourselves as TASK_RUNNING.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iHQEABYIAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCW+3F5wAKCRDh3BK/laaZ
PIFHAPi8jN+wXE24tSEx/oJzCVAXP5tw8IxbiyiEdkvBZeyuAP9nkx0GoUtZodWN
c+JEjDJnyYQhM28VoKauDoWn+dxPDw==
=QvAu
-----END PGP SIGNATURE-----
Merge tag 'fuse-fixes-4.20-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
Pull fuse fixes from Miklos Szeredi:
"A couple of fixes, all bound for -stable (i.e. not regressions in this
cycle)"
* tag 'fuse-fixes-4.20-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
fuse: fix use-after-free in fuse_direct_IO()
fuse: fix possibly missed wake-up after abort
fuse: fix leaked notify reply
The life-checking function, which is used by kAFS to make sure that a call
is still live in the event of a pending signal, only samples the received
packet serial number counter; it doesn't actually provoke a change in the
counter, rather relying on the server to happen to give us a packet in the
time window.
Fix this by adding a function to force a ping to be transmitted.
kAFS then keeps track of whether there's been a stall, and if so, uses the
new function to ping the server, resetting the timeout to allow the reply
to come back.
If there's a stall, a ping and the call is *still* stalled in the same
place after another period, then the call will be aborted.
Fixes: bc5e3a546d ("rxrpc: Use MSG_WAITALL to tell sendmsg() to temporarily ignore signals")
Fixes: f4d15fb6f9 ("rxrpc: Provide functions for allowing cleaner handling of signals")
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Highlights include:
Stable fixes:
- Don't exit the NFSv4 state manager without clearing NFS4CLNT_MANAGER_RUNNING
Bugfixes:
- Fix an Oops when destroying the RPCSEC_GSS credential cache
- Fix an Oops during delegation callbacks
- Ensure that the NFSv4 state manager exits the loop on SIGKILL
- Fix a bogus get/put in generic_key_to_expire()
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJb7Lh0AAoJEA4mA3inWBJc8uAQAIkrGChs3AFuEQ3G3H9RlxDX
WFsPghRGmDwXf2sD+nWjl0r60v0v5fQaUhW/7EPe2kbVTF/rnjieXNeFOw33ZMFk
MDq03nL1/I25DoNK/qg5GZ2NIltZ9oKKbwaN+0LxXKz69X5qIXYnDzYPHDR/PNTg
Go7PvG8rU31Wd67E2pquwC6zZ6rCPf2BtQjZdzouLAEUWXAMHyJmszpFUxhLMJoz
k6dZouphj8fkMse3cfKLnGDqbQ2bE6+Yb0B6Hi0p5nShYgZTaQNZ9KxrEJF7J05i
cxH6IvLEawEMWXYzGEwr1LUDDrpwveuNTt/OroTgOcSsVpZx1DE0sOZkQ4pt/uTe
c5NzZYKjEOb2DWxoGR2GEDkRasKVBkWvR5MegvyDgyAcXkAjN/6CgYXiniNYDxl6
qk7sIqkJfug7fv+VW5YHwORKnvRIEDlFcwy5yZ0ij/Qa0dqUR3aczINGLwS6kcfn
u7M42UR17FUo2zaI9pZhuijwntbtkXMIETWHGRctK7Mum6u37QSVySNCO2A4knBE
jEy+oYPFCIUqH+ESpNp73otrVt1CTexScIJNsEi1naLmOhjQRW7YjUPEH1Xjg0Ss
OGyqIjOf6ToF6ma39/XZI9miJe08k6x8b0aGUdG29Cko9UvjLH86ODEausSRAyFA
OyZFFuHHAau5FGpNvZfj
=AstN
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-4.20-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client bugfixes from Trond Myklebust:
"Highlights include:
Stable fixes:
- Don't exit the NFSv4 state manager without clearing
NFS4CLNT_MANAGER_RUNNING
Bugfixes:
- Fix an Oops when destroying the RPCSEC_GSS credential cache
- Fix an Oops during delegation callbacks
- Ensure that the NFSv4 state manager exits the loop on SIGKILL
- Fix a bogus get/put in generic_key_to_expire()"
* tag 'nfs-for-4.20-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
NFSv4: Fix an Oops during delegation callbacks
SUNRPC: Fix a bogus get/put in generic_key_to_expire()
SUNRPC: Fix a Oops when destroying the RPCSEC_GSS credential cache
NFSv4: Ensure that the state manager exits the loop on SIGKILL
NFSv4: Don't exit the state manager without clearing NFS4CLNT_MANAGER_RUNNING
inotify_show_fdinfo() is defined in fs/notify/fdinfo.c and declared in
fs/notify/fdinfo.h, but the declaration isn't included at the point of
the definition. Include the header to enforce that the definition
matches the declaration.
This addresses a gcc warning when -Wmissing-prototypes is enabled.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Reusable parameter of mb_cache_entry_create() is bool type,
so it's better to set true instead of 1.
Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Signed-off-by: Jan Kara <jack@suse.cz>
According to comment in dlm_user_request() ua should be freed
in dlm_free_lkb() after successful attach to lkb.
However ua is attached to lkb not in set_lock_args() but later,
inside request_lock().
Fixes 597d0cae0f ("[DLM] dlm: user locks")
Cc: stable@kernel.org # 2.6.19
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: David Teigland <teigland@redhat.com>
If allocation fails on last elements of array need to free already
allocated elements.
v2: just move existing out_rsbtbl label to right place
Fixes 789924ba635f ("dlm: fix race between remove and lookup")
Cc: stable@kernel.org # 3.6
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: David Teigland <teigland@redhat.com>
effort to hit, which might explain why they weren't found sooner.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJb6zwdAAoJECebzXlCjuG+SLMP/AlpI+vPV7DdCLRWGCY1ZMjk
5pxIS+74mD2EopBYgZY58L1fxWgv2bLOiAs/baAlNpkjTNX3wlxXGTu9IzVdPOn7
3n+W2Rb+mXEFaag7mP8RFpOvt7Yb3p4DObGpg7TKWJZ6r/8xcxQWQO+e0iiS5+XK
EOiaFcGmYlOC1JtrRIL2fr16trXUhT1gz7qAZgKBzebbEdn4FfdsdwHm7nUyRB3I
LhCMV35RfzOBC2C/kQzlHaHYlo0dx5lKMtVzvtgMdpgXr4QXE/7Ke/ANQ7oGfhhO
9uX0Uf18HmeGRejK9QoMha7VWuwh5pyHBq0ppMpGL2jb11BD/l9iXgS+vTxpA2B0
YIiSOnaiDFsEk6hMsFqueVIdaTrarcjg/S2mh2QDjtkXKS3L0W6/7v97JJHu9J4l
6zxiT6Crq2p8pMZ5gY3RI1AYllW/K+TRoccLhO+q19g3q1HWxP6DyeFBgNF66/ha
NtmQP+94IkaCS70zirpEu/OeUMviQgX2x77OReyibHLA4+R+hNHwtR67BLl+xG0G
jmKHfqqX7offFaHmsoD8kK3gpKtit0/py9Hp7gXQg4vU5iL512gI83ICEOEkZMXn
Ppsrl1HyoO/ohY/USpMvRqYHjM1ZGew19ZzD7SId6vUVaYjQIEsjQVnycK3h+gSb
otk5pc3bWPCwa8csOWPs
=+1Ub
-----END PGP SIGNATURE-----
Merge tag 'nfsd-4.20-1' of git://linux-nfs.org/~bfields/linux
Pull nfsd fixes from Bruce Fields:
"Three nfsd bugfixes.
None are new bugs, but they all take a little effort to hit, which
might explain why they weren't found sooner"
* tag 'nfsd-4.20-1' of git://linux-nfs.org/~bfields/linux:
SUNRPC: drop pointless static qualifier in xdr_get_next_encode_buffer()
nfsd: COPY and CLONE operations require the saved filehandle to be set
sunrpc: correct the computation for page_ptr when truncating
Pull namespace fix from Eric Biederman:
"Benjamin Coddington noticed an unkillable busy loop in the kernel that
anyone who is sufficiently motivated can trigger. This bug did not
exist in earlier kernels making this bug a regression.
I have tested the change personally and confirmed that the bug exists
and that the fix works. This fix has been picked up by linux-next and
hopefully the automated testing bots and no problems have been
reported from those sources.
Ordinarily I would let something like this sit a little longer but I
am going to be away at Linux Plumbers the rest of this week and I am
afraid if I don't send the pull request now this fix will get lost"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
mnt: fix __detach_mounts infinite loop
We were using the path name received from user space without checking that
it is null terminated. While btrfs-progs is well behaved and does proper
validation and null termination, someone could call the ioctl and pass
a non-null terminated patch, leading to buffer overrun problems in the
kernel. The ioctl is protected by CAP_SYS_ADMIN.
So just set the last byte of the path to a null character, similar to what
we do in other ioctls (add/remove/resize device, snapshot creation, etc).
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
ext2_xattr_destroy_cache() can handle NULL pointer correctly,
so there is no need to check NULL pointer before calling
ext2_xattr_destroy_cache().
Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Signed-off-by: Jan Kara <jack@suse.cz>
If the server sends a CB_GETATTR or a CB_RECALL while the filesystem is
being unmounted, then we can Oops when releasing the inode in
nfs4_callback_getattr() and nfs4_callback_recall().
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Technically dlm_config_nodes() could return error and keep nodes
uninitialized. After that on the fail path of we'll call kfree()
for that uninitialized value.
The patch is simple - we should just initialize nodes with NULL.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David Teigland <teigland@redhat.com>
A new event mask FAN_OPEN_EXEC_PERM has been defined. This allows users
to receive events and grant access to files that are intending to be
opened for execution. Events of FAN_OPEN_EXEC_PERM type will be
generated when a file has been opened by using either execve(),
execveat() or uselib() system calls.
This acts in the same manner as previous permission event mask, meaning
that an access response is required from the user application in order
to permit any further operations on the file.
Signed-off-by: Matthew Bobrowski <mbobrowski@mbobrowski.org>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
A new event mask FAN_OPEN_EXEC has been defined so that users have the
ability to receive events specifically when a file has been opened with
the intent to be executed. Events of FAN_OPEN_EXEC type will be
generated when a file has been opened using either execve(), execveat()
or uselib() system calls.
The feature is implemented within fsnotify_open() by generating the
FAN_OPEN_EXEC event type if __FMODE_EXEC is set within file->f_flags.
Signed-off-by: Matthew Bobrowski <mbobrowski@mbobrowski.org>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Modify fanotify_should_send_event() so that it now returns a mask for
an event that contains ONLY flags for the event types that have been
specifically requested by the user. Flags that may have been included
within the event mask, but have not been explicitly requested by the
user will not be present in the returned value.
As an example, given the situation where a user requests events of type
FAN_OPEN. Traditionally, the event mask returned within an event that
occurred on a filesystem object that has been marked for monitoring and is
opened, will only ever have the FAN_OPEN bit set. With the introduction of
the new flags like FAN_OPEN_EXEC, and perhaps any other future event
flags, there is a possibility of the returned event mask containing more
than a single bit set, despite having only requested the single event type.
Prior to these modifications performed to fanotify_should_send_event(), a
user would have received a bundled event mask containing flags FAN_OPEN
and FAN_OPEN_EXEC in the instance that a file was opened for execution via
execve(), for example. This means that a user would receive event types
in the returned event mask that have not been requested. This runs the
possibility of breaking existing systems and causing other unforeseen
issues.
To mitigate this possibility, fanotify_should_send_event() has been
modified to return the event mask containing ONLY event types explicitly
requested by the user. This means that we will NOT report events that the
user did no set a mask for, and we will NOT report events that the user
has set an ignore mask for.
The function name fanotify_should_send_event() has also been updated so
that it's more relevant to what it has been designed to do.
Signed-off-by: Matthew Bobrowski <mbobrowski@mbobrowski.org>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
After the simplification of the fast fsync patch done recently by commit
b5e6c3e170 ("btrfs: always wait on ordered extents at fsync time") and
commit e7175a6927 ("btrfs: remove the wait ordered logic in the
log_one_extent path"), we got a very short time window where we can get
extents logged without writeback completing first or extents logged
without logging the respective data checksums. Both issues can only happen
when doing a non-full (fast) fsync.
As soon as we enter btrfs_sync_file() we trigger writeback, then lock the
inode and then wait for the writeback to complete before starting to log
the inode. However before we acquire the inode's lock and after we started
writeback, it's possible that more writes happened and dirtied more pages.
If that happened and those pages get writeback triggered while we are
logging the inode (for example, the VM subsystem triggering it due to
memory pressure, or another concurrent fsync), we end up seeing the
respective extent maps in the inode's list of modified extents and will
log matching file extent items without waiting for the respective
ordered extents to complete, meaning that either of the following will
happen:
1) We log an extent after its writeback finishes but before its checksums
are added to the csum tree, leading to -EIO errors when attempting to
read the extent after a log replay.
2) We log an extent before its writeback finishes.
Therefore after the log replay we will have a file extent item pointing
to an unwritten extent (and without the respective data checksums as
well).
This could not happen before the fast fsync patch simplification, because
for any extent we found in the list of modified extents, we would wait for
its respective ordered extent to finish writeback or collect its checksums
for logging if it did not complete yet.
Fix this by triggering writeback again after acquiring the inode's lock
and before waiting for ordered extents to complete.
Fixes: e7175a6927 ("btrfs: remove the wait ordered logic in the log_one_extent path")
Fixes: b5e6c3e170 ("btrfs: always wait on ordered extents at fsync time")
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When a metadata read is served the endio routine btree_readpage_end_io_hook
is called which eventually runs the tree-checker. If tree-checker fails
to validate the read eb then it sets EXTENT_BUFFER_CORRUPT flag. This
leads to btree_read_extent_buffer_pages wrongly assuming that all
available copies of this extent buffer are wrong and failing prematurely.
Fix this modify btree_read_extent_buffer_pages to read all copies of
the data.
This failure was exhibitted in xfstests btrfs/124 which would
spuriously fail its balance operations. The reason was that when balance
was run following re-introduction of the missing raid1 disk
__btrfs_map_block would map the read request to stripe 0, which
corresponded to devid 2 (the disk which is being removed in the test):
item 2 key (FIRST_CHUNK_TREE CHUNK_ITEM 3553624064) itemoff 15975 itemsize 112
length 1073741824 owner 2 stripe_len 65536 type DATA|RAID1
io_align 65536 io_width 65536 sector_size 4096
num_stripes 2 sub_stripes 1
stripe 0 devid 2 offset 2156920832
dev_uuid 8466c350-ed0c-4c3b-b17d-6379b445d5c8
stripe 1 devid 1 offset 3553624064
dev_uuid 1265d8db-5596-477e-af03-df08eb38d2ca
This caused read requests for a checksum item that to be routed to the
stale disk which triggered the aforementioned logic involving
EXTENT_BUFFER_CORRUPT flag. This then triggered cascading failures of
the balance operation.
Fixes: a826d6dcb3 ("Btrfs: check items for correctness as we search")
CC: stable@vger.kernel.org # 4.4+
Suggested-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we exit the NFSv4 state manager due to a umount, then we can end up
leaving the NFS4CLNT_MANAGER_RUNNING flag set. If another mount causes
the nfs4_client to be rereferenced before it is destroyed, then we end
up never being able to recover state.
Fixes: 47c2199b6e ("NFSv4.1: Ensure state manager thread dies on last ...")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: stable@vger.kernel.org # v4.15+
lockdep_assert_held() is better suited to checking locking requirements,
since it only checks if the current thread holds the lock regardless of
whether someone else does. This is also a step towards possibly removing
spin_is_locked().
Signed-off-by: Lance Roy <ldr709@gmail.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: <linux-fsdevel@vger.kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
Since commit ff17fa561a ("d_invalidate(): unhash immediately")
immediately unhashes the dentry, we'll never return the mountpoint in
lookup_mountpoint(), which can lead to an unbreakable loop in
d_invalidate().
I have reports of NFS clients getting into this condition after the server
removes an export of an existing mount created through follow_automount(),
but I suspect there are various other ways to produce this problem if we
hunt down users of d_invalidate(). For example, it is possible to get into
this state by using XFS' d_invalidate() call in xfs_vn_unlink():
truncate -s 100m img{1,2}
mkfs.xfs -q -n version=ci img1
mkfs.xfs -q -n version=ci img2
mkdir -p /mnt/xfs
mount img1 /mnt/xfs
mkdir /mnt/xfs/sub1
mount img2 /mnt/xfs/sub1
cat > /mnt/xfs/sub1/foo &
umount -l /mnt/xfs/sub1
mount img2 /mnt/xfs/sub1
mount --make-private /mnt/xfs
mkdir /mnt/xfs/sub2
mount --move /mnt/xfs/sub1 /mnt/xfs/sub2
rmdir /mnt/xfs/sub1
Fix this by moving the check for an unlinked dentry out of the
detach_mounts() path.
Fixes: ff17fa561a ("d_invalidate(): unhash immediately")
Cc: stable@vger.kernel.org
Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAlvoGIUACgkQxWXV+ddt
WDta6g//UJSLnVskCUwh8VyMdd47QArQnaLJowOH7wQn4Nqj+2hf04mCq/kv05ed
OneTezzONZc/qW9fiJGS+Dp77ln4JIDA1hWHtb/A4t9pYlksSQllJ3oiDUVsCp3q
2EbzrjuNz3iQO6TjKlaHX473CLCMQMXS2OXOUnCkF2maMJSdr86oi+j1UiSnud1/
C7uMYM3hG8nkfEfjjb1COpkS2MmzYcPruF5RDcbT/WOUfylTsjjX1E7rK/ZEqS9P
SUcp4uoZe9BNoyWMASLaM7oHE82day4X9MwQoCQFRcm0kq4CnRAZ8X4lBl+M70iW
7Olii/wNZ2SRiJf3jac/rpxoBHvEskXTHyiHTEmdHp4n1L1pL9GzGYIePQcX7uV1
Tb6ImdUUKCC//fPqyeB7cEk5yxqahmlFD3qZVs6GnQkzKrPE+ChLx+7PgcJC/XVh
C5ogNmJm+NvFOuTrYk9zSXg85B8gWHescDJrvNKVizIjw3nKmqiC+dXZljhzw+p8
HscK9EXsiS8jW9ClfJljXzIa4SeA/i7fQGe4tCKfIrCQ+OqUxWpFCEoxygchinfF
Rw90fJ0jX083oXsnfFcVdQpQ+SLSKka/aIRMvi58WRgLU3trci5NNN4TFg8TYRKP
xBDF/iF3sqXajc+xsjoqLhLioZL3Pa5VDNuhsFdois9M5JSRekU=
=K14u
-----END PGP SIGNATURE-----
Merge tag 'for-4.20-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"Several fixes to recent release (4.19, fixes tagged for stable) and
other fixes"
* tag 'for-4.20-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
Btrfs: fix missing delayed iputs on unmount
Btrfs: fix data corruption due to cloning of eof block
Btrfs: fix infinite loop on inode eviction after deduplication of eof block
Btrfs: fix deadlock on tree root leaf when finding free extent
btrfs: avoid link error with CONFIG_NO_AUTO_INLINE
btrfs: tree-checker: Fix misleading group system information
Btrfs: fix missing data checksums after a ranged fsync (msync)
btrfs: fix pinned underflow after transaction aborted
Btrfs: fix cur_offset in the error case for nocow
error return cleanup paths.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAlvoFrEACgkQ8vlZVpUN
gaMTSQf+Ogrvm7pfWtXf+RkmhhuyR26T+Hwxgl51m5bKetJBjEsh0qOaIfo7etwG
aLc1x/pWng2VTCHk4z0Ij9KS8YwLK3sQCBYZoJFyT/R09yGgAhLm+xP5j38WLqrX
h4GxVgekHSATkG95N/So7F7pQiz7gDowgbaYFW3PooXPoHJnCnTzcr7TGFAQBZAw
iR+8+KtH5E8IcC7Jj40nemk7Wib45DgaeGpP5P9Ct/Jw7hW+Mwhf56NYOWkLdHyy
4Kt7rm1Sbxam8k3nksNmIwx28bw+S0Ew1zZgkwgAcKcHaWdrv3TtGPkOA26AH+S3
UVeORM7xH+zXslIOyFK+7sXUZr5LiQ==
=BaBl
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 fixes from Ted Ts'o:
"A large number of ext4 bug fixes, mostly buffer and memory leaks on
error return cleanup paths"
* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: missing !bh check in ext4_xattr_inode_write()
ext4: fix buffer leak in __ext4_read_dirblock() on error path
ext4: fix buffer leak in ext4_expand_extra_isize_ea() on error path
ext4: fix buffer leak in ext4_xattr_move_to_block() on error path
ext4: release bs.bh before re-using in ext4_xattr_block_find()
ext4: fix buffer leak in ext4_xattr_get_block() on error path
ext4: fix possible leak of s_journal_flag_rwsem in error path
ext4: fix possible leak of sbi->s_group_desc_leak in error path
ext4: remove unneeded brelse call in ext4_xattr_inode_update_ref()
ext4: avoid possible double brelse() in add_new_gdb() on error path
ext4: avoid buffer leak in ext4_orphan_add() after prior errors
ext4: avoid buffer leak on shutdown in ext4_mark_iloc_dirty()
ext4: fix possible inode leak in the retry loop of ext4_resize_fs()
ext4: fix missing cleanup if ext4_alloc_flex_bg_array() fails while resizing
ext4: add missing brelse() update_backups()'s error path
ext4: add missing brelse() add_new_gdb_meta_bg()'s error path
ext4: add missing brelse() in set_flexbg_block_bitmap()'s error path
ext4: avoid potential extra brelse in setup_new_flex_group_blocks()
Pull namespace fixes from Eric Biederman:
"I believe all of these are simple obviously correct bug fixes. These
fall into two groups:
- Fixing the implementation of MNT_LOCKED which prevents lesser
privileged users from seeing unders mounts created by more
privileged users.
- Fixing the extended uid and group mapping in user namespaces.
As well as ensuring the code looks correct I have spot tested these
changes as well and in my testing the fixes are working"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
mount: Prevent MNT_DETACH from disconnecting locked mounts
mount: Don't allow copying MNT_UNBINDABLE|MNT_LOCKED mounts
mount: Retest MNT_LOCKED in do_umount
userns: also map extents in the reverse map to kernel IDs
Fixes gcc '-Wunused-but-set-variable' warning:
fs/sysv/inode.c: In function '__sysv_write_inode':
fs/sysv/inode.c:239:6: warning:
variable 'err' set but not used [-Wunused-but-set-variable]
__sysv_write_inode should return 'err' instead of 0
Fixes: 05459ca81a ("repair sysv_write_inode(), switch sysv to simple_fsync()")
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
cleanup.
-----BEGIN PGP SIGNATURE-----
iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAlvluIATHGlkcnlvbW92
QGdtYWlsLmNvbQAKCRBKf944AhHzi/KDB/9ftmDVzZr8U9ubFIHfOKZQsxqElOAc
U/naOKU9PLZNsJkBRZNQMklS5OPAiWBPf/9bWTt+TV9jy8ljjt+Vnxmgqj8StqZY
da449b8uwDRWOY/3hzBNqDshmx3lWxI1+JIDcJPM2SkSASnBg6E1Usl0/xBp/a+r
dLLTUBJrxHMWtjXqclXk2iE1+Ehh5AMdqcwNKuqEJ3rg9OIt8PN/vDQN9dJk4kAX
4xwFoBY0WjUACf5r3+VhP/6yNxuLIIPKjygfkdYzc2LTDVXOr1SY5X0V26v6nN0K
UTqmn4g1uIrzaYCaPmzDYHKT7JYHQUPMMu9TaXkWt+MZxEsZ+z6r8QSU
=T9mn
-----END PGP SIGNATURE-----
Merge tag 'ceph-for-4.20-rc2' of https://github.com/ceph/ceph-client
Pull Ceph fixes from Ilya Dryomov:
"Two CephFS fixes (copy_file_range and quota) and a small feature bit
cleanup"
* tag 'ceph-for-4.20-rc2' of https://github.com/ceph/ceph-client:
libceph: assume argonaut on the server side
ceph: quota: fix null pointer dereference in quota check
ceph: add destination file data sync before doing any remote copy
According to Ted Ts'o ext4_getblk() called in ext4_xattr_inode_write()
should not return bh = NULL
The only time that bh could be NULL, then, would be in the case of
something really going wrong; a programming error elsewhere (perhaps a
wild pointer dereference) or I/O error causing on-disk file system
corruption (although that would be highly unlikely given that we had
*just* allocated the blocks and so the metadata blocks in question
probably would still be in the cache).
Fixes: e50e5129f3 ("ext4: xattr-in-inode support")
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org # 4.13
In async IO blocking case the additional reference to the io is taken for
it to survive fuse_aio_complete(). In non blocking case this additional
reference is not needed, however we still reference io to figure out
whether to wait for completion or not. This is wrong and will lead to
use-after-free. Fix it by storing blocking information in separate
variable.
This was spotted by KASAN when running generic/208 fstest.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Reported-by: Zorro Lang <zlang@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: 744742d692 ("fuse: Add reference counting for fuse_io_priv")
Cc: <stable@vger.kernel.org> # v4.6
In current fuse_drop_waiting() implementation it's possible that
fuse_wait_aborted() will not be woken up in the unlikely case that
fuse_abort_conn() + fuse_wait_aborted() runs in between checking
fc->connected and calling atomic_dec(&fc->num_waiting).
Do the atomic_dec_and_test() unconditionally, which also provides the
necessary barrier against reordering with the fc->connected check.
The explicit smp_mb() in fuse_wait_aborted() is not actually needed, since
the spin_unlock() in fuse_abort_conn() provides the necessary RELEASE
barrier after resetting fc->connected. However, this is not a performance
sensitive path, and adding the explicit barrier makes it easier to
document.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: b8f95e5d13 ("fuse: umount should wait for all requests")
Cc: <stable@vger.kernel.org> #v4.19
fuse_request_send_notify_reply() may fail if the connection was reset for
some reason (e.g. fs was unmounted). Don't leak request reference in this
case. Besides leaking memory, this resulted in fc->num_waiting not being
decremented and hence fuse_wait_aborted() left in a hanging and unkillable
state.
Fixes: 2d45ba381a ("fuse: add retrieve request")
Fixes: b8f95e5d13 ("fuse: umount should wait for all requests")
Reported-and-tested-by: syzbot+6339eda9cb4ebbc4c37b@syzkaller.appspotmail.com
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: <stable@vger.kernel.org> #v2.6.36
The previous attempt to fix for metadata read-ahead during truncate was
incorrect: for files with a height > 2 (1006989312 bytes with a block
size of 4096 bytes), read-ahead requests were not being issued for some
of the indirect blocks discovered while walking the metadata tree,
leading to significant slow-downs when deleting large files. Fix that.
In addition, only issue read-ahead requests in the first pass through
the meta-data tree, while deallocating data blocks.
Fixes: c3ce5aa9b0 ("gfs2: Fix metadata read-ahead during truncate")
Cc: stable@vger.kernel.org # v4.16+
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
gfs2_put_super calls gfs2_clear_rgrpd to destroy the gfs2_rgrpd objects
attached to the resource group glocks. That function should release the
buffers attached to the gfs2_bitmap objects (bi_bh), but the call to
gfs2_rgrp_brelse for doing that is missing.
When gfs2_releasepage later runs across these buffers which are still
referenced, it refuses to free them. This causes the pages the buffers
are attached to to remain referenced as well. With enough mount/unmount
cycles, the system will eventually run out of memory.
Fix this by adding the missing call to gfs2_rgrp_brelse in
gfs2_clear_rgrpd.
(Also fix a gfs2_rgrp_relse -> gfs2_rgrp_brelse typo in a comment.)
Fixes: 39b0f1e929 ("GFS2: Don't brelse rgrp buffer_heads every allocation")
Cc: stable@vger.kernel.org # v4.2+
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Before this patch, recovery would cause all callbacks to be delayed,
put on a queue, and afterward they were all queued to the callback
work queue. This patch does the same thing, but occasionally takes
a break after 25 of them so it won't swamp the CPU at the expense
of other RT processes like corosync.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
Make sure we have a saved filehandle, otherwise we'll oops with a null
pointer dereference in nfs4_preprocess_stateid_op().
Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
No one is running pre-argonaut. In addition one of the argonaut
features (NOSRCADDR) has been required since day one (and a half,
2.6.34 vs 2.6.35) of the kernel client.
Allow for the possibility of reusing these feature bits later.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
This patch fixes a possible null pointer dereference in
check_quota_exceeded, detected by the static checker smatch, with the
following warning:
fs/ceph/quota.c:240 check_quota_exceeded()
error: we previously assumed 'realm' could be null (see line 188)
Fixes: b7a2921765 ("ceph: quota: support for ceph.quota.max_files")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
If we try to copy into a file that was just written, any data that is
remote copied will be overwritten by our buffered writes once they are
flushed. When this happens, the call to invalidate_inode_pages2_range
will also return a -EBUSY error.
This patch fixes this by also sync'ing the destination file before
starting any copy.
Fixes: 503f82a993 ("ceph: support copy_file_range file operation")
Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
When an event is reported on a sub-directory and the parent inode has
a mark mask with FS_EVENT_ON_CHILD|FS_ISDIR, the event will be sent to
fsnotify() even if the event type is not in the parent mark mask
(e.g. FS_OPEN).
Further more, if that event happened on a mount or a filesystem with
a mount/sb mark that does have that event type in their mask, the "on
child" event will be reported on the mount/sb mark. That is not
desired, because user will get a duplicate event for the same action.
Note that the event reported on the victim inode is never merged with
the event reported on the parent inode, because of the check in
should_merge(): old_fsn->inode == new_fsn->inode.
Fix this by looking for a match of an actual event type (i.e. not just
FS_ISDIR) in parent's inode mark mask and by not reporting an "on child"
event to group if event type is only found on mount/sb marks.
[backport hint: The bug seems to have always been in fanotify, but this
patch will only apply cleanly to v4.19.y]
Cc: <stable@vger.kernel.org> # v4.19
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
If filesystem has already mounted as read-only, then we don't have
to do it again.
Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Timothy Baldwin <timbaldwin@fastmail.co.uk> wrote:
> As per mount_namespaces(7) unprivileged users should not be able to look under mount points:
>
> Mounts that come as a single unit from more privileged mount are locked
> together and may not be separated in a less privileged mount namespace.
>
> However they can:
>
> 1. Create a mount namespace.
> 2. In the mount namespace open a file descriptor to the parent of a mount point.
> 3. Destroy the mount namespace.
> 4. Use the file descriptor to look under the mount point.
>
> I have reproduced this with Linux 4.16.18 and Linux 4.18-rc8.
>
> The setup:
>
> $ sudo sysctl kernel.unprivileged_userns_clone=1
> kernel.unprivileged_userns_clone = 1
> $ mkdir -p A/B/Secret
> $ sudo mount -t tmpfs hide A/B
>
>
> "Secret" is indeed hidden as expected:
>
> $ ls -lR A
> A:
> total 0
> drwxrwxrwt 2 root root 40 Feb 12 21:08 B
>
> A/B:
> total 0
>
>
> The attack revealing "Secret":
>
> $ unshare -Umr sh -c "exec unshare -m ls -lR /proc/self/fd/4/ 4<A"
> /proc/self/fd/4/:
> total 0
> drwxr-xr-x 3 root root 60 Feb 12 21:08 B
>
> /proc/self/fd/4/B:
> total 0
> drwxr-xr-x 2 root root 40 Feb 12 21:08 Secret
>
> /proc/self/fd/4/B/Secret:
> total 0
I tracked this down to put_mnt_ns running passing UMOUNT_SYNC and
disconnecting all of the mounts in a mount namespace. Fix this by
factoring drop_mounts out of drop_collected_mounts and passing
0 instead of UMOUNT_SYNC.
There are two possible behavior differences that result from this.
- No longer setting UMOUNT_SYNC will no longer set MNT_SYNC_UMOUNT on
the vfsmounts being unmounted. This effects the lazy rcu walk by
kicking the walk out of rcu mode and forcing it to be a non-lazy
walk.
- No longer disconnecting locked mounts will keep some mounts around
longer as they stay because the are locked to other mounts.
There are only two users of drop_collected mounts: audit_tree.c and
put_mnt_ns.
In audit_tree.c the mounts are private and there are no rcu lazy walks
only calls to iterate_mounts. So the changes should have no effect
except for a small timing effect as the connected mounts are disconnected.
In put_mnt_ns there may be references from process outside the mount
namespace to the mounts. So the mounts remaining connected will
be the bug fix that is needed. That rcu walks are allowed to continue
appears not to be a problem especially as the rcu walk change was about
an implementation detail not about semantics.
Cc: stable@vger.kernel.org
Fixes: 5ff9d8a65c ("vfs: Lock in place mounts from more privileged users")
Reported-by: Timothy Baldwin <timbaldwin@fastmail.co.uk>
Tested-by: Timothy Baldwin <timbaldwin@fastmail.co.uk>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Jonathan Calmels from NVIDIA reported that he's able to bypass the
mount visibility security check in place in the Linux kernel by using
a combination of the unbindable property along with the private mount
propagation option to allow a unprivileged user to see a path which
was purposefully hidden by the root user.
Reproducer:
# Hide a path to all users using a tmpfs
root@castiana:~# mount -t tmpfs tmpfs /sys/devices/
root@castiana:~#
# As an unprivileged user, unshare user namespace and mount namespace
stgraber@castiana:~$ unshare -U -m -r
# Confirm the path is still not accessible
root@castiana:~# ls /sys/devices/
# Make /sys recursively unbindable and private
root@castiana:~# mount --make-runbindable /sys
root@castiana:~# mount --make-private /sys
# Recursively bind-mount the rest of /sys over to /mnnt
root@castiana:~# mount --rbind /sys/ /mnt
# Access our hidden /sys/device as an unprivileged user
root@castiana:~# ls /mnt/devices/
breakpoint cpu cstate_core cstate_pkg i915 intel_pt isa kprobe
LNXSYSTM:00 msr pci0000:00 platform pnp0 power software system
tracepoint uncore_arb uncore_cbox_0 uncore_cbox_1 uprobe virtual
Solve this by teaching copy_tree to fail if a mount turns out to be
both unbindable and locked.
Cc: stable@vger.kernel.org
Fixes: 5ff9d8a65c ("vfs: Lock in place mounts from more privileged users")
Reported-by: Jonathan Calmels <jcalmels@nvidia.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
It was recently pointed out that the one instance of testing MNT_LOCKED
outside of the namespace_sem is in ksys_umount.
Fix that by adding a test inside of do_umount with namespace_sem and
the mount_lock held. As it helps to fail fails the existing test is
maintained with an additional comment pointing out that it may be racy
because the locks are not held.
Cc: stable@vger.kernel.org
Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Fixes: 5ff9d8a65c ("vfs: Lock in place mounts from more privileged users")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
In copy_result_to_user(), we first create a struct dlm_lock_result, which
contains a struct dlm_lksb, the last member of which is a pointer to the
lvb. Unfortunately, we copy the entire struct dlm_lksb to the result
struct, which is then copied to userspace at the end of the function,
leaking the contents of sb_lvbptr, which is a valid kernel pointer in some
cases (indeed, later in the same function the data it points to is copied
to userspace).
It is an error to leak kernel pointers to userspace, as it undermines KASLR
protections (see e.g. 65eea8edc3 ("floppy: Do not copy a kernel pointer to
user memory in FDGETPRM ioctl") for another example of this).
Signed-off-by: Tycho Andersen <tycho@tycho.ws>
Signed-off-by: David Teigland <teigland@redhat.com>
kobject doesn't like zero length object names, so let's test for that.
Signed-off-by: Tycho Andersen <tycho@tycho.ws>
Signed-off-by: David Teigland <teigland@redhat.com>
dlm_config_nodes() does not allocate nodes on failure, so we should not
free() nodes when it fails.
Signed-off-by: Tycho Andersen <tycho@tycho.ws>
Signed-off-by: David Teigland <teigland@redhat.com>
We use IOCB_HIPRI to poll for IO in the caller instead of scheduling.
This information is not available for (or after) IO submission. The
driver may make different queue choices based on the type of IO, so
make the fact that we will poll for this IO known to the lower layers
as well.
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There's a race between close_ctree() and cleaner_kthread().
close_ctree() sets btrfs_fs_closing(), and the cleaner stops when it
sees it set, but this is racy; the cleaner might have already checked
the bit and could be cleaning stuff. In particular, if it deletes unused
block groups, it will create delayed iputs for the free space cache
inodes. As of "btrfs: don't run delayed_iputs in commit", we're no
longer running delayed iputs after a commit. Therefore, if the cleaner
creates more delayed iputs after delayed iputs are run in
btrfs_commit_super(), we will leak inodes on unmount and get a busy
inode crash from the VFS.
Fix it by parking the cleaner before we actually close anything. Then,
any remaining delayed iputs will always be handled in
btrfs_commit_super(). This also ensures that the commit in close_ctree()
is really the last commit, so we can get rid of the commit in
cleaner_kthread().
The fstest/generic/475 followed by 476 can trigger a crash that
manifests as a slab corruption caused by accessing the freed kthread
structure by a wake up function. Sample trace:
[ 5657.077612] BUG: unable to handle kernel NULL pointer dereference at 00000000000000cc
[ 5657.079432] PGD 1c57a067 P4D 1c57a067 PUD da10067 PMD 0
[ 5657.080661] Oops: 0000 [#1] PREEMPT SMP
[ 5657.081592] CPU: 1 PID: 5157 Comm: fsstress Tainted: G W 4.19.0-rc8-default+ #323
[ 5657.083703] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626cc-prebuilt.qemu-project.org 04/01/2014
[ 5657.086577] RIP: 0010:shrink_page_list+0x2f9/0xe90
[ 5657.091937] RSP: 0018:ffffb5c745c8f728 EFLAGS: 00010287
[ 5657.092953] RAX: 0000000000000074 RBX: ffffb5c745c8f830 RCX: 0000000000000000
[ 5657.094590] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff9a8747fdf3d0
[ 5657.095987] RBP: ffffb5c745c8f9e0 R08: 0000000000000000 R09: 0000000000000000
[ 5657.097159] R10: ffff9a8747fdf5e8 R11: 0000000000000000 R12: ffffb5c745c8f788
[ 5657.098513] R13: ffff9a877f6ff2c0 R14: ffff9a877f6ff2c8 R15: dead000000000200
[ 5657.099689] FS: 00007f948d853b80(0000) GS:ffff9a877d600000(0000) knlGS:0000000000000000
[ 5657.101032] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5657.101953] CR2: 00000000000000cc CR3: 00000000684bd000 CR4: 00000000000006e0
[ 5657.103159] Call Trace:
[ 5657.103776] shrink_inactive_list+0x194/0x410
[ 5657.104671] shrink_node_memcg.constprop.84+0x39a/0x6a0
[ 5657.105750] shrink_node+0x62/0x1c0
[ 5657.106529] try_to_free_pages+0x1a4/0x500
[ 5657.107408] __alloc_pages_slowpath+0x2c9/0xb20
[ 5657.108418] __alloc_pages_nodemask+0x268/0x2b0
[ 5657.109348] kmalloc_large_node+0x37/0x90
[ 5657.110205] __kmalloc_node+0x236/0x310
[ 5657.111014] kvmalloc_node+0x3e/0x70
Fixes: 30928e9baa ("btrfs: don't run delayed_iputs in commit")
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add trace ]
Signed-off-by: David Sterba <dsterba@suse.com>
bs.bh was taken in previous ext4_xattr_block_find() call,
it should be released before re-using
Fixes: 7e01c8e542 ("ext3/4: fix uninitialized bs in ...")
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org # 2.6.26
Fixes: bfe0a5f47a ("ext4: add more mount time checks of the superblock")
Reported-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org # 4.18
ext4_mark_iloc_dirty() callers expect that it releases iloc->bh
even if it returns an error.
Fixes: 0db1ff222d ("ext4: add shutdown bit and check for it")
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org # 4.11
generic/070 on 64k block size filesystems is failing with a verifier
corruption on writeback or an attribute leaf block:
[ 94.973083] XFS (pmem0): Metadata corruption detected at xfs_attr3_leaf_verify+0x246/0x260, xfs_attr3_leaf block 0x811480
[ 94.975623] XFS (pmem0): Unmount and run xfs_repair
[ 94.976720] XFS (pmem0): First 128 bytes of corrupted metadata buffer:
[ 94.978270] 000000004b2e7b45: 00 00 00 00 00 00 00 00 3b ee 00 00 00 00 00 00 ........;.......
[ 94.980268] 000000006b1db90b: 00 00 00 00 00 81 14 80 00 00 00 00 00 00 00 00 ................
[ 94.982251] 00000000433f2407: 22 7b 5c 82 2d 5c 47 4c bb 31 1c 37 fa a9 ce d6 "{\.-\GL.1.7....
[ 94.984157] 0000000010dc7dfb: 00 00 00 00 00 81 04 8a 00 0a 18 e8 dd 94 01 00 ................
[ 94.986215] 00000000d5a19229: 00 a0 dc f4 fe 98 01 68 f0 d8 07 e0 00 00 00 00 .......h........
[ 94.988171] 00000000521df36c: 0c 2d 32 e2 fe 20 01 00 0c 2d 58 65 fe 0c 01 00 .-2.. ...-Xe....
[ 94.990162] 000000008477ae06: 0c 2d 5b 66 fe 8c 01 00 0c 2d 71 35 fe 7c 01 00 .-[f.....-q5.|..
[ 94.992139] 00000000a4a6bca6: 0c 2d 72 37 fc d4 01 00 0c 2d d8 b8 f0 90 01 00 .-r7.....-......
[ 94.994789] XFS (pmem0): xfs_do_force_shutdown(0x8) called from line 1453 of file fs/xfs/xfs_buf.c. Return address = ffffffff815365f3
This is failing this check:
end = ichdr.freemap[i].base + ichdr.freemap[i].size;
if (end < ichdr.freemap[i].base)
>>>>> return __this_address;
if (end > mp->m_attr_geo->blksize)
return __this_address;
And from the buffer output above, the freemap array is:
freemap[0].base = 0x00a0
freemap[0].size = 0xdcf4 end = 0xdd94
freemap[1].base = 0xfe98
freemap[1].size = 0x0168 end = 0x10000
freemap[2].base = 0xf0d8
freemap[2].size = 0x07e0 end = 0xf8b8
These all look valid - the block size is 0x10000 and so from the
last check in the above verifier fragment we know that the end
of freemap[1] is valid. The problem is that end is declared as:
uint16_t end;
And (uint16_t)0x10000 = 0. So we have a verifier bug here, not a
corruption. Fix the verifier to use uint32_t types for the check and
hence avoid the overflow.
Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=201577
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Use DUMP_PREFIX_OFFSET when printing hex dumps of corrupt buffers
because modern Linux now prints a 32-bit hash of our 64-bit pointer when
using DUMP_PREFIX_ADDRESS:
00000000b4bb4297: 00 00 00 00 00 00 00 00 3b ee 00 00 00 00 00 00 ........;.......
00000005ec77e26: 00 00 00 00 02 d0 5a 00 00 00 00 00 00 00 00 00 ......Z.........
000000015938018: 21 98 e8 b4 fd de 4c 07 bc ea 3c e5 ae b4 7c 48 !.....L...<...|H
This is totally worthless for a sequential dump since we probably only
care about tracking the buffer offsets and afaik there's no way to
recover the actual pointer from the hashed value.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
In this function, once 'buf' has been allocated, we unconditionally
return 0.
However, 'error' is set to some error codes in several error handling
paths.
Before commit 232b51948b ("xfs: simplify the xfs_getbmap interface")
this was not an issue because all error paths were returning directly,
but now that some cleanup at the end may be needed, we must propagate the
error code.
Fixes: 232b51948b ("xfs: simplify the xfs_getbmap interface")
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
We currently allow cloning a range from a file which includes the last
block of the file even if the file's size is not aligned to the block
size. This is fine and useful when the destination file has the same size,
but when it does not and the range ends somewhere in the middle of the
destination file, it leads to corruption because the bytes between the EOF
and the end of the block have undefined data (when there is support for
discard/trimming they have a value of 0x00).
Example:
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ export foo_size=$((256 * 1024 + 100))
$ xfs_io -f -c "pwrite -S 0x3c 0 $foo_size" /mnt/foo
$ xfs_io -f -c "pwrite -S 0xb5 0 1M" /mnt/bar
$ xfs_io -c "reflink /mnt/foo 0 512K $foo_size" /mnt/bar
$ od -A d -t x1 /mnt/bar
0000000 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5
*
0524288 3c 3c 3c 3c 3c 3c 3c 3c 3c 3c 3c 3c 3c 3c 3c 3c
*
0786528 3c 3c 3c 3c 00 00 00 00 00 00 00 00 00 00 00 00
0786544 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
0790528 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5
*
1048576
The bytes in the range from 786532 (512Kb + 256Kb + 100 bytes) to 790527
(512Kb + 256Kb + 4Kb - 1) got corrupted, having now a value of 0x00 instead
of 0xb5.
This is similar to the problem we had for deduplication that got recently
fixed by commit de02b9f6bb ("Btrfs: fix data corruption when
deduplicating between different files").
Fix this by not allowing such operations to be performed and return the
errno -EINVAL to user space. This is what XFS is doing as well at the VFS
level. This change however now makes us return -EINVAL instead of
-EOPNOTSUPP for cases where the source range maps to an inline extent and
the destination range's end is smaller then the destination file's size,
since the detection of inline extents is done during the actual process of
dropping file extent items (at __btrfs_drop_extents()). Returning the
-EINVAL error is done early on and solely based on the input parameters
(offsets and length) and destination file's size. This makes us consistent
with XFS and anyone else supporting cloning since this case is now checked
at a higher level in the VFS and is where the -EINVAL will be returned
from starting with kernel 4.20 (the VFS changed was introduced in 4.20-rc1
by commit 07d19dc9fb ("vfs: avoid problematic remapping requests into
partial EOF block"). So this change is more geared towards stable kernels,
as it's unlikely the new VFS checks get removed intentionally.
A test case for fstests follows soon, as well as an update to filter
existing tests that expect -EOPNOTSUPP to accept -EINVAL as well.
CC: <stable@vger.kernel.org> # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When we are writing out a free space cache, during the transaction commit
phase, we can end up in a deadlock which results in a stack trace like the
following:
schedule+0x28/0x80
btrfs_tree_read_lock+0x8e/0x120 [btrfs]
? finish_wait+0x80/0x80
btrfs_read_lock_root_node+0x2f/0x40 [btrfs]
btrfs_search_slot+0xf6/0x9f0 [btrfs]
? evict_refill_and_join+0xd0/0xd0 [btrfs]
? inode_insert5+0x119/0x190
btrfs_lookup_inode+0x3a/0xc0 [btrfs]
? kmem_cache_alloc+0x166/0x1d0
btrfs_iget+0x113/0x690 [btrfs]
__lookup_free_space_inode+0xd8/0x150 [btrfs]
lookup_free_space_inode+0x5b/0xb0 [btrfs]
load_free_space_cache+0x7c/0x170 [btrfs]
? cache_block_group+0x72/0x3b0 [btrfs]
cache_block_group+0x1b3/0x3b0 [btrfs]
? finish_wait+0x80/0x80
find_free_extent+0x799/0x1010 [btrfs]
btrfs_reserve_extent+0x9b/0x180 [btrfs]
btrfs_alloc_tree_block+0x1b3/0x4f0 [btrfs]
__btrfs_cow_block+0x11d/0x500 [btrfs]
btrfs_cow_block+0xdc/0x180 [btrfs]
btrfs_search_slot+0x3bd/0x9f0 [btrfs]
btrfs_lookup_inode+0x3a/0xc0 [btrfs]
? kmem_cache_alloc+0x166/0x1d0
btrfs_update_inode_item+0x46/0x100 [btrfs]
cache_save_setup+0xe4/0x3a0 [btrfs]
btrfs_start_dirty_block_groups+0x1be/0x480 [btrfs]
btrfs_commit_transaction+0xcb/0x8b0 [btrfs]
At cache_save_setup() we need to update the inode item of a block group's
cache which is located in the tree root (fs_info->tree_root), which means
that it may result in COWing a leaf from that tree. If that happens we
need to find a free metadata extent and while looking for one, if we find
a block group which was not cached yet we attempt to load its cache by
calling cache_block_group(). However this function will try to load the
inode of the free space cache, which requires finding the matching inode
item in the tree root - if that inode item is located in the same leaf as
the inode item of the space cache we are updating at cache_save_setup(),
we end up in a deadlock, since we try to obtain a read lock on the same
extent buffer that we previously write locked.
So fix this by using the tree root's commit root when searching for a
block group's free space cache inode item when we are attempting to load
a free space cache. This is safe since block groups once loaded stay in
memory forever, as well as their caches, so after they are first loaded
we will never need to read their inode items again. For new block groups,
once they are created they get their ->cached field set to
BTRFS_CACHE_FINISHED meaning we will not need to read their inode item.
Reported-by: Andrew Nelson <andrew.s.nelson@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CAPTELenq9x5KOWuQ+fa7h1r3nsJG8vyiTH8+ifjURc_duHh2Wg@mail.gmail.com/
Fixes: 9d66e233c7 ("Btrfs: load free space cache if it exists")
Tested-by: Andrew Nelson <andrew.s.nelson@gmail.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Note: this patch fixes a problem in a feature outside of btrfs ("kernel
hacking: add a config option to disable compiler auto-inlining") and is
applied ahead of time due to cross-subsystem dependencies.
On 32-bit ARM with gcc-8, I see a link error with the addition of the
CONFIG_NO_AUTO_INLINE option:
fs/btrfs/super.o: In function `btrfs_statfs':
super.c:(.text+0x67b8): undefined reference to `__aeabi_uldivmod'
super.c:(.text+0x67fc): undefined reference to `__aeabi_uldivmod'
super.c:(.text+0x6858): undefined reference to `__aeabi_uldivmod'
super.c:(.text+0x6920): undefined reference to `__aeabi_uldivmod'
super.c:(.text+0x693c): undefined reference to `__aeabi_uldivmod'
fs/btrfs/super.o:super.c:(.text+0x6958): more undefined references to `__aeabi_uldivmod' follow
So far this is the only file that shows the behavior, so I'd propose
to just work around it by marking the functions as 'static inline'
that normally get inlined here.
The reference to __aeabi_uldivmod comes from a div_u64() which has an
optimization for a constant division that uses a straight '/' operator
when the result should be known to the compiler. My interpretation is
that as we turn off inlining, gcc still expects the result to be constant
but fails to use that constant value.
Link: https://lkml.kernel.org/r/20181103153941.1881966-1-arnd@arndb.de
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Changbin Du <changbin.du@gmail.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
[ add the note ]
Signed-off-by: David Sterba <dsterba@suse.com>
block_group_err shows the group system as a decimal value with a '0x'
prefix, which is somewhat misleading.
Fix it to print hexadecimal, as was intended.
Fixes: fce466eab7 ("btrfs: tree-checker: Verify block_group_item")
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Shaokun Zhang <zhangshaokun@hisilicon.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Recently we got a massive simplification for fsync, where for the fast
path we no longer log new extents while their respective ordered extents
are still running.
However that simplification introduced a subtle regression for the case
where we use a ranged fsync (msync). Consider the following example:
CPU 0 CPU 1
mmap write to range [2Mb, 4Mb[
mmap write to range [512Kb, 1Mb[
msync range [512K, 1Mb[
--> triggers fast fsync
(BTRFS_INODE_NEEDS_FULL_SYNC
not set)
--> creates extent map A for this
range and adds it to list of
modified extents
--> starts ordered extent A for
this range
--> waits for it to complete
writeback triggered for range
[2Mb, 4Mb[
--> create extent map B and
adds it to the list of
modified extents
--> creates ordered extent B
--> start looking for and logging
modified extents
--> logs extent maps A and B
--> finds checksums for extent A
in the csum tree, but not for
extent B
fsync (msync) finishes
--> ordered extent B
finishes and its
checksums are added
to the csum tree
<power cut>
After replaying the log, we have the extent covering the range [2Mb, 4Mb[
but do not have the data checksum items covering that file range.
This happens because at the very beginning of an fsync (btrfs_sync_file())
we start and wait for IO in the given range [512Kb, 1Mb[ and therefore
wait for any ordered extents in that range to complete before we start
logging the extents. However if right before we start logging the extent
in our range [512Kb, 1Mb[, writeback is started for any other dirty range,
such as the range [2Mb, 4Mb[ due to memory pressure or a concurrent fsync
or msync (btrfs_sync_file() starts writeback before acquiring the inode's
lock), an ordered extent is created for that other range and a new extent
map is created to represent that range and added to the inode's list of
modified extents.
That means that we will see that other extent in that list when collecting
extents for logging (done at btrfs_log_changed_extents()) and log the
extent before the respective ordered extent finishes - namely before the
checksum items are added to the checksums tree, which is where
log_extent_csums() looks for the checksums, therefore making us log an
extent without logging its checksums. Before that massive simplification
of fsync, this wasn't a problem because besides looking for checkums in
the checksums tree, we also looked for them in any ordered extent still
running.
The consequence of data checksums missing for a file range is that users
attempting to read the affected file range will get -EIO errors and dmesg
reports the following:
[10188.358136] BTRFS info (device sdc): no csum found for inode 297 start 57344
[10188.359278] BTRFS warning (device sdc): csum failed root 5 ino 297 off 57344 csum 0x98f94189 expected csum 0x00000000 mirror 1
So fix this by skipping extents outside of our logging range at
btrfs_log_changed_extents() and leaving them on the list of modified
extents so that any subsequent ranged fsync may collect them if needed.
Also, if we find a hole extent outside of the range still log it, just
to prevent having gaps between extent items after replaying the log,
otherwise fsck will complain when we are not using the NO_HOLES feature
(fstest btrfs/056 triggers such case).
Fixes: e7175a6927 ("btrfs: remove the wait ordered logic in the log_one_extent path")
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When the cow_file_range fails, the related resources are unlocked
according to the range [start..end), so the unlock cannot be repeated in
run_delalloc_nocow.
In some cases (e.g. cur_offset <= end && cow_start != -1), cur_offset is
not updated correctly, so move the cur_offset update before
cow_file_range.
kernel BUG at mm/page-writeback.c:2663!
Internal error: Oops - BUG: 0 [#1] SMP
CPU: 3 PID: 31525 Comm: kworker/u8:7 Tainted: P O
Hardware name: Realtek_RTD1296 (DT)
Workqueue: writeback wb_workfn (flush-btrfs-1)
task: ffffffc076db3380 ti: ffffffc02e9ac000 task.ti: ffffffc02e9ac000
PC is at clear_page_dirty_for_io+0x1bc/0x1e8
LR is at clear_page_dirty_for_io+0x14/0x1e8
pc : [<ffffffc00033c91c>] lr : [<ffffffc00033c774>] pstate: 40000145
sp : ffffffc02e9af4f0
Process kworker/u8:7 (pid: 31525, stack limit = 0xffffffc02e9ac020)
Call trace:
[<ffffffc00033c91c>] clear_page_dirty_for_io+0x1bc/0x1e8
[<ffffffbffc514674>] extent_clear_unlock_delalloc+0x1e4/0x210 [btrfs]
[<ffffffbffc4fb168>] run_delalloc_nocow+0x3b8/0x948 [btrfs]
[<ffffffbffc4fb948>] run_delalloc_range+0x250/0x3a8 [btrfs]
[<ffffffbffc514c0c>] writepage_delalloc.isra.21+0xbc/0x1d8 [btrfs]
[<ffffffbffc516048>] __extent_writepage+0xe8/0x248 [btrfs]
[<ffffffbffc51630c>] extent_write_cache_pages.isra.17+0x164/0x378 [btrfs]
[<ffffffbffc5185a8>] extent_writepages+0x48/0x68 [btrfs]
[<ffffffbffc4f5828>] btrfs_writepages+0x20/0x30 [btrfs]
[<ffffffc00033d758>] do_writepages+0x30/0x88
[<ffffffc0003ba0f4>] __writeback_single_inode+0x34/0x198
[<ffffffc0003ba6c4>] writeback_sb_inodes+0x184/0x3c0
[<ffffffc0003ba96c>] __writeback_inodes_wb+0x6c/0xc0
[<ffffffc0003bac20>] wb_writeback+0x1b8/0x1c0
[<ffffffc0003bb0f0>] wb_workfn+0x150/0x250
[<ffffffc0002b0014>] process_one_work+0x1dc/0x388
[<ffffffc0002b02f0>] worker_thread+0x130/0x500
[<ffffffc0002b6344>] kthread+0x10c/0x110
[<ffffffc000284590>] ret_from_fork+0x10/0x40
Code: d503201f a9025bb5 a90363b7 f90023b9 (d4210000)
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: David Sterba <dsterba@suse.com>
- Full filesystem authentication feature,
UBIFS is now able to have the whole filesystem structure
authenticated plus user data encrypted and authenticated.
- Minor cleanups
-----BEGIN PGP SIGNATURE-----
iQJKBAABCAA0FiEEdgfidid8lnn52cLTZvlZhesYu8EFAlvaF2IWHHJpY2hhcmRA
c2lnbWEtc3Rhci5hdAAKCRBm+VmF6xi7wUb/D/0Z/jN80LtxoIlQzmfoBnVSnaXv
BDvdDFHTwV+zu4XCvUyJzBnwzNjDxNK2XD5hAgiqCoTk5sr4KUi5+zfft5XMW40w
T1m5mQNhjwmcI/J/5m2gSHbOSB8Hkc0HIybknS+5ZJDa1OZUkxejLcmpK5Wk+bxp
Ak1cOn5GIJKRQMrUudhySkQaBe0DnNmHSACePSb5AYGlnRy6eJ26ANR2mU7PFg1V
NBVbOQjMrYIV9qq9m+vtTNsLXidcaRf474fg7lshodmDBISy9g83Oq8FaPzYTJVJ
rkvdsRzCrXeApSH2LJ8Gb1AvIAlvJa2Va+anXh8NrSBySfzTKrIPtIONkpF7zxOC
8naZcRNvTqWcMfaTKGK+SGWUqGlHxdGOo5NkkKrn0jsO6HJ8kYAXKFGx65MsiCLv
xPlKc543ZLSscw3JJqLXVoXr2hmwhUHMJwwaPngFmdgm88bog62feUgFpYOU/1dj
1s2+q3jSUqfuS4oInjAmeX/Yq9dss/6dMo73ikbekIGRtijUfCMBWFyINdE0oWPu
ZUdOOifYrozIG7wWEo6ZzCI1PIyPvYfKcXVMWimPmu9Xi5AnbCDMQmPYVF5YMM0R
jexN9gVyFQQjz940reFJi0EkIJjwCycWLWft6P6cLDc/rRUUP4ibNYv3JL8WvhHn
Eb9V6InXhcyuX4eopA==
=lq2m
-----END PGP SIGNATURE-----
Merge tag 'tags/upstream-4.20-rc1' of git://git.infradead.org/linux-ubifs
Pull UBIFS updates from Richard Weinberger:
- Full filesystem authentication feature, UBIFS is now able to have the
whole filesystem structure authenticated plus user data encrypted and
authenticated.
- Minor cleanups
* tag 'tags/upstream-4.20-rc1' of git://git.infradead.org/linux-ubifs: (26 commits)
ubifs: Remove unneeded semicolon
Documentation: ubifs: Add authentication whitepaper
ubifs: Enable authentication support
ubifs: Do not update inode size in-place in authenticated mode
ubifs: Add hashes and HMACs to default filesystem
ubifs: authentication: Authenticate super block node
ubifs: Create hash for default LPT
ubfis: authentication: Authenticate master node
ubifs: authentication: Authenticate LPT
ubifs: Authenticate replayed journal
ubifs: Add auth nodes to garbage collector journal head
ubifs: Add authentication nodes to journal
ubifs: authentication: Add hashes to index nodes
ubifs: Add hashes to the tree node cache
ubifs: Create functions to embed a HMAC in a node
ubifs: Add helper functions for authentication support
ubifs: Add separate functions to init/crc a node
ubifs: Format changes for authentication support
ubifs: Store read superblock node
ubifs: Drop write_node
...
Fixes: 33afdcc540 ("ext4: add a function which sets up group blocks ...")
Cc: stable@kernel.org # 3.3
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Currently bh is set to NULL only during first iteration of for cycle,
then this pointer is not cleared after end of using.
Therefore rollback after errors can lead to extra brelse(bh) call,
decrements bh counter and later trigger an unexpected warning in __brelse()
Patch moves brelse() calls in body of cycle to exclude requirement of
brelse() call in rollback.
Fixes: 33afdcc540 ("ext4: add a function which sets up group blocks ...")
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org # 3.3+
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAlvdEHkACgkQiiy9cAdy
T1ERqgwAoaYkwZhee/73MxC5N2JnBpRjP8Wwu6jo7q89DY35zI8NvLfy48VXNmnk
8Sp+zPWXxFgFLxBIGIOEwvfrT2kl84lFHprAAkZRKQcz5ZdeqIgoKYWw46Ezglxn
pyBpGFKNVKVSb0ZbAJgswB0KTCSqW2Hx6mf/lGxx1c2ao1kyXrvN4J2v9bK2UCbV
KC9xAgkkteypiTsB/afWQMAybDtran1TdOS7rm4aNT697KJ4jOPSQ7Bhh4CxT+UC
xSFtobCc/v42Si3oq0DNmFNQOdfj40Tg3RjHRHBOmcqN2b+kmGa/4mCUbVsFL08G
8gpeAHpRDnVwBAQsa6y5LuojOQ0szJzkyM9SZHTyCG3sQRGX3b6L+CYafobzrxnk
81Z0PtnDNvaoqG/6th/cQX/vz9LoiO5KG2dF1c9z+mK4m4D7JObuxaqcsrpb0uZl
fWn8AU900QOP/MrciWhj3II+pyYOV3iqKp0cpUgrP1a2P/Qxu3FkXGc4he/lWfKB
vSXGXf88
=Ilu6
-----END PGP SIGNATURE-----
Merge tag '4.20-rc1-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fixes and updates from Steve French:
"Three small fixes (one Kerberos related, one for stable, and another
fixes an oops in xfstest 377), two helpful debugging improvements,
three patches for cifs directio and some minor cleanup"
* tag '4.20-rc1-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
cifs: fix signed/unsigned mismatch on aio_read patch
cifs: don't dereference smb_file_target before null check
CIFS: Add direct I/O functions to file_operations
CIFS: Add support for direct I/O write
CIFS: Add support for direct I/O read
smb3: missing defines and structs for reparse point handling
smb3: allow more detailed protocol info on open files for debugging
smb3: on kerberos mount if server doesn't specify auth type use krb5
smb3: add trace point for tree connection
cifs: fix spelling mistake, EACCESS -> EACCES
cifs: fix return value for cifs_listxattr
syzbot is reporting too large memory allocation at bfs_fill_super() [1].
Since file system image is corrupted such that bfs_sb->s_start == 0,
bfs_fill_super() is trying to allocate 8MB of continuous memory. Fix
this by adding a sanity check on bfs_sb->s_start, __GFP_NOWARN and
printf().
[1] https://syzkaller.appspot.com/bug?id=16a87c236b951351374a84c8a32f40edbc034e96
Link: http://lkml.kernel.org/r/1525862104-3407-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reported-by: syzbot <syzbot+71c6b5d68e91149fc8a4@syzkaller.appspotmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Tigran Aivazian <aivazian.tigran@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
ocfs2_defrag_extent() might leak allocated clusters. When the file
system has insufficient space, the number of claimed clusters might be
less than the caller wants. If that happens, the original code might
directly commit the transaction without returning clusters.
This patch is based on code in ocfs2_add_clusters_in_btree().
[akpm@linux-foundation.org: include localalloc.h, reduce scope of data_ac]
Link: http://lkml.kernel.org/r/20180904041621.16874-3-lchen@suse.com
Signed-off-by: Larry Chen <lchen@suse.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Changwei Ge <ge.changwei@h3c.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The handling of timestamps outside of the 1970..2038 range in the dlm
glue is rather inconsistent: on 32-bit architectures, this has always
wrapped around to negative timestamps in the 1902..1969 range, while on
64-bit kernels all timestamps are interpreted as positive 34 bit numbers
in the 1970..2514 year range.
Now that the VFS code handles 64-bit timestamps on all architectures, we
can make the behavior more consistent here, and return the same result
that we had on 64-bit already, making the file system y2038 safe in the
process. Outside of dlmglue, it already uses 64-bit on-disk timestamps
anway, so that part is fine.
For consistency, I'm changing ocfs2_pack_timespec() to clamp anything
outside of the supported range to the minimum and maximum values. This
avoids a possible ambiguity of values before 1970 in particular, which
used to be interpreted as times at the end of the 2514 range previously.
Link: http://lkml.kernel.org/r/20180619155826.4106487-1-arnd@arndb.de
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Changwei Ge <ge.changwei@h3c.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
ocfs2_read_blocks() and ocfs2_read_blocks_sync() are both used to read
several blocks from disk. Currently, the input argument *bhs* can be
NULL or NOT. It depends on the caller's behavior. If the function
fails in reading blocks from disk, the corresponding bh will be assigned
to NULL and put.
Obviously, above process for non-NULL input bh is not appropriate.
Because the caller doesn't even know its bhs are put and re-assigned.
If buffer head is managed by caller, ocfs2_read_blocks and
ocfs2_read_blocks_sync() should not evaluate it to NULL. It will cause
caller accessing illegal memory, thus crash.
Link: http://lkml.kernel.org/r/HK2PR06MB045285E0F4FBB561F9F2F9B3D5680@HK2PR06MB0452.apcprd06.prod.outlook.com
Signed-off-by: Changwei Ge <ge.changwei@h3c.com>
Reviewed-by: Guozhonghua <guozhonghua@h3c.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Changwei Ge <ge.changwei@h3c.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Somehow, file system metadata was corrupted, which causes
ocfs2_check_dir_entry() to fail in function ocfs2_dir_foreach_blk_el().
According to the original design intention, if above happens we should
skip the problematic block and continue to retrieve dir entry. But
there is obviouse misuse of brelse around related code.
After failure of ocfs2_check_dir_entry(), current code just moves to
next position and uses the problematic buffer head again and again
during which the problematic buffer head is released for multiple times.
I suppose, this a serious issue which is long-lived in ocfs2. This may
cause other file systems which is also used in a the same host insane.
So we should also consider about bakcporting this patch into linux
-stable.
Link: http://lkml.kernel.org/r/HK2PR06MB045211675B43EED794E597B6D56E0@HK2PR06MB0452.apcprd06.prod.outlook.com
Signed-off-by: Changwei Ge <ge.changwei@h3c.com>
Suggested-by: Changkuo Shi <shi.changkuo@h3c.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When -EIOCBQUEUED returns, it means that aio_complete() will be called
from dio_complete(), which is an asynchronous progress against
write_iter. Generally, IO is a very slow progress than executing
instruction, but we still can't take the risk to access a freed iocb.
And we do face a BUG crash issue. Using the crash tool, iocb is
obviously freed already.
crash> struct -x kiocb ffff881a350f5900
struct kiocb {
ki_filp = 0xffff881a350f5a80,
ki_pos = 0x0,
ki_complete = 0x0,
private = 0x0,
ki_flags = 0x0
}
And the backtrace shows:
ocfs2_file_write_iter+0xcaa/0xd00 [ocfs2]
aio_run_iocb+0x229/0x2f0
do_io_submit+0x291/0x540
SyS_io_submit+0x10/0x20
system_call_fastpath+0x16/0x75
Link: http://lkml.kernel.org/r/1523361653-14439-1-git-send-email-ge.changwei@h3c.com
Signed-off-by: Changwei Ge <ge.changwei@h3c.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
During one dead node's recovery by other node, quota recovery work will
be queued. We should avoid calling quota when it is not supported, so
check the quota flags.
Link: http://lkml.kernel.org/r/71604351584F6A4EBAE558C676F37CA401071AC9FB@H3CMLB12-EX.srv.huawei-3com.com
Signed-off-by: guozhonghua <guozhonghua@h3c.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Changwei Ge <ge.changwei@h3c.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove ocfs2_is_o2cb_active(). We have similar functions to identify
which cluster stack is being used via osb->osb_cluster_stack.
Secondly, the current implementation of ocfs2_is_o2cb_active() is not
totally safe. Based on the design of stackglue, we need to get
ocfs2_stack_lock before using ocfs2_stack related data structures, and
that active_stack pointer can be NULL in the case of mount failure.
Link: http://lkml.kernel.org/r/1495441079-11708-1-git-send-email-ghe@suse.com
Signed-off-by: Gang He <ghe@suse.com>
Reviewed-by: Joseph Qi <jiangqi903@gmail.com>
Reviewed-by: Eric Ren <zren@suse.com>
Acked-by: Changwei Ge <ge.changwei@h3c.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The patch "CIFS: Add support for direct I/O read" had
a signed/unsigned mismatch (ssize_t vs. size_t) in the
return from one function. Similar trivial change
in aio_write
Signed-off-by: Long Li <longli@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reported-by: Julia Lawall <julia.lawall@lip6.fr>
There is a null check on dst_file->private data which suggests
it can be potentially null. However, before this check, pointer
smb_file_target is derived from dst_file->private and dereferenced
in the call to tlink_tcon, hence there is a potential null pointer
deference.
Fix this by assigning smb_file_target and target_tcon after the
null pointer sanity checks.
Detected by CoverityScan, CID#1475302 ("Dereference before null check")
Fixes: 04b38d6012 ("vfs: pull btrfs clone API to vfs layer")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
With direct read/write functions implemented, add them to file_operations.
Dircet I/O is used under two conditions:
1. When mounting with "cache=none", CIFS uses direct I/O for all user file
data transfer.
2. When opening a file with O_DIRECT, CIFS uses direct I/O for all data
transfer on this file.
Signed-off-by: Long Li <longli@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
With direct I/O write, user supplied buffers are pinned to the memory and data
are transferred directly from user buffers to the transport layer.
Change in v3: add support for kernel AIO
Change in v4:
Refactor common write code to __cifs_writev for direct and non-direct I/O.
Retry on direct I/O failure.
Signed-off-by: Long Li <longli@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
With direct I/O read, we transfer the data directly from transport layer to
the user data buffer.
Change in v3: add support for kernel AIO
Change in v4:
Refactor common read code to __cifs_readv for direct and non-direct I/O.
Retry on direct I/O failure.
Signed-off-by: Long Li <longli@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
We were missing some structs from MS-FSCC relating to
reparse point handling. Add them to protocol defines
in smb2pdu.h
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
In order to debug complex problems it is often helpful to
have detailed information on the client and server view
of the open file information. Add the ability for root to
view the list of smb3 open files and dump the persistent
handle and other info so that it can be more easily
correlated with server logs.
Sample output from "cat /proc/fs/cifs/open_files"
# Version:1
# Format:
# <tree id> <persistent fid> <flags> <count> <pid> <uid> <filename> <mid>
0x5 0x800000378 0x8000 1 7704 0 some-file 0x14
0xcb903c0c 0x84412e67 0x8000 1 7754 1001 rofile 0x1a6d
0xcb903c0c 0x9526b767 0x8000 1 7720 1000 file 0x1a5b
0xcb903c0c 0x9ce41a21 0x8000 1 7715 0 smallfile 0xd67
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Some servers (e.g. Azure) do not include a spnego blob in the SMB3
negotiate protocol response, so on kerberos mounts ("sec=krb5")
we can fail, as we expected the server to list its supported
auth types (OIDs in the spnego blob in the negprot response).
Change this so that on krb5 mounts we default to trying krb5 if the
server doesn't list its supported protocol mechanisms.
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
CC: Stable <stable@vger.kernel.org>
Trivial fix to a spelling mistake of the error access name EACCESS,
rename to EACCES
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
If the application buffer was too small to fit all the names
we would still count the number of bytes and return this for
listxattr. This would then trigger a BUG in usercopy.c
Fix the computation of the size so that we return -ERANGE
correctly when the buffer is too small.
This fixes the kernel BUG for xfstest generic/377
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAlvchGgQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpj/1D/4kEQx4ncnFoZk8QshHV1L++rH3BbcLjQDd
Wbh9ZSIQdI/gHTzS6bE7x3YfcbpMWPMO3+jFawdfRiFTEjlF8vQ+mnJ+Btb3z4D6
mGEeFGVhHExlp2a0x/Ma8YWVNlMB7BE8Tq73bZEVMY+9lbpmDW/vp7Sfa87LBDKQ
ZmY+My+VdHN7qLtQ7t3W/HtpbU+kcXMMd3ICjK4i+ofXy6mynk4+oQ2jwyXc5L86
UCJCsTsSRr3CgbnkW/uprHo0XHk8i7O/4C3oR+x4pAIxCCa9g+vmw0EO9fvi/2iQ
qe8jKdm7Y09xu/TiPBa7iz45tdh0cNMJKo3OezmSF9Np+r69KL5C/U4GRPKN3Iwm
keoqn14ScABkYMSe4ys1AdEgKD6bNUaW3r/lJxTH2oUR23mjnCLp7c4WD/G+MlbB
CzoakQyCHTZmDFLr2Kc8bkjmpil2T2UFfmLIDAu30LWIYeSGpiIO/V+g1foJMF2f
06ERltNvgX1BJjoh4NSWySLEf1ZtkUU60NeATRol6gwhnIyLrHsgfm6OEhqlW/7x
Xc1BWyzX7K6c3Dskk/u5aSRyXOyRC9KkMt3/2XexeDNHkte9yMH0IgSvopPBuER8
+iPvPjNp7ychTKZB3zpSnlqGgePTjbufIEBtO3OyUmDZKjUqxahtxkQfmPhoclu+
XdR4ArcqNg==
=0zM4
-----END PGP SIGNATURE-----
Merge tag 'for-linus-20181102' of git://git.kernel.dk/linux-block
Pull block layer fixes from Jens Axboe:
"The biggest part of this pull request is the revert of the blkcg
cleanup series. It had one fix earlier for a stacked device issue, but
another one was reported. Rather than play whack-a-mole with this,
revert the entire series and try again for the next kernel release.
Apart from that, only small fixes/changes.
Summary:
- Indentation fixup for mtip32xx (Colin Ian King)
- The blkcg cleanup series revert (Dennis Zhou)
- Two NVMe fixes. One fixing a regression in the nvme request
initialization in this merge window, causing nvme-fc to not work.
The other is a suspend/resume p2p resource issue (James, Keith)
- Fix sg discard merge, allowing us to merge in cases where we didn't
before (Jianchao Wang)
- Call rq_qos_exit() after the queue is frozen, preventing a hang
(Ming)
- Fix brd queue setup, fixing an oops if we fail setting up all
devices (Ming)"
* tag 'for-linus-20181102' of git://git.kernel.dk/linux-block:
nvme-pci: fix conflicting p2p resource adds
nvme-fc: fix request private initialization
blkcg: revert blkcg cleanups series
block: brd: associate with queue until adding disk
block: call rq_qos_exit() after queue is frozen
mtip32xx: clean an indentation issue, remove extraneous tabs
block: fix the DISCARD request merge
Rework the vfs_clone_file_range and vfs_dedupe_file_range infrastructure to use
a common .remap_file_range method and supply generic bounds and sanity checking
functions that are shared with the data write path. The current VFS
infrastructure has problems with rlimit, LFS file sizes, file time stamps,
maximum filesystem file sizes, stripping setuid bits, etc and so they are
addressed in these commits.
We also introduce the ability for the ->remap_file_range methods to return short
clones so that clones for vfs_copy_file_range() don't get rejected if the entire
range can't be cloned. It also allows filesystems to sliently skip deduplication
of partial EOF blocks if they are not capable of doing so without requiring
errors to be thrown to userspace.
All existing filesystems are converted to user the new .remap_file_range method,
and both XFS and ocfs2 are modified to make use of the new generic checking
infrastructure.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJb29gEAAoJEK3oKUf0dfodpOAQAL2VbHjvKXEwNMDTKscSRMmZ
Z0xXo3gamFKQ+VGOqy2g2lmAYQs9SAnTuCGTJ7zIAp7u+q8gzUy5FzKAwLS4Id6L
8siaY6nzlicfO04d0MdXnWz0f3xykChgzfdQfVUlUi7WrDioBUECLPmx4a+USsp1
DQGjLOZfoOAmn2rijdnH9RTEaHqg+8mcTaLN9TRav4gGqrWxldFKXw2y6ouFC7uo
/hxTRNXR9VI+EdbDelwBNXl9nU9gQA0WLOvRKwgUrtv6bSJohTPsmXt7EbBtNcVR
cl3zDNc1sLD1bLaRLEUAszI/33wXaaQgom1iB51obIcHHef+JxRNG/j6rUMfzxZI
VaauGv5EIvtaKN0LTAqVVLQ8t2MQFYfOr8TykmO+1UFog204aKRANdVMHDSjxD/0
dTGKJGcq+HnKQ+JHDbTdvuXEL8sUUl1FiLjOQbZPw63XmuddLKFUA2TOjXn6htbU
1h1MG5d9KjGLpabp2BQheczD08NuSmcrOBNt7IoeI3+nxr3HpMwprfB9TyaERy9X
iEgyVXmjjc9bLLRW7A2wm77aW64NvPs51wKMnvuNgNwnCewrGS6cB8WVj2zbQjH1
h3f3nku44s9ctNPSBzb/sJLnpqmZQ5t0oSmrMSN+5+En6rNTacoJCzxHRJBA7z/h
Z+C6y1GTZw0euY6Zjiwu
=CE/A
-----END PGP SIGNATURE-----
Merge tag 'xfs-4.20-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull vfs dedup fixes from Dave Chinner:
"This reworks the vfs data cloning infrastructure.
We discovered many issues with these interfaces late in the 4.19 cycle
- the worst of them (data corruption, setuid stripping) were fixed for
XFS in 4.19-rc8, but a larger rework of the infrastructure fixing all
the problems was needed. That rework is the contents of this pull
request.
Rework the vfs_clone_file_range and vfs_dedupe_file_range
infrastructure to use a common .remap_file_range method and supply
generic bounds and sanity checking functions that are shared with the
data write path. The current VFS infrastructure has problems with
rlimit, LFS file sizes, file time stamps, maximum filesystem file
sizes, stripping setuid bits, etc and so they are addressed in these
commits.
We also introduce the ability for the ->remap_file_range methods to
return short clones so that clones for vfs_copy_file_range() don't get
rejected if the entire range can't be cloned. It also allows
filesystems to sliently skip deduplication of partial EOF blocks if
they are not capable of doing so without requiring errors to be thrown
to userspace.
Existing filesystems are converted to user the new remap_file_range
method, and both XFS and ocfs2 are modified to make use of the new
generic checking infrastructure"
* tag 'xfs-4.20-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (28 commits)
xfs: remove [cm]time update from reflink calls
xfs: remove xfs_reflink_remap_range
xfs: remove redundant remap partial EOF block checks
xfs: support returning partial reflink results
xfs: clean up xfs_reflink_remap_blocks call site
xfs: fix pagecache truncation prior to reflink
ocfs2: remove ocfs2_reflink_remap_range
ocfs2: support partial clone range and dedupe range
ocfs2: fix pagecache truncation prior to reflink
ocfs2: truncate page cache for clone destination file before remapping
vfs: clean up generic_remap_file_range_prep return value
vfs: hide file range comparison function
vfs: enable remap callers that can handle short operations
vfs: plumb remap flags through the vfs dedupe functions
vfs: plumb remap flags through the vfs clone functions
vfs: make remap_file_range functions take and return bytes completed
vfs: remap helper should update destination inode metadata
vfs: pass remap flags to generic_remap_checks
vfs: pass remap flags to generic_remap_file_range_prep
vfs: combine the clone and dedupe into a single remap_file_range
...
Pull misc vfs updates from Al Viro:
"No common topic, really - a handful of assorted stuff; the least
trivial bits are Mark's dedupe patches"
* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fs/exofs: only use true/false for asignment of bool type variable
fs/exofs: fix potential memory leak in mount option parsing
Delete invalid assignment statements in do_sendfile
iomap: remove duplicated include from iomap.c
vfs: dedupe should return EPERM if permission is not granted
vfs: allow dedupe of user owned read-only files
ntfs: don't open-code ERR_CAST
ext4: don't open-code ERR_CAST
Pull AFS updates from Al Viro:
"AFS series, with some iov_iter bits included"
* 'work.afs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (26 commits)
missing bits of "iov_iter: Separate type from direction and use accessor functions"
afs: Probe multiple fileservers simultaneously
afs: Fix callback handling
afs: Eliminate the address pointer from the address list cursor
afs: Allow dumping of server cursor on operation failure
afs: Implement YFS support in the fs client
afs: Expand data structure fields to support YFS
afs: Get the target vnode in afs_rmdir() and get a callback on it
afs: Calc callback expiry in op reply delivery
afs: Fix FS.FetchStatus delivery from updating wrong vnode
afs: Implement the YFS cache manager service
afs: Remove callback details from afs_callback_break struct
afs: Commit the status on a new file/dir/symlink
afs: Increase to 64-bit volume ID and 96-bit vnode ID for YFS
afs: Don't invoke the server to read data beyond EOF
afs: Add a couple of tracepoints to log I/O errors
afs: Handle EIO from delivery function
afs: Fix TTL on VL server and address lists
afs: Implement VL server rotation
afs: Improve FS server rotation error handling
...
This is an effort to disentangle the include/linux/compiler*.h headers
and bring them up to date.
The main idea behind the series is to use feature checking macros
(i.e. __has_attribute) instead of compiler version checks (e.g. GCC_VERSION),
which are compiler-agnostic (so they can be shared, reducing the size
of compiler-specific headers) and version-agnostic.
Other related improvements have been performed in the headers as well,
which on top of the use of __has_attribute it has amounted to a significant
simplification of these headers (e.g. GCC_VERSION is now only guarding
a few non-attribute macros).
This series should also help the efforts to support compiling the kernel
with clang and icc. A fair amount of documentation and comments have also
been added, clarified or removed; and the headers are now more readable,
which should help kernel developers in general.
The series was triggered due to the move to gcc >= 4.6. In turn, this series
has also triggered Sparse to gain the ability to recognize __has_attribute
on its own.
Finally, the __nonstring variable attribute series has been also applied
on top; plus two related patches from Nick Desaulniers for unreachable()
that came a bit afterwards.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEPjU5OPd5QIZ9jqqOGXyLc2htIW0FAlvNpywACgkQGXyLc2ht
IW1aiQ/+P8SJOa3GkiH37/nrIbk/wgMNytbs+gxE5YPaU1DP74Mn1prJ4XhQQic9
/mt8GnitZwzEHWdsGEUk+ZQwnIa7ZEAmpecbAF206AMRbNxa14T5YwBx4bqWFjZp
sP4zPTHt3JCKL8TM+z26o152UbF2kc4WSxHjEjSFaqEnR2E5D0MwFeGPzc8fgWmS
pNyn3CidzB0TS1UF008YXhiJO6HIhFNPyhPawlhwbbdsdlhZ4u0JmwfqP4EvjRFM
kyzdQ9CDe+AgTTD9Y8HhtoUClaa7SJzFWNzpKIJMWt8jpKWYZQ/+WtwKg2cf+v3M
uwktcs3RI1dYrjcITLz4VJ0oVaRFnyGgXvMP4yqWQx429hqnd09WXhMioXQ1htoI
H0vpPIAPsK+dqVA9sP3JzMq4h6+dE7P364lkbThbVpYAGKZ52qaLt9ixT1mw1Q9f
a683ji6o02IVOGUNZ/3KAb5MqdhewNEDdZILZYRfm4AL1Em3WW9QVtIosHPviLgc
16VjA02wKdxIcg+1LZMTNhfybztnSCf7SuQurpH1zEqFDGzrXwB7nYFplEY7DrrD
cqhOA1fMQa++oQR+D40QDoY2ybqPOyvJG7z17pvtt+6jXep4yy2a3Bxf+ClK0nto
5yT7v9ikXJr84FOkk7OvktLlAWvcykvAdfvDepBZhpqhuX82tHY=
=Y8WB
-----END PGP SIGNATURE-----
Merge tag 'compiler-attributes-for-linus-4.20-rc1' of https://github.com/ojeda/linux
Pull compiler attribute updates from Miguel Ojeda:
"This is an effort to disentangle the include/linux/compiler*.h headers
and bring them up to date.
The main idea behind the series is to use feature checking macros
(i.e. __has_attribute) instead of compiler version checks (e.g.
GCC_VERSION), which are compiler-agnostic (so they can be shared,
reducing the size of compiler-specific headers) and version-agnostic.
Other related improvements have been performed in the headers as well,
which on top of the use of __has_attribute it has amounted to a
significant simplification of these headers (e.g. GCC_VERSION is now
only guarding a few non-attribute macros).
This series should also help the efforts to support compiling the
kernel with clang and icc. A fair amount of documentation and comments
have also been added, clarified or removed; and the headers are now
more readable, which should help kernel developers in general.
The series was triggered due to the move to gcc >= 4.6. In turn, this
series has also triggered Sparse to gain the ability to recognize
__has_attribute on its own.
Finally, the __nonstring variable attribute series has been also
applied on top; plus two related patches from Nick Desaulniers for
unreachable() that came a bit afterwards"
* tag 'compiler-attributes-for-linus-4.20-rc1' of https://github.com/ojeda/linux:
compiler-gcc: remove comment about gcc 4.5 from unreachable()
compiler.h: update definition of unreachable()
Compiler Attributes: ext4: remove local __nonstring definition
Compiler Attributes: auxdisplay: panel: use __nonstring
Compiler Attributes: enable -Wstringop-truncation on W=1 (gcc >= 8)
Compiler Attributes: add support for __nonstring (gcc >= 8)
Compiler Attributes: add MAINTAINERS entry
Compiler Attributes: add Doc/process/programming-language.rst
Compiler Attributes: remove uses of __attribute__ from compiler.h
Compiler Attributes: KENTRY used twice the "used" attribute
Compiler Attributes: use feature checks instead of version checks
Compiler Attributes: add missing SPDX ID in compiler_types.h
Compiler Attributes: remove unneeded sparse (__CHECKER__) tests
Compiler Attributes: homogenize __must_be_array
Compiler Attributes: remove unneeded tests
Compiler Attributes: always use the extra-underscores syntax
Compiler Attributes: remove unused attributes
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCW9tlWwAKCRDh3BK/laaZ
PEszAQDnwCMuh4WwhmS4d4X9/rLwzxBqFcHbBUMxvUSpD2LSvQD/fbV3EVkcTGUc
DVfV2e9Zdy2vq36fcR3EMa1oUGNzpgc=
=oscW
-----END PGP SIGNATURE-----
Merge tag 'ovl-update-4.20' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs
Pull overlayfs updates from Miklos Szeredi:
"A mix of fixes and cleanups"
* tag 'ovl-update-4.20' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
ovl: automatically enable redirect_dir on metacopy=on
ovl: check whiteout in ovl_create_over_whiteout()
ovl: using posix_acl_xattr_size() to get size instead of posix_acl_to_xattr()
ovl: abstract ovl_inode lock with a helper
ovl: remove the 'locked' argument of ovl_nlink_{start,end}
ovl: relax requirement for non null uuid of lower fs
ovl: fold copy-up helpers into callers
ovl: untangle copy up call chain
ovl: relax permission checking on underlying layers
ovl: fix recursive oi->lock in ovl_link()
vfs: fix FIGETBSZ ioctl on an overlayfs file
ovl: clean up error handling in ovl_get_tmpfile()
ovl: fix error handling in ovl_verify_set_fh()
Current behavior is to automatically disable metacopy if redirect_dir is
not enabled and proceed with the mount.
If "metacopy=on" mount option was given, then this behavior can confuse the
user: no mount failure, yet metacopy is disabled.
This patch makes metacopy=on imply redirect_dir=on.
The converse is also true: turning off full redirect with redirect_dir=
{off|follow|nofollow} will disable metacopy.
If both metacopy=on and redirect_dir={off|follow|nofollow} is specified,
then mount will fail, since there's no way to correctly resolve the
conflict.
Reported-by: Daniel Walsh <dwalsh@redhat.com>
Fixes: d5791044d2 ("ovl: Provide a mount option metacopy=on/off...")
Cc: <stable@vger.kernel.org> # v4.19
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
- Introduces the stackleak gcc plugin ported from grsecurity by Alexander
Popov, with x86 and arm64 support.
-----BEGIN PGP SIGNATURE-----
Comment: Kees Cook <kees@outflux.net>
iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAlvQvn4WHGtlZXNjb29r
QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJpSfD/sErFreuPT1beSw994Lr9Zx4k9v
ERsuXxWBENaJOJXbOOHMfVEcEeG/1uhPSp7hlw/dpHfh0anATTrcYqm8RNKbfK+k
o06+JK14OJfpm5Ghq/7OizhdNLCMT8wMU3XZtWfy65VSJGjEFx8Y48vMeQtpWtUK
ylSzi9JV6j2iUBF9oibtiT53+yqsqAtX80X1G7HRCgv9kxuKMhZr+Q5oGV6+ViyQ
Azj8mNn06iRnhHKd17WxDJr0GjSibzz4weS/9XgP3t3EcNWJo1EgBlD2KV3tOfP5
nzmqfqTqrcjxs/tyjdh6vVCSlYucNtyCQGn63qyShQYSg6mZwclR2fY8YSTw6PWw
GfYWFOWru9z+qyQmwFkQ9bSQS2R+JIT0oBCj9VmtF9XmPCy7K2neJsQclzSPBiCW
wPgXVQS4IA4684O5CmDOVMwmDpGvhdBNUR6cqSzGLxQOHY1csyXubMNUsqU3g9xk
Ob4pEy/xrrIw4WpwHcLHSEW5gV1/OLhsT0fGRJJiC947L3cN5s9EZp7FLbIS0zlk
qzaXUcLmn6AgcfkYwg5cI3RMLaN2V0eDCMVTWZJ1wbrmUV9chAaOnTPTjNqLOTht
v3b1TTxXG4iCpMmOFf59F8pqgAwbBDlfyNSbySZ/Pq5QH69udz3Z9pIUlYQnSJHk
u6q++2ReDpJXF81rBw==
=Ks6B
-----END PGP SIGNATURE-----
Merge tag 'stackleak-v4.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull stackleak gcc plugin from Kees Cook:
"Please pull this new GCC plugin, stackleak, for v4.20-rc1. This plugin
was ported from grsecurity by Alexander Popov. It provides efficient
stack content poisoning at syscall exit. This creates a defense
against at least two classes of flaws:
- Uninitialized stack usage. (We continue to work on improving the
compiler to do this in other ways: e.g. unconditional zero init was
proposed to GCC and Clang, and more plugin work has started too).
- Stack content exposure. By greatly reducing the lifetime of valid
stack contents, exposures via either direct read bugs or unknown
cache side-channels become much more difficult to exploit. This
complements the existing buddy and heap poisoning options, but
provides the coverage for stacks.
The x86 hooks are included in this series (which have been reviewed by
Ingo, Dave Hansen, and Thomas Gleixner). The arm64 hooks have already
been merged through the arm64 tree (written by Laura Abbott and
reviewed by Mark Rutland and Will Deacon).
With VLAs having been removed this release, there is no need for
alloca() protection, so it has been removed from the plugin"
* tag 'stackleak-v4.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
arm64: Drop unneeded stackleak_check_alloca()
stackleak: Allow runtime disabling of kernel stack erasing
doc: self-protection: Add information about STACKLEAK feature
fs/proc: Show STACKLEAK metrics in the /proc file system
lkdtm: Add a test for STACKLEAK
gcc-plugins: Add STACKLEAK plugin for tracking the kernel stack
x86/entry: Add STACKLEAK erasing the kernel stack at the end of syscalls
Trivial fix to a spelling mistake of the error access name EACCESS,
rename to EACCES
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCW9m8rQAKCRDh3BK/laaZ
POTeAP9DthScqnVxrRiyvORwffjTLOCijY4yAatgxTU5MO6TQgD/eeO62Exq5Cij
4uXCSNIzPVPKiimunVKYoDM8KmcNtAQ=
=z92F
-----END PGP SIGNATURE-----
Merge tag 'fuse-update-4.20' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
Pull fuse updates from Miklos Szeredi:
"As well as the usual bug fixes, this adds the following new features:
- cached readdir and readlink
- max I/O size increased from 128k to 1M
- improved performance and scalability of request queues
- copy_file_range support
The only non-fuse bits are trivial cleanups of macros in
<linux/bitops.h>"
* tag 'fuse-update-4.20' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: (31 commits)
fuse: enable caching of symlinks
fuse: only invalidate atime in direct read
fuse: don't need GETATTR after every READ
fuse: allow fine grained attr cache invaldation
bitops: protect variables in bit_clear_unless() macro
bitops: protect variables in set_mask_bits() macro
fuse: realloc page array
fuse: add max_pages to init_out
fuse: allocate page array more efficiently
fuse: reduce size of struct fuse_inode
fuse: use iversion for readdir cache verification
fuse: use mtime for readdir cache verification
fuse: add readdir cache version
fuse: allow using readdir cache
fuse: allow caching readdir
fuse: extract fuse_emit() helper
fuse: add FOPEN_CACHE_DIR
fuse: split out readdir.c
fuse: Use hash table to link processing request
fuse: kill req->intr_unique
...
- a series that fixes some old memory allocation issues in libceph
(myself). We no longer allocate memory in places where allocation
failures cannot be handled and BUG when the allocation fails.
- support for copy_file_range() syscall (Luis Henriques). If size and
alignment conditions are met, it leverages RADOS copy-from operation.
Otherwise, a local copy is performed.
- a patch that reduces memory requirement of ceph_sync_read() from the
size of the entire read to the size of one object (Zheng Yan).
- fallocate() syscall is now restricted to FALLOC_FL_PUNCH_HOLE (Luis
Henriques)
-----BEGIN PGP SIGNATURE-----
iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAlvZ6AcTHGlkcnlvbW92
QGdtYWlsLmNvbQAKCRBKf944AhHzi8H+B/9V/QB1BX5Q2DvkS3mcLNI2NphrppaD
VBuviwoIzaBm1paCrx40J/pCtsK1Fybl5dBAh1W0SDxEGR8JUA8GJw+oemtOS6pZ
DwjOF9S7uhzf5M3nQ9SvAbIudBISMZQRi22Y8fWs3k+yaECIz1J/pe7RiKo/GBAB
NnlbrZ1AYSB02chchVCSmWTApeIRp9JXnaM9xLMJWGVLL/vONjt3ltJ/w9haGYz8
FPFLPFeWobWqFElnOUomxU8Cv84DgPtH8si0UAn16jveractpFJWO4X6LDs/ZYDk
/MccfsB3EK9BCJdLJMoI0/lXxE33z3/MehmJDs9xGSX/N4N7UTF8Ve1b
=U91e
-----END PGP SIGNATURE-----
Merge tag 'ceph-for-4.20-rc1' of git://github.com/ceph/ceph-client
Pull ceph updates from Ilya Dryomov:
"The highlights are:
- a series that fixes some old memory allocation issues in libceph
(myself). We no longer allocate memory in places where allocation
failures cannot be handled and BUG when the allocation fails.
- support for copy_file_range() syscall (Luis Henriques). If size and
alignment conditions are met, it leverages RADOS copy-from
operation. Otherwise, a local copy is performed.
- a patch that reduces memory requirement of ceph_sync_read() from
the size of the entire read to the size of one object (Zheng Yan).
- fallocate() syscall is now restricted to FALLOC_FL_PUNCH_HOLE (Luis
Henriques)"
* tag 'ceph-for-4.20-rc1' of git://github.com/ceph/ceph-client: (25 commits)
ceph: new mount option to disable usage of copy-from op
ceph: support copy_file_range file operation
libceph: support the RADOS copy-from operation
ceph: add non-blocking parameter to ceph_try_get_caps()
libceph: check reply num_data_items in setup_request_data()
libceph: preallocate message data items
libceph, rbd, ceph: move ceph_osdc_alloc_messages() calls
libceph: introduce alloc_watch_request()
libceph: assign cookies in linger_submit()
libceph: enable fallback to ceph_msg_new() in ceph_msgpool_get()
ceph: num_ops is off by one in ceph_aio_retry_work()
libceph: no need to call osd_req_opcode_valid() in osd_req_encode_op()
ceph: set timeout conditionally in __cap_delay_requeue
libceph: don't consume a ref on pagelist in ceph_msg_data_add_pagelist()
libceph: introduce ceph_pagelist_alloc()
libceph: osd_req_op_cls_init() doesn't need to take opcode
libceph: bump CEPH_MSG_MAX_DATA_LEN
ceph: only allow punch hole mode in fallocate
ceph: refactor ceph_sync_read()
ceph: check if LOOKUPNAME request was aborted when filling trace
...
Merge more updates from Andrew Morton:
- the rest of MM
- lib/bitmap updates
- hfs updates
- fatfs updates
- various other misc things
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (94 commits)
mm/gup.c: fix __get_user_pages_fast() comment
mm: Fix warning in insert_pfn()
memory-hotplug.rst: add some details about locking internals
powerpc/powernv: hold device_hotplug_lock when calling memtrace_offline_pages()
powerpc/powernv: hold device_hotplug_lock when calling device_online()
mm/memory_hotplug: fix online/offline_pages called w.o. mem_hotplug_lock
mm/memory_hotplug: make add_memory() take the device_hotplug_lock
mm/memory_hotplug: make remove_memory() take the device_hotplug_lock
mm/memblock.c: warn if zero alignment was requested
memblock: stop using implicit alignment to SMP_CACHE_BYTES
docs/boot-time-mm: remove bootmem documentation
mm: remove include/linux/bootmem.h
memblock: replace BOOTMEM_ALLOC_* with MEMBLOCK variants
mm: remove nobootmem
memblock: rename __free_pages_bootmem to memblock_free_pages
memblock: rename free_all_bootmem to memblock_free_all
memblock: replace free_bootmem_late with memblock_free_late
memblock: replace free_bootmem{_node} with memblock_free
mm: nobootmem: remove bootmem allocation APIs
memblock: replace alloc_bootmem with memblock_alloc
...
Move remaining definitions and declarations from include/linux/bootmem.h
into include/linux/memblock.h and remove the redundant header.
The includes were replaced with the semantic patch below and then
semi-automated removal of duplicated '#include <linux/memblock.h>
@@
@@
- #include <linux/bootmem.h>
+ #include <linux/memblock.h>
[sfr@canb.auug.org.au: dma-direct: fix up for the removal of linux/bootmem.h]
Link: http://lkml.kernel.org/r/20181002185342.133d1680@canb.auug.org.au
[sfr@canb.auug.org.au: powerpc: fix up for removal of linux/bootmem.h]
Link: http://lkml.kernel.org/r/20181005161406.73ef8727@canb.auug.org.au
[sfr@canb.auug.org.au: x86/kaslr, ACPI/NUMA: fix for linux/bootmem.h removal]
Link: http://lkml.kernel.org/r/20181008190341.5e396491@canb.auug.org.au
Link: http://lkml.kernel.org/r/1536927045-23536-30-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Ley Foon Tan <lftan@altera.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Palmer Dabbelt <palmer@sifive.com>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Richard Kuo <rkuo@codeaurora.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Serge Semin <fancer.lancer@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
All architecures use memblock for early memory management. There is no need
for the CONFIG_HAVE_MEMBLOCK configuration option.
[rppt@linux.vnet.ibm.com: of/fdt: fixup #ifdefs]
Link: http://lkml.kernel.org/r/20180919103457.GA20545@rapoport-lnx
[rppt@linux.vnet.ibm.com: csky: fixups after bootmem removal]
Link: http://lkml.kernel.org/r/20180926112744.GC4628@rapoport-lnx
[rppt@linux.vnet.ibm.com: remove stale #else and the code it protects]
Link: http://lkml.kernel.org/r/1538067825-24835-1-git-send-email-rppt@linux.vnet.ibm.com
Link: http://lkml.kernel.org/r/1536927045-23536-4-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Tested-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Ley Foon Tan <lftan@altera.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Palmer Dabbelt <palmer@sifive.com>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Richard Kuo <rkuo@codeaurora.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Serge Semin <fancer.lancer@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add the fat-specific inode_operation ->update_time() and
fat_truncate_time() function to truncate the inode timestamps from 1
nanosecond to the appropriate granularity.
Link: http://lkml.kernel.org/r/38af1ba3c3cf0d7381ce7b63077ef8af75901532.1538363961.git.sorenson@redhat.com
Signed-off-by: Frank Sorenson <sorenson@redhat.com>
Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "fat: timestamp updates", v5.
fat/msdos timestamps are stored on-disk with several different
granularities, some of them lower resolution than timespec64_trunc() can
provide. In addition, they are only truncated as they are written to
disk, so the timestamps in-memory for new or modified files/directories
may be different from the same timestamps after a remount, as the
now-truncated times are re-read from the on-disk format.
These patches allow finer granularity for the timestamps where possible
and add fat-specific ->update_time inode operation and fat_truncate_time
functions to truncate each timestamp correctly, giving consistent times
across remounts.
This patch (of 4):
Move the calculation of the number of seconds in the timezone offset to a
common function.
Link: http://lkml.kernel.org/r/3671ff8cff5eeedbb85ebda5e4de0728920db4f6.1538363961.git.sorenson@redhat.com
Signed-off-by: Frank Sorenson <sorenson@redhat.com>
Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The file namei.c seems to have been renamed to namei_msdos.c, so I decided
to update the comment with the correct name, and expand it a bit to tell
the reader what to look for.
Link: http://lkml.kernel.org/r/20180928194947.23932-1-mihir@cs.utexas.edu
Signed-off-by: Mihir Mehta <mihir@cs.utexas.edu>
Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
cafa0010cd ("Raise the minimum required gcc version to 4.6") bumped the
minimum GCC version to 4.6 for all architectures.
The workaround code in fs/reiserfs/Makefile is obsolete now.
Link: http://lkml.kernel.org/r/1535337230-13222-1-git-send-email-yamada.masahiro@socionext.com
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
fill_with_dentries() failed to propagate errors up to
reiserfs_for_each_xattr() properly. Plumb them through.
Note that reiserfs_for_each_xattr() is only used by
reiserfs_delete_xattrs() and reiserfs_chown_xattrs(). The result of
reiserfs_delete_xattrs() is discarded anyway, the only difference there is
whether a warning is printed to dmesg. The result of
reiserfs_chown_xattrs() does matter because it can block chowning of the
file to which the xattrs belong; but either way, the resulting state can
have misaligned ownership, so my patch doesn't improve things greatly.
Credit for making me look at this code goes to Al Viro, who pointed out
that the ->actor calling convention is suboptimal and should be changed.
Link: http://lkml.kernel.org/r/20180802163335.83312-1-jannh@google.com
Signed-off-by: Jann Horn <jannh@google.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Jeff Mahoney <jeffm@suse.com>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently extent and index i are both being incremented causing an array
out of bounds read on extent[i]. Fix this by removing the extraneous
increment of extent.
Ernesto said:
: This is only triggered when deleting a file with a resource fork. I
: may be wrong because the documentation isn't clear, but I don't think
: you can create those under linux. So I guess nobody was testing them.
:
: > A disk space leak, perhaps?
:
: That's what it looks like in general. hfs_free_extents() won't do
: anything if the block count doesn't add up, and the error will be
: ignored. Now, if the block count randomly does add up, we could see
: some corruption.
Detected by CoverityScan, CID#711541 ("Out of bounds read")
Link: http://lkml.kernel.org/r/20180831140538.31566-1-colin.king@canonical.com
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Ernesto A. Fernndez <ernesto.mnd.fernandez@gmail.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Hin-Tak Leung <htl10@users.sourceforge.net>
Cc: Vyacheslav Dubeyko <slava@dubeyko.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The vfs takes care of updating mtime on ftruncate(), but on truncate() it
must be done by the module.
Link: http://lkml.kernel.org/r/e1611eda2985b672ed2d8677350b4ad8c2d07e8a.1539316825.git.ernesto.mnd.fernandez@gmail.com
Signed-off-by: Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.com>
Reviewed-by: Vyacheslav Dubeyko <slava@dubeyko.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The vfs takes care of updating ctime and mtime on ftruncate(), but on
truncate() it must be done by the module.
This patch can be tested with xfstests generic/313.
Link: http://lkml.kernel.org/r/9beb0913eea37288599e8e1b7cec8768fb52d1b8.1539316825.git.ernesto.mnd.fernandez@gmail.com
Signed-off-by: Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.com>
Reviewed-by: Vyacheslav Dubeyko <slava@dubeyko.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Direct writes to empty inodes fail with EIO. The generic direct-io code
is in part to blame (a patch has been submitted as "direct-io: allow
direct writes to empty inodes"), but hfs is worse affected than the other
filesystems because the fallback to buffered I/O doesn't happen.
The problem is the return value of hfs_get_block() when called with
!create. Change it to be more consistent with the other modules.
Link: http://lkml.kernel.org/r/4538ab8c35ea37338490525f0f24cbc37227528c.1539195310.git.ernesto.mnd.fernandez@gmail.com
Signed-off-by: Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.com>
Reviewed-by: Vyacheslav Dubeyko <slava@dubeyko.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Direct writes to empty inodes fail with EIO. The generic direct-io code
is in part to blame (a patch has been submitted as "direct-io: allow
direct writes to empty inodes"), but hfsplus is worse affected than the
other filesystems because the fallback to buffered I/O doesn't happen.
The problem is the return value of hfsplus_get_block() when called with
!create. Change it to be more consistent with the other modules.
Link: http://lkml.kernel.org/r/2cd1301404ec7cf1e39c8f11a01a4302f1460ad6.1539195310.git.ernesto.mnd.fernandez@gmail.com
Signed-off-by: Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.com>
Reviewed-by: Vyacheslav Dubeyko <slava@dubeyko.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Inserting a new record in a btree may require splitting several of its
nodes. If we hit ENOSPC halfway through, the new nodes will be left
orphaned and their records will be lost. This could mean lost inodes or
extents.
Henceforth, check the available disk space before making any changes.
This still leaves the potential problem of corruption on ENOMEM.
There is no need to reserve space before deleting a catalog record, as we
do for hfsplus. This difference is because hfs index nodes have fixed
length keys.
Link: http://lkml.kernel.org/r/ab5fc8a7d5ffccfd5f27b1cf2cb4ceb6c110da74.1536269131.git.ernesto.mnd.fernandez@gmail.com
Signed-off-by: Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Inserting or deleting a record in a btree may require splitting several of
its nodes. If we hit ENOSPC halfway through, the new nodes will be left
orphaned and their records will be lost. This could mean lost inodes,
extents or xattrs.
Henceforth, check the available disk space before making any changes.
This still leaves the potential problem of corruption on ENOMEM.
The patch can be tested with xfstests generic/027.
Link: http://lkml.kernel.org/r/4596eef22fbda137b4ffa0272d92f0da15364421.1536269129.git.ernesto.mnd.fernandez@gmail.com
Signed-off-by: Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
hfs_brec_update_parent() may hit BUG_ON() if the first record of both a
leaf node and its parent are changed, and if this forces the parent to
be split. It is not possible for this to happen on a valid hfs
filesystem because the index nodes have fixed length keys.
For reasons I ignore, the hfs module does have support for a number of
hfsplus features. A corrupt btree header may report variable length
keys and trigger this BUG, so it's better to fix it.
Link: http://lkml.kernel.org/r/cf9b02d57f806217a2b1bf5db8c3e39730d8f603.1535682463.git.ernesto.mnd.fernandez@gmail.com
Signed-off-by: Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Viacheslav Dubeyko <slava@dubeyko.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This bug is triggered whenever hfs_brec_update_parent() needs to split
the root node. The height of the btree is not increased, which leaves
the new node orphaned and its records lost. It is not possible for this
to happen on a valid hfs filesystem because the index nodes have fixed
length keys.
For reasons I ignore, the hfs module does have support for a number of
hfsplus features. A corrupt btree header may report variable length
keys and trigger this bug, so it's better to fix it.
Link: http://lkml.kernel.org/r/9750b1415685c4adca10766895f6d5ef12babdb0.1535682463.git.ernesto.mnd.fernandez@gmail.com
Signed-off-by: Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Creating, renaming or deleting a file may hit BUG_ON() if the first
record of both a leaf node and its parent are changed, and if this
forces the parent to be split. This bug is triggered by xfstests
generic/027, somewhat rarely; here is a more reliable reproducer:
truncate -s 50M fs.iso
mkfs.hfsplus fs.iso
mount fs.iso /mnt
i=1000
while [ $i -le 2400 ]; do
touch /mnt/$i &>/dev/null
((++i))
done
i=2400
while [ $i -ge 1000 ]; do
mv /mnt/$i /mnt/$(perl -e "print $i x61") &>/dev/null
((--i))
done
The issue is that a newly created bnode is being put twice. Reset
new_node to NULL in hfs_brec_update_parent() before reaching goto again.
Link: http://lkml.kernel.org/r/5ee1db09b60373a15890f6a7c835d00e76bf601d.1535682461.git.ernesto.mnd.fernandez@gmail.com
Signed-off-by: Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Creating, renaming or deleting a file may cause catalog corruption and
data loss. This bug is randomly triggered by xfstests generic/027, but
here is a faster reproducer:
truncate -s 50M fs.iso
mkfs.hfsplus fs.iso
mount fs.iso /mnt
i=100
while [ $i -le 150 ]; do
touch /mnt/$i &>/dev/null
((++i))
done
i=100
while [ $i -le 150 ]; do
mv /mnt/$i /mnt/$(perl -e "print $i x82") &>/dev/null
((++i))
done
umount /mnt
fsck.hfsplus -n fs.iso
The bug is triggered whenever hfs_brec_update_parent() needs to split the
root node. The height of the btree is not increased, which leaves the new
node orphaned and its records lost.
Link: http://lkml.kernel.org/r/26d882184fc43043a810114258f45277752186c7.1535682461.git.ernesto.mnd.fernandez@gmail.com
Signed-off-by: Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kaixuxia repors that it's possible to crash overlayfs by removing the
whiteout on the upper layer before creating a directory over it. This is a
reproducer:
mkdir lower upper work merge
touch lower/file
mount -t overlay overlay -olowerdir=lower,upperdir=upper,workdir=work merge
rm merge/file
ls -al merge/file
rm upper/file
ls -al merge/
mkdir merge/file
Before commencing with a vfs_rename(..., RENAME_EXCHANGE) verify that the
lookup of "upper" is positive and is a whiteout, and return ESTALE
otherwise.
Reported by: kaixuxia <xiakaixu1987@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: e9be9d5e76 ("overlay filesystem")
Cc: <stable@vger.kernel.org> # v3.18
already supported COPY, by copying a limited amount of data and then
returning a short result, letting the client resend. The asynchronous
protocol should offer better performance at the expense of some
complexity.
The other highlight is Trond's work to convert the duplicate reply cache
to a red-black tree, and to move it and some other server caches to RCU.
(Previously these have meant taking global spinlocks on every RPC.)
Otherwise, some RDMA work and miscellaneous bugfixes.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJb2KWzAAoJECebzXlCjuG+gcQP/3DldB86CFxgSFx0t+h+s+TV
CdYJDPyLyRkEMiD+4dCPPuhueve+j5BPHVsDbn98FTWrEn131NMIs6uhU/VGTtAU
6a8f/ExtZ5U7s39MJCzlk2ozVElBc3QPp7p3p9NKn0Wi0PXbVgjuIqR5o2vwa8Si
KOVdLm6ylfav/HTH8DO6zFPJRsTgTwcJOivXXshjpglMKAcw8AuqSsGgBrDeGpgU
u91Vi0EM1vt96+CA6a01mTgC/sFX7EqGvxUUHOrKWf5cIjnpT3FDvouYPxi+GH8Z
SIDlaMQyXF5m4m6MhELNTP4v97XAHyPJtvLkEe5lggTyABPiA2heo9e8onysWkzV
1v8OZHCVFa1UL34mDlnFxbFCYVr7FFKMGjTBR/ntinobPfAbWRCO1Hdd+bBGPDD4
byf7ctDVp7KQ2bSatIdlYavikuGDHWFDZHzPHlqkD3gpIZSNvhe26sV3NZqIFlXO
cMUega2Y5mXmULauHhxAcNGtDK7dF5hHoMWKJy0DNxiyDiDLylwDOIfwt1De3Q7V
ycd/wUytUS2LkAhyS2mvoDK6eXTBAeQwzmXAqveh6rewwO83HC/t9mtKBBDomvKG
xRpRPmmbj9ijbwkilEBmijjR47wrihmEVIFahznEerZ+//QOfVVOB0MNtzIyU9/k
CnP1ZNvOs3LR1pxxwFa8
=TTo0
-----END PGP SIGNATURE-----
Merge tag 'nfsd-4.20' of git://linux-nfs.org/~bfields/linux
Pull nfsd updates from Bruce Fields:
"Olga added support for the NFSv4.2 asynchronous copy protocol. We
already supported COPY, by copying a limited amount of data and then
returning a short result, letting the client resend. The asynchronous
protocol should offer better performance at the expense of some
complexity.
The other highlight is Trond's work to convert the duplicate reply
cache to a red-black tree, and to move it and some other server caches
to RCU. (Previously these have meant taking global spinlocks on every
RPC)
Otherwise, some RDMA work and miscellaneous bugfixes"
* tag 'nfsd-4.20' of git://linux-nfs.org/~bfields/linux: (30 commits)
lockd: fix access beyond unterminated strings in prints
nfsd: Fix an Oops in free_session()
nfsd: correctly decrement odstate refcount in error path
svcrdma: Increase the default connection credit limit
svcrdma: Remove try_module_get from backchannel
svcrdma: Remove ->release_rqst call in bc reply handler
svcrdma: Reduce max_send_sges
nfsd: fix fall-through annotations
knfsd: Improve lookup performance in the duplicate reply cache using an rbtree
knfsd: Further simplify the cache lookup
knfsd: Simplify NFS duplicate replay cache
knfsd: Remove dead code from nfsd_cache_lookup
SUNRPC: Simplify TCP receive code
SUNRPC: Replace the cache_detail->hash_lock with a regular spinlock
SUNRPC: Remove non-RCU protected lookup
NFS: Fix up a typo in nfs_dns_ent_put
NFS: Lockless DNS lookups
knfsd: Lockless lookup of NFSv4 identities.
SUNRPC: Lockless server RPCSEC_GSS context lookup
knfsd: Allow lockless lookups of the exports
...
plus trivial indentation fixes.
-----BEGIN PGP SIGNATURE-----
iQEcBAABAgAGBQJb2KX1AAoJEH9KYoIL9GO3lfcH/R1eClPzhBbOj/MZWgxoB5VN
lsOFl2XYc8bhThxOygqKcLpa2De5Q6ebrvLFgQ43erO7MaXzI4mswM5+azIlLXx3
iVuE1NIYze5g92yvb4mLeDHGVid4EjGoG1tiGRxuU18j02Nze1B7t22tBzcYUCyi
buJfx0A37aMepd/+cy3Qp4G03hgaNMama1220AR0S0kkORIBZFzKQOAKN6r8DGa/
05QhmtJJQsLJJxyLDv6lKmy0Ef42COeDICpYUlQ1LvoxJJBAblDBzlkYl7ulORwV
f147xPV+v/jlE8CktOtN31S8x+XRvbbqm9sKLB0XKnA9vz89WAl1BzoZ/7FZf/Y=
=aGIT
-----END PGP SIGNATURE-----
Merge tag 'cramfs_fixes' of git://git.linaro.org/people/nicolas.pitre/linux
Pull cramfs fixes from Nicolas Pitre:
"Make the Cramfs code more robust against filesystem corruptions, plus
trivial indentation fixes"
* tag 'cramfs_fixes' of git://git.linaro.org/people/nicolas.pitre/linux:
Cramfs: trivial whitespace fixes
Cramfs: fix abad comparison when wrap-arounds occur
It is possible for corrupted filesystem images to produce very large
block offsets that may wrap when a length is added, and wrongly pass
the buffer size test.
Reported-by: Anatoly Trosinenko <anatoly.trosinenko@gmail.com>
Signed-off-by: Nicolas Pitre <nico@linaro.org>
Cc: stable@vger.kernel.org
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAlvYVlMACgkQxWXV+ddt
WDv9xxAAmN+R9y+wOKjPkDoM7jr8hRR12YnTC8R4X8oD8QTnSXWOrmfO2prYpe7d
RyUxpuhqY+q+qvCxkp+BREa86a0zswhn/Z6HfLbHn4CaEhtchkMKR/gFOiYeL2B1
ZIJtqgnqOGP3N1oxfn3Zr586W3ECUJq+4EUD/1OWCxZHvn1DWWd7L3VL0884hAhE
kDVWhMdBm0nX1SOet/8haI0N98NLdyltsGdz80ooi65qR52YE4u2IoqXEg2z0AEM
EApA6vQeOIIuZaRznIl2xFiIMbQCoMRb2sQgwIPmWoXrfboJUHyHfFrKRv5gGUHg
DXjOXTvVdu9EEqm+1HughwZL/KRkr+OcXHHWwP+v51zsiyfbic+fegpM6a+Z0NjD
LCo5D1NSLulhpZHr14F3qM27+LYHEC4xxXrrzRoVq4DCoSq7xgj3ip49uXe1F4Rw
AyLeJGGOp8aqvPiD0BfgMVi4+YhWJUd/ob9Ldn9z+2y0XGQ2FDM58iCt+49+YIQi
e2ywGaHt3aXghPAo/mvnckfZMLNZ7DJPwA7K6ayJ3N23dqGW2CORkKrGy7xVGoZn
2AjIN1pSRLlknQJZsa6Yp1mPxnrBQfutTVxxUfKOtmEzydxMVS0g92+Lu/JRb4pu
F/tpq/lC7dpTvP08EWw0sLjIhLeqMKzbXk38pSfUm39yDgQ10e8=
=CiDs
-----END PGP SIGNATURE-----
Merge tag 'for-4.20-part2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull more btrfs updates from David Sterba:
"This contains a few minor updates and fixes that were under testing or
arrived shortly after the merge window freeze, mostly stable material"
* tag 'for-4.20-part2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
Btrfs: fix use-after-free when dumping free space
Btrfs: fix use-after-free during inode eviction
btrfs: move the dio_sem higher up the callchain
btrfs: don't run delayed_iputs in commit
btrfs: fix insert_reserved error handling
btrfs: only free reserved extent if we didn't insert it
btrfs: don't use ctl->free_space for max_extent_size
btrfs: set max_extent_size properly
btrfs: reset max_extent_size properly
MAINTAINERS: update my email address for btrfs
btrfs: delayed-ref: extract find_first_ref_head from find_ref_head
Btrfs: fix deadlock when writing out free space caches
Btrfs: fix assertion on fsync of regular file when using no-holes feature
Btrfs: fix null pointer dereference on compressed write path error
Now that the vfs remap helper dirties the inode [cm]time for us, xfs no
longer needs to do that on its own.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Since xfs_file_remap_range is a thin wrapper, move the contents of
xfs_reflink_remap_range into the shell. This cuts down on the vfs
calls being made from internal xfs code.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Now that we've moved the partial EOF block checks to the VFS helpers, we
can remove the redundant functionality from XFS.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Back when the XFS reflink code only supported clone_file_range, we were
only able to return zero or negative error codes to userspace. However,
now that copy_file_range (which returns bytes copied) can use XFS'
clone_file_range, we have the opportunity to return partial results.
For example, if userspace sends a 1GB clone request and we run out of
space halfway through, we at least can tell userspace that we completed
512M of that request like a regular write.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Move the offset <-> blocks unit conversions into
xfs_reflink_remap_blocks to make the call site less ugly.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Prior to remapping blocks, it is necessary to remove pages from the
destination file's page cache. Unfortunately, the truncation is not
aggressive enough -- if page size > block size, we'll end up zeroing
subpage blocks instead of removing them. So, round the start offset
down and the end offset up to page boundaries. We already wrote all
the dirty data so the larger range shouldn't be a problem.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Since ocfs2_remap_file_range is a thin shell around
ocfs2_remap_remap_range, move everything from the latter into the
former.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Change the ocfs2 remap code to allow for returning partial results.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Prior to remapping blocks, it is necessary to remove pages from the
destination file's page cache. Unfortunately, the truncation is not
aggressive enough -- if page size > block size, we'll end up zeroing
subpage blocks instead of removing them. So, round the start offset
down and the end offset up to page boundaries. We already wrote all
the dirty data so the larger range should be fine.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
When cloning blocks into another file, truncate the page cache before we
start remapping blocks so that concurrent reads wait for us to finish.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Since the remap prep function can update the length of the remap
request, we can change this function to return the usual return status
instead of the odd behavior it has now.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
There are no callers of vfs_dedupe_file_range_compare, so we might as
well make it a static helper and remove the export.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Plumb in a remap flag that enables the filesystem remap handler to
shorten remapping requests for callers that can handle it. Now
copy_file_range can report partial success (in case we run up against
alignment problems, resource limits, etc.).
We also enable CAN_SHORTEN for fideduperange to maintain existing
userspace-visible behavior where xfs/btrfs shorten the dedupe range to
avoid stale post-eof data exposure.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Plumb a remap_flags argument through the vfs_dedupe_file_range_one
functions so that dedupe can take advantage of it.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Plumb a remap_flags argument through the {do,vfs}_clone_file_range
functions so that clone can take advantage of it.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Change the remap_file_range functions to take a number of bytes to
operate upon and return the number of bytes they operated on. This is a
requirement for allowing fs implementations to return short clone/dedupe
results to the user, which will enable us to obey resource limits in a
graceful manner.
A subsequent patch will enable copy_file_range to signal to the
->clone_file_range implementation that it can handle a short length,
which will be returned in the function's return value. For now the
short return is not implemented anywhere so the behavior won't change --
either copy_file_range manages to clone the entire range or it tries an
alternative.
Neither clone ioctl can take advantage of this, alas.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Extend generic_remap_file_range_prep to handle inode metadata updates
when remapping into a file. If the operation can possibly alter the
file contents, we must update the ctime and mtime and remove security
privileges, just like we do for regular file writes.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Pass the same remap flags to generic_remap_checks for consistency.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Plumb the remap flags through the filesystem from the vfs function
dispatcher all the way to the prep function to prepare for behavior
changes in subsequent patches.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Combine the clone_file_range and dedupe_file_range operations into a
single remap_file_range file operation dispatch since they're
fundamentally the same operation. The differences between the two can
be made in the prep functions.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Since we use clone_verify_area for both clone and dedupe range checks,
rename the function to make it clear that it's for both.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
The vfs_clone_file_prep is a generic function to be called by filesystem
implementations only. Rename the prefix to generic_ and make it more
clear that it applies to remap operations, not just clones.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Don't bother calling the filesystem for a zero-length dedupe request;
we can return zero and exit.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
A deduplication data corruption is exposed in XFS and btrfs. It is
caused by extending the block match range to include the partial EOF
block, but then allowing unknown data beyond EOF to be considered a
"match" to data in the destination file because the comparison is only
made to the end of the source file. This corrupts the destination file
when the source extent is shared with it.
The VFS remapping prep functions only support whole block dedupe, but
we still need to appear to support whole file dedupe correctly. Hence
if the dedupe request includes the last block of the souce file, don't
include it in the actual dedupe operation. If the rest of the range
dedupes successfully, then reject the entire request. A subsequent
patch will enable us to shorten dedupe requests correctly.
When reflinking sub-file ranges, a data corruption can occur when the
source file range includes a partial EOF block. This shares the unknown
data beyond EOF into the second file at a position inside EOF, exposing
stale data in the second file.
If the reflink request includes the last block of the souce file, only
proceed with the reflink operation if it lands at or past the
destination file's current EOF. If it lands within the destination file
EOF, reject the entire request with -EINVAL and make the caller go the
hard way. A subsequent patch will enable us to shorten reflink requests
correctly.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
If a remap caller asks us to remap to the source file's EOF and the
source file length leaves us with a zero byte request, exit early.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Move the file range checks from vfs_clone_file_prep into a separate
generic_remap_checks function so that all the checks are collected in a
central location. This forms the basis for adding more checks from
generic_write_checks that will make cloning's input checking more
consistent with write input checking.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
vfs_clone_file_prep_inodes cannot return 0 if it is asked to remap from
a zero byte file because that's what btrfs does.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJb12wyAAoJEAhfPr2O5OEVQG4P/3QlXjec6qlhbo6UPs54E2sC
bVdZfp3mobo8NmLRt791Yh9cc0rN45Tlf2BT8XEmCyI6+NB++obU/j0LW5XT7sp7
oE8IgeRraVFWH/Xl9lTgP15Cs6v43eyvP12xgRWBmr+TYugLHDVTheGBvU/COb3d
yaykUULezuOMLA3HsPbz5EJOmU5rZ/Wa1w1sAiNJY/cRohfVb3kO4593enwUTMSx
yHJ+AVjl/Dn3RV4yLwoybpxPH6XIb3KoLg/6Fx8bOlKy1sg0mcWpzQ1CvMUNpXTF
kdwTw3ri1bfYnjChZewuKoJU8Wcw0Gt7pkqAhULN1ieo84MNA3bNor56pdRPaOZW
KxzlXZRS6xgYW8bzZ51N0Ku6fwSt3AWRE7TeKcrHF84Yb8vOtPS15sp3qc+9o9rb
EDV/lJLcz4bbi3W28di5WMFaN7LHxCHnRV7GvrcNQm6Im62CBFZHiI7jKjMv3tXp
Taes0utMPGfWuY6fv4LmuBzFG4nGB6/H4RiVvL1cLkjnx/FJtWGH+1uOcKDraKeI
ENBrK0VYrNH7nCDGNehiamStcVK+27tS+xsuqoZkGz6RA8vAxYBXTIZULXA98BPA
f6NC32ZNJaruxh4qh5tUy+LKPGzXs0sWa9kfgKmFfaOndFLMjGTXHpAT5AYJMbNe
iqKi/4aXD4aKAWTA7PPg
=Cc6D
-----END PGP SIGNATURE-----
Merge tag 'media/v4.20-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media
Pull media updates from Mauro Carvalho Chehab:
- new dvb frontend driver: lnbh29
- new sensor drivers: imx319 and imx 355
- some old soc_camera driver renames to avoid conflict with new
drivers
- new i.MX Pixel Pipeline (PXP) mem-to-mem platform driver
- a new V4L2 frontend for the FWHT codec
- several other improvements, bug fixes, code cleanups, etc
* tag 'media/v4.20-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media: (289 commits)
media: rename soc_camera I2C drivers
media: cec: forgot to cancel delayed work
media: vivid: Support 480p for webcam capture
media: v4l2-tpg: fix kernel oops when enabling HFLIP and OSD
media: vivid: Add 16-bit bayer to format list
media: v4l2-tpg-core: Add 16-bit bayer
media: pvrusb2: replace `printk` with `pr_*`
media: venus: vdec: fix decoded data size
media: cx231xx: fix potential sign-extension overflow on large shift
media: dt-bindings: media: rcar_vin: add device tree support for r8a7744
media: isif: fix a NULL pointer dereference bug
media: exynos4-is: make const array config_ids static
media: cx23885: make const array addr_list static
media: ivtv: make const array addr_list static
media: bttv-input: make const array addr_list static
media: cx18: Don't check for address of video_dev
media: dw9807-vcm: Fix probe error handling
media: dw9714: Remove useless error message
media: dw9714: Fix error handling in probe function
media: cec: name for RC passthrough device does not need 'RC for'
...
printk format used %*s instead of %.*s, so hostname_len does not limit
the number of bytes accessed from hostname.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
alloc_init_deleg() both allocates an nfs4_delegation, and
bumps the refcount on odstate. So after this point, we need to
put_clnt_odstate() and nfs4_put_stid() to not leave the odstate
refcount inappropriately bumped.
Signed-off-by: Andrew Elble <aweits@rit.edu>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Replace "fallthru" with a proper "fall through" annotation.
Also, add an annotation were it is expected to fall through.
These fixes are part of the ongoing efforts to enabling
-Wimplicit-fallthrough
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Use an rbtree to ensure the lookup/insert of an entry in a DRC bucket is
O(log(N)).
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Order the structure so that the key can be compared using memcmp().
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Simplify the duplicate replay cache by initialising the preallocated
cache entry, so that we can use it as a key for the cache lookup.
Note that the 99.999% case we want to optimise for is still the one
where the lookup fails, and we have to add this entry to the cache,
so preinitialising should not cause a performance penalty.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
The preallocated cache entry is always set to type RC_NOCACHE, and that
type isn't changed until we later call nfsd_cache_update().
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
call_rcu() needs to take a first argument of type (struct rcu_head *).
Fixes: fd497f1e40d9 ("NFS: Lockless DNS lookups")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Enable RCU protected lookup in the legacy DNS resolver.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Enable RCU protected lookups of the NFSv4 idmap.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Convert structs svc_expkey and svc_export to allow RCU protected lookups.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAlvW43UACgkQnJ2qBz9k
QNnqtQgA2uzRlz6U7UayQl9JiYKd2XbuojmAE+irdSL5+4OzpqkOsRfLzGSKAfvs
ekv1eVv+4+PS90FUNbvwmX/OzZ9wi3e5d3/qfnJ7l2ZsMfKuc9aW/9I8EXPXkpAB
O7NgoTtOZMXTJXMhseMmha2JfpbQZZ566NzCDfhOfPKqylbjTEM58pcY382VGRDX
Iv1DNwrzPw7PaOOYO3P/vWLeb4GGpMkdG61eoTBMi6SKb/5QMc6MS+WqYzmdLZWE
aP4tK8VhC0L47i0myXzWOHMrjQysq+E24CuQ6zG2O4bFRZj1fT+hiST9SwyUim2+
Ne8P5gnHJiBSZgYtKBoETNI0jORzCA==
=9DMi
-----END PGP SIGNATURE-----
Merge tag 'filesystems_for_v4.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull ext2 and udf updates from Jan Kara:
"Small ext2 cleanups and a couple of udf fixes"
* tag 'filesystems_for_v4.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
ext2: remove redundant building macro check
udf: Drop pack pragma from udf_sb.h
udf: Drop freed bitmap / table support
udf: Fix crash during mount
udf: Prevent write-unsupported filesystem to be remounted read-write
ext2: cache NULL when both default_acl and acl are NULL
udf: remove unused variables group_start and nr_groups
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAlvWyDMACgkQnJ2qBz9k
QNnifgf+PXybPXX3KxtRUmK4u2zX2JMTwzuE0wmLxM6I08tf7rzLrBIbOY7iXka/
nzW6IK+KnA5HtPTEUbxqNBAvWpUAvPLZ/v20d0t/QTMJcz8yfhpvM9O2mjQAGMH8
EBmjjEhZaso8uOIAPhUg9um1QdQoYWa329fsoQuHor9kjKmDg+3RmtdH0jbRzQ6B
RNAY1WNFbm+7MH7Fu3AB/jLqqkwZhoPcu7TwXP6m+va6xAvzEYUOQQB9rPEIaY2Z
+q0B9LhwFIAnWPCI7dxw3CBTndoR2u1vkpnGw5FFhJgnMG4L1QMPoCCYPIZEIXg/
VuGZQ0/mayCtO+JWw+VDJF3jQFrHxA==
=J6tx
-----END PGP SIGNATURE-----
Merge tag 'for_v4.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull fsnotify updates from Jan Kara:
"Amir's patches to implement superblock fanotify watches, Xiaoming's
patch to enable reporting of thread IDs in fanotify events instead of
TGIDs (sadly the patch got mis-attributed to Amir and I've noticed
only now), and a fix of possible oops on umount caused by fsnotify
infrastructure"
* tag 'for_v4.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
fsnotify: Fix busy inodes during unmount
fs: group frequently accessed fields of struct super_block together
fanotify: support reporting thread id instead of process id
fanotify: add BUILD_BUG_ON() to count the bits of fanotify constants
fsnotify: convert runtime BUG_ON() to BUILD_BUG_ON()
fanotify: deprecate uapi FAN_ALL_* constants
fanotify: simplify handling of FAN_ONDIR
fsnotify: generalize handling of extra event flags
fanotify: fix collision of internal and uapi mark flags
fanotify: store fanotify_init() flags in group's fanotify_data
fanotify: add API to attach/detach super block mark
fsnotify: send path type events to group with super block marks
fsnotify: add super block object type
* Finish removing the custom 9p request cache mechanism
* Embed part of the fcall in the request to have better slab
performance (msize usually is power of two aligned)
* syzkaller fixes:
- add a refcount to 9p requests to avoid use after free
- a few double free issues
* A few coverity fixes
* Some old patches that were in the bugzilla:
- do not trust pdu content for size header
- mount option for lock retry interval
----------------------------------------------------------------
Dan Carpenter (1):
9p: potential NULL dereference
Dinu-Razvan Chis-Serban (1):
9p locks: add mount option for lock retry interval
Dominique Martinet (12):
9p/xen: fix check for xenbus_read error in front_probe
v9fs_dir_readdir: fix double-free on p9stat_read error
9p: clear dangling pointers in p9stat_free
9p: embed fcall in req to round down buffer allocs
9p: add a per-client fcall kmem_cache
9p/rdma: do not disconnect on down_interruptible EAGAIN
9p: acl: fix uninitialized iattr access
9p/rdma: remove useless check in cm_event_handler
9p: p9dirent_read: check network-provided name length
9p locks: fix glock.client_id leak in do_lock
9p/trans_fd: abort p9_read_work if req status changed
9p/trans_fd: put worker reqs on destroy
Gertjan Halkes (1):
9p: do not trust pdu content for stat item size
Gustavo A. R. Silva (1):
9p: fix spelling mistake in fall-through annotation
Matthew Wilcox (2):
9p: Use a slab for allocating requests
9p: Remove p9_idpool
Tomas Bortoli (3):
9p: rename p9_free_req() function
9p: Add refcount to p9_req_t
9p: Rename req to rreq in trans_fd
fs/9p/acl.c | 2 +-
fs/9p/v9fs.c | 21 +++++
fs/9p/v9fs.h | 1 +
fs/9p/vfs_dir.c | 19 +---
fs/9p/vfs_file.c | 24 +++++-
include/net/9p/9p.h | 12 +--
include/net/9p/client.h | 71 ++++++---------
net/9p/Makefile | 1 -
net/9p/client.c | 551 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++---------------------------------------------------------
net/9p/mod.c | 9 +-
net/9p/protocol.c | 20 ++++-
net/9p/trans_fd.c | 64 +++++++++-----
net/9p/trans_rdma.c | 37 ++++----
net/9p/trans_virtio.c | 44 +++++++---
net/9p/trans_xen.c | 17 ++--
net/9p/util.c | 140 ------------------------------
16 files changed, 482 insertions(+), 551 deletions(-)
delete mode 100644 net/9p/util.c
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE/IPbcYBuWt0zoYhOq06b7GqY5nAFAlvWYE0ACgkQq06b7GqY
5nBBDxAAkz4rGRsnRpx9D1Bt+81eARTqsWcoPM138zEbwiJehe5aSF8x/CGyirDW
+PoES/efho50sDfaUL9eBMgyy95UMN7LepMTcjNE1tywA+tRsH3dOL4757+lBwZc
FGsHvfse4b5ZzmMxSo2L5KbzpNiUEXCJGEvNlBSbeWcSSUtCfOShMVx9IC6r7JhX
pbsoczQ6W+384V4WgBxBZZf99EryxSeYJ9KMU+9b13ndzKICJqYeaEdItt28uYUC
d05iXDEDQZD3B8R/1Y08ktMqmHLXP9kZdJ/w8HhCMRAxWKUD1pfDwJ45gIwhCb5R
KxdxZ2tPvc8ihGngNJ4VaJoQQFTVzDgbyWXjwAFYkFUuPFgmuAgx3qApOhUFHCdM
8vgJd2u0attNwfOM3VsJApLQu09wmRw1LcmJ9re5OJ/YLiAUF3vw2oTpGhGlKRRI
bMBI9rwXHzeoVI7ank5mh1NMVRpMbzL9A8uqux85gcpXL36V3XmZtKRDe1AAa5jB
JR6dciAUABm9m+Ilt1Pl/faSjmysUOb1wZ0XU/cvVy5EBMUjbHv3x0TuDmrVwDas
puUXEM8soVf9e/NUYqTOozwFIle6KFcyPmae/OopATLgkGES4cjDy9cLh5SgnERM
vMcN/Z+DQO98zM/fPRy2B4qklflgTKI6VI/gSLRCWGpn49Iwhhg=
=g1OF
-----END PGP SIGNATURE-----
Merge tag '9p-for-4.20' of git://github.com/martinetd/linux
Pull 9p updates from Dominique Martinet:
"Highlights this time around are the end of Matthew's work to remove
the custom 9p request cache and use a slab directly for requests, with
some extra patches on my end to not degrade performance, but it's a
very good cleanup.
Tomas and I fixed a few more syzkaller bugs (refcount is the big one),
and I had a go at the coverity bugs and at some of the bugzilla
reports we had open for a while.
I'm a bit disappointed that I couldn't get much reviews for a few of
my own patches, but the big ones got some and it's all been soaking in
linux-next for quite a while so I think it should be OK.
Summary:
- Finish removing the custom 9p request cache mechanism
- Embed part of the fcall in the request to have better slab
performance (msize usually is power of two aligned)
- syzkaller fixes:
* add a refcount to 9p requests to avoid use after free
* a few double free issues
- A few coverity fixes
- Some old patches that were in the bugzilla:
* do not trust pdu content for size header
* mount option for lock retry interval"
* tag '9p-for-4.20' of git://github.com/martinetd/linux: (21 commits)
9p/trans_fd: put worker reqs on destroy
9p/trans_fd: abort p9_read_work if req status changed
9p: potential NULL dereference
9p locks: fix glock.client_id leak in do_lock
9p: p9dirent_read: check network-provided name length
9p/rdma: remove useless check in cm_event_handler
9p: acl: fix uninitialized iattr access
9p locks: add mount option for lock retry interval
9p: do not trust pdu content for stat item size
9p: Rename req to rreq in trans_fd
9p: fix spelling mistake in fall-through annotation
9p/rdma: do not disconnect on down_interruptible EAGAIN
9p: Add refcount to p9_req_t
9p: rename p9_free_req() function
9p: add a per-client fcall kmem_cache
9p: embed fcall in req to round down buffer allocs
9p: Remove p9_idpool
9p: Use a slab for allocating requests
9p: clear dangling pointers in p9stat_free
v9fs_dir_readdir: fix double-free on p9stat_read error
...
Pull XArray conversion from Matthew Wilcox:
"The XArray provides an improved interface to the radix tree data
structure, providing locking as part of the API, specifying GFP flags
at allocation time, eliminating preloading, less re-walking the tree,
more efficient iterations and not exposing RCU-protected pointers to
its users.
This patch set
1. Introduces the XArray implementation
2. Converts the pagecache to use it
3. Converts memremap to use it
The page cache is the most complex and important user of the radix
tree, so converting it was most important. Converting the memremap
code removes the only other user of the multiorder code, which allows
us to remove the radix tree code that supported it.
I have 40+ followup patches to convert many other users of the radix
tree over to the XArray, but I'd like to get this part in first. The
other conversions haven't been in linux-next and aren't suitable for
applying yet, but you can see them in the xarray-conv branch if you're
interested"
* 'xarray' of git://git.infradead.org/users/willy/linux-dax: (90 commits)
radix tree: Remove multiorder support
radix tree test: Convert multiorder tests to XArray
radix tree tests: Convert item_delete_rcu to XArray
radix tree tests: Convert item_kill_tree to XArray
radix tree tests: Move item_insert_order
radix tree test suite: Remove multiorder benchmarking
radix tree test suite: Remove __item_insert
memremap: Convert to XArray
xarray: Add range store functionality
xarray: Move multiorder_check to in-kernel tests
xarray: Move multiorder_shrink to kernel tests
xarray: Move multiorder account test in-kernel
radix tree test suite: Convert iteration test to XArray
radix tree test suite: Convert tag_tagged_items to XArray
radix tree: Remove radix_tree_clear_tags
radix tree: Remove radix_tree_maybe_preload_order
radix tree: Remove split/join code
radix tree: Remove radix_tree_update_node_t
page cache: Finish XArray conversion
dax: Convert page fault handlers to XArray
...
Merge updates from Andrew Morton:
- a few misc things
- ocfs2 updates
- most of MM
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (132 commits)
hugetlbfs: dirty pages as they are added to pagecache
mm: export add_swap_extent()
mm: split SWP_FILE into SWP_ACTIVATED and SWP_FS
tools/testing/selftests/vm/map_fixed_noreplace.c: add test for MAP_FIXED_NOREPLACE
mm: thp: relocate flush_cache_range() in migrate_misplaced_transhuge_page()
mm: thp: fix mmu_notifier in migrate_misplaced_transhuge_page()
mm: thp: fix MADV_DONTNEED vs migrate_misplaced_transhuge_page race condition
mm/kasan/quarantine.c: make quarantine_lock a raw_spinlock_t
mm/gup: cache dev_pagemap while pinning pages
Revert "x86/e820: put !E820_TYPE_RAM regions into memblock.reserved"
mm: return zero_resv_unavail optimization
mm: zero remaining unavailable struct pages
tools/testing/selftests/vm/gup_benchmark.c: add MAP_HUGETLB option
tools/testing/selftests/vm/gup_benchmark.c: add MAP_SHARED option
tools/testing/selftests/vm/gup_benchmark.c: allow user specified file
tools/testing/selftests/vm/gup_benchmark.c: fix 'write' flag usage
mm/gup_benchmark.c: add additional pinning methods
mm/gup_benchmark.c: time put_page()
mm: don't raise MEMCG_OOM event due to failed high-order allocation
mm/page-writeback.c: fix range_cyclic writeback vs writepages deadlock
...
The page cache and most shrinkable slab caches hold data that has been
read from disk, but there are some caches that only cache CPU work, such
as the dentry and inode caches of procfs and sysfs, as well as the subset
of radix tree nodes that track non-resident page cache.
Currently, all these are shrunk at the same rate: using DEFAULT_SEEKS for
the shrinker's seeks setting tells the reclaim algorithm that for every
two page cache pages scanned it should scan one slab object.
This is a bogus setting. A virtual inode that required no IO to create is
not twice as valuable as a page cache page; shadow cache entries with
eviction distances beyond the size of memory aren't either.
In most cases, the behavior in practice is still fine. Such virtual
caches don't tend to grow and assert themselves aggressively, and usually
get picked up before they cause problems. But there are scenarios where
that's not true.
Our database workloads suffer from two of those. For one, their file
workingset is several times bigger than available memory, which has the
kernel aggressively create shadow page cache entries for the non-resident
parts of it. The workingset code does tell the VM that most of these are
expendable, but the VM ends up balancing them 2:1 to cache pages as per
the seeks setting. This is a huge waste of memory.
These workloads also deal with tens of thousands of open files and use
/proc for introspection, which ends up growing the proc_inode_cache to
absurdly large sizes - again at the cost of valuable cache space, which
isn't a reasonable trade-off, given that proc inodes can be re-created
without involving the disk.
This patch implements a "zero-seek" setting for shrinkers that results in
a target ratio of 0:1 between their objects and IO-backed caches. This
allows such virtual caches to grow when memory is available (they do
cache/avoid CPU work after all), but effectively disables them as soon as
IO-backed objects are under pressure.
It then switches the shrinkers for procfs and sysfs metadata, as well as
excess page cache shadow nodes, to the new zero-seek setting.
Link: http://lkml.kernel.org/r/20181009184732.762-5-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Domas Mituzas <dmituzas@fb.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Rik van Riel <riel@surriel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There are several definitions of those functions/macros in places that
mess with fixed-point load averages. Provide an official version.
[akpm@linux-foundation.org: fix missed conversion in block/blk-iolatency.c]
Link: http://lkml.kernel.org/r/20180828172258.3185-5-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Tested-by: Daniel Drake <drake@endlessm.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <jweiner@fb.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Enderborg <peter.enderborg@sony.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The vmstat NR_KERNEL_MISC_RECLAIMABLE counter is for kernel non-slab
allocations that can be reclaimed via shrinker. In /proc/meminfo, we can
show the sum of all reclaimable kernel allocations (including slab) as
"KReclaimable". Add the same counter also to per-node meminfo under /sys
With this counter, users will have more complete information about kernel
memory usage. Non-slab reclaimable pages (currently just the ION
allocator) will not be missing from /proc/meminfo, making users wonder
where part of their memory went. More precisely, they already appear in
MemAvailable, but without the new counter, it's not obvious why the value
in MemAvailable doesn't fully correspond with the sum of other counters
participating in it.
Link: http://lkml.kernel.org/r/20180731090649.16028-6-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Roman Gushchin <guro@fb.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: Vijayanand Jitta <vjitta@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We can use the newly introduced kmalloc-reclaimable-X caches, to allocate
external names in dcache, which will take care of the proper accounting
automatically, and also improve anti-fragmentation page grouping.
This effectively reverts commit f1782c9bc5 ("dcache: account external
names as indirectly reclaimable memory") and instead passes
__GFP_RECLAIMABLE to kmalloc(). The accounting thus moves from
NR_INDIRECTLY_RECLAIMABLE_BYTES to NR_SLAB_RECLAIMABLE, which is also
considered in MemAvailable calculation and overcommit decisions.
Link: http://lkml.kernel.org/r/20180731090649.16028-4-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Roman Gushchin <guro@fb.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: Vijayanand Jitta <vjitta@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
cramfs is the only remaining user of vm_insert_mixed() and should be
converted to vmf_insert_mixed().
Based on a previous patch from Matthew Wilcox.
Link: http://lkml.kernel.org/r/nycvar.YSQ.7.76.1808290945450.10215@knanqh.ubzr
Signed-off-by: Nicolas Pitre <nico@linaro.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Souptick Joarder <jrdr.linux@gmail.com>a
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Change iomap_page_mkwrite() return type to vm_fault_t.
see commit 1c8f422059 ("mm: change return type to vm_fault_t") for
reference.
Link: http://lkml.kernel.org/r/20180827172050.GA18673@jordon-HP-15-Notebook-PC
Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fixes gcc '-Wunused-but-set-variable' warning:
fs/ocfs2/refcounttree.c: In function 'ocfs2_create_reflink_node':
fs/ocfs2/refcounttree.c:4138:31: warning:
variable 'rb' set but not used [-Wunused-but-set-variable]
Link: http://lkml.kernel.org/r/1536198443-113047-1-git-send-email-yuehaibing@huawei.com
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Changwei Ge <ge.changwei@h3c.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The kernel module may sleep with holding a spinlock.
The function call paths (from bottom to top) in Linux-4.16 are:
[FUNC] get_zeroed_page(GFP_NOFS)
fs/ocfs2/dlm/dlmdebug.c, 332: get_zeroed_page in dlm_print_one_mle
fs/ocfs2/dlm/dlmmaster.c, 240: dlm_print_one_mle in __dlm_put_mle
fs/ocfs2/dlm/dlmmaster.c, 255: __dlm_put_mle in dlm_put_mle
fs/ocfs2/dlm/dlmmaster.c, 254: spin_lock in dlm_put_ml
[FUNC] get_zeroed_page(GFP_NOFS)
fs/ocfs2/dlm/dlmdebug.c, 332: get_zeroed_page in dlm_print_one_mle
fs/ocfs2/dlm/dlmmaster.c, 240: dlm_print_one_mle in __dlm_put_mle
fs/ocfs2/dlm/dlmmaster.c, 222: __dlm_put_mle in dlm_put_mle_inuse
fs/ocfs2/dlm/dlmmaster.c, 219: spin_lock in dlm_put_mle_inuse
To fix this bug, GFP_NOFS is replaced with GFP_ATOMIC.
This bug is found by my static analysis tool DSAC.
Link: http://lkml.kernel.org/r/20180901112528.27025-1-baijiaju1990@gmail.com
Signed-off-by: Jia-Ju Bai <baijiaju1990@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Changwei Ge <ge.changwei@h3c.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Null check for kfree is unnecessary, so remove it.
Link: http://lkml.kernel.org/r/1535704514-26559-1-git-send-email-dingxiang@cmss.chinamobile.com
Signed-off-by: Ding Xiang <dingxiang@cmss.chinamobile.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Changwei Ge <ge.changwei@h3c.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pointer 'eb' is being assigned but is never used hence it is
redundant and can be removed.
Cleans up clang warning:
warning: variable 'eb' set but not used [-Wunused-but-set-variable]
Link: http://lkml.kernel.org/r/20180828141907.10826-1-colin.king@canonical.com
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Changwei Ge <ge.changwei@h3c.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Clang warns when more than one set of parentheses is used for a
single conditional statement:
fs/ocfs2/dlm/dlmthread.c:534:18: warning: equality comparison with extraneous
parentheses [-Wparentheses-equality]
if ((res->owner == dlm->node_num)) {
~~~~~~~~~~~^~~~~~~~~~~~~~~~
fs/ocfs2/dlm/dlmthread.c:534:18: note: remove extraneous parentheses around the
comparison to silence this warning
if ((res->owner == dlm->node_num)) {
~ ^ ~
Link: http://lkml.kernel.org/r/20180924181929.6853-1-natechancellor@gmail.com
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Reported-by: Nick Desaulniers <ndesaulniers@google.com>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
userfaultfd contains howe-grown locking of the waitqueue lock, and does
not disable interrupts. This relies on the fact that no one else takes it
from interrupt context and violates an invariat of the normal waitqueue
locking scheme. With aio poll it is easy to trigger other locks that
disable interrupts (or are called from interrupt context).
Link: http://lkml.kernel.org/r/20181018154101.18750-1-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: <stable@vger.kernel.org> [4.19.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There is no functional change but it seems better to get size by calling
posix_acl_xattr_size() instead of calling posix_acl_to_xattr() with
NULL buffer argument. Additionally, remove unnecessary assignments.
Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
It just makes the interface strange without adding any significant value.
The only case where locked is false and return value is 0 is in
ovl_rename() when new is negative, so handle that case explicitly in
ovl_rename().
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
We use uuid to associate an overlay lower file handle with a lower layer,
so we can accept lower fs with null uuid as long as all lower layers with
null uuid are on the same fs.
This change allows enabling index and nfs_export features for the setup of
single lower fs of type squashfs - squashfs supports file handles, but has
a null uuid. This change also allows enabling index and nfs_export features
for nested overlayfs, where the lower overlay has nfs_export enabled.
Enabling the index feature with single lower squashfs fixes the
unionmount-testsuite test:
./run --ov --squashfs --verify
As a by-product, if, like the lower squashfs, upper fs also uses the
generic export_encode_fh() implementation to export 32bit inode file
handles (e.g. ext4), then the xino_auto config/module/mount option will
enable unique overlay inode numbers.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Now that the workdir and tmpfile copy up modes have been untagled, the
functions become simple enough that the helpers can be folded into the
callers.
Add new helpers where there is any duplication remaining: preparing creds
for creating the object.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
In an attempt to dedup ~100 LOC, we ended up creating a tangled call chain,
whose branches merge and diverge in several points according to the
immutable c->tmpfile copy up mode.
This call chain was hard to analyse for locking correctness because the
locking requirements for the c->tmpfile flow were very different from the
locking requirements for the !c->tmpfile flow (i.e. directory vs. regulare
file copy up).
Split the copy up helpers of the c->tmpfile flow from those of the
!c->tmpfile (i.e. workdir) flow and remove the c->tmpfile mode from copy up
context.
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Make permission checking more consistent:
- special files don't need any access check on underling fs
- exec permission check doesn't need to be performed on underlying fs
Reported-by: "J. Bruce Fields" <bfields@fieldses.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
linking a non-copied-up file into a non-copied-up parent results in a
nested call to mutex_lock_interruptible(&oi->lock). Fix this by copying up
target parent before ovl_nlink_start(), same as done in ovl_rename().
~/unionmount-testsuite$ ./run --ov -s
~/unionmount-testsuite$ ln /mnt/a/foo100 /mnt/a/dir100/
WARNING: possible recursive locking detected
--------------------------------------------
ln/1545 is trying to acquire lock:
00000000bcce7c4c (&ovl_i_lock_key[depth]){+.+.}, at:
ovl_copy_up_start+0x28/0x7d
but task is already holding lock:
0000000026d73d5b (&ovl_i_lock_key[depth]){+.+.}, at:
ovl_nlink_start+0x3c/0xc1
[SzM: this seems to be a false positive, but doing the copy-up first is
harmless and removes the lockdep splat]
Reported-by: syzbot+3ef5c0d1a5cb0b21e6be@syzkaller.appspotmail.com
Fixes: 5f8415d6b8 ("ovl: persistent overlay inode nlink for...")
Cc: <stable@vger.kernel.org> # v4.13
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Some anon_bdev filesystems (e.g. overlayfs, ceph) don't have s_blocksize
set. Returning zero from FIGETBSZ ioctl results in a Floating point
exception from the e2fsprogs utility filefrag, which divides the size of
the file with the value returned by FIGETBSZ.
Fix the interface by returning -EINVAL for these filesystems.
Fixes: d1d04ef857 ("ovl: stack file ops")
Cc: <stable@vger.kernel.org> # v4.19
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
If security_inode_copy_up() fails, it should not set new_creds, so no need
for the cleanup (which would've Oops-ed anyway, due to old_creds being
NULL).
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
We hit a BUG on kfree of an ERR_PTR()...
Reported-by: syzbot+ff03fe05c717b82502d0@syzkaller.appspotmail.com
Fixes: 8b88a2e640 ("ovl: verify upper root dir matches lower root dir")
Cc: <stable@vger.kernel.org> # v4.13
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Highlights include:
Stable fixes:
- Fix the NFSv4.1 r/wsize sanity checking
- Reset the RPC/RDMA credit grant properly after a disconnect
- Fix a missed page unlock after pg_doio()
Features and optimisations:
- Overhaul of the RPC client socket code to eliminate a locking bottleneck
and reduce the latency when transmitting lots of requests in parallel.
- Allow parallelisation of the RPCSEC_GSS encoding of an RPC request.
- Convert the RPC client socket receive code to use iovec_iter() for
improved efficiency.
- Convert several NFS and RPC lookup operations to use RCU instead of
taking global locks.
- Avoid the need for BH-safe locks in the RPC/RDMA back channel.
Bugfixes and cleanups:
- Fix lock recovery during NFSv4 delegation recalls
- Fix the NFSv4 + NFSv4.1 "lookup revalidate + open file" case.
- Fixes for the RPC connection metrics
- Various RPC client layer cleanups to consolidate stream based sockets
- RPC/RDMA connection cleanups
- Simplify the RPC/RDMA cleanup after memory operation failures
- Clean ups for NFS v4.2 copy completion and NFSv4 open state reclaim.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJb0zW8AAoJEA4mA3inWBJcmccP/0hkeNFk2y4tErit1lq4TYDs
sMkFv0rjhBkxWbZFmGJfAulbQ5cu+GwTBqqmhm67rE+2C+vevrE4JRfDFmcEGpio
lE/2uJdqu1UlIOiovyjk0jMetUuf2LTS82vloPP/z5mmvgQ4S1NSajUGuPbjQR2S
AtTj0XGI5e1nm8PZDftbomcxD5HUYaITQEDCyrm8a7xX8OZ5ySXakzdgXuNM5TgI
MPjcpOFvIARwF4MhovYFZtSInB5XiZYSiTAB03deVgy38JDsSPeQgwUVWjErrq/K
V/6kOg8EYd0uNFmUCwKX/ecbvAlnbfqAMX+YcL0ZrbVk0pBqxVvoGVXK8ex8Wbm1
eL9tyYK81Sc7TliXr2+R22CHDcMTTMImFLix5Gp6mk2Fd5TpMydV9c9S7NBCHYB4
rgcM9brgutFF6N8zqdBpa1FVH3cBE1A428/90kp4XU/kdQlxIvYBLBCylI25POEL
7oqhcJxljFLWXZdhmH7t3WV0RWOzITZHEp9foL8p6yAPzOSWPF98OlQU+FmLj3Y4
EZ61qLXIRxYpLf1aZh7GNKms5ZzOhKiZgw43UL3pl4xKhk2i9061IUKGSEHgIklk
BX34dmCALDlapt+Ggcm1uIe9BLCc4KADfixqNfr91dSOycFM2RajsSZCPrP9Gx8G
t8rYl8x+lLZ5ZxLkdTUP
=Fn8z
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-4.20-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client updates from Trond Myklebust:
"Highlights include:
Stable fixes:
- Fix the NFSv4.1 r/wsize sanity checking
- Reset the RPC/RDMA credit grant properly after a disconnect
- Fix a missed page unlock after pg_doio()
Features and optimisations:
- Overhaul of the RPC client socket code to eliminate a locking
bottleneck and reduce the latency when transmitting lots of
requests in parallel.
- Allow parallelisation of the RPCSEC_GSS encoding of an RPC request.
- Convert the RPC client socket receive code to use iovec_iter() for
improved efficiency.
- Convert several NFS and RPC lookup operations to use RCU instead of
taking global locks.
- Avoid the need for BH-safe locks in the RPC/RDMA back channel.
Bugfixes and cleanups:
- Fix lock recovery during NFSv4 delegation recalls
- Fix the NFSv4 + NFSv4.1 "lookup revalidate + open file" case.
- Fixes for the RPC connection metrics
- Various RPC client layer cleanups to consolidate stream based
sockets
- RPC/RDMA connection cleanups
- Simplify the RPC/RDMA cleanup after memory operation failures
- Clean ups for NFS v4.2 copy completion and NFSv4 open state
reclaim"
* tag 'nfs-for-4.20-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (97 commits)
SUNRPC: Convert the auth cred cache to use refcount_t
SUNRPC: Convert auth creds to use refcount_t
SUNRPC: Simplify lookup code
SUNRPC: Clean up the AUTH cache code
NFS: change sign of nfs_fh length
sunrpc: safely reallow resvport min/max inversion
nfs: remove redundant call to nfs_context_set_write_error()
nfs: Fix a missed page unlock after pg_doio()
SUNRPC: Fix a compile warning for cmpxchg64()
NFSv4.x: fix lock recovery during delegation recall
SUNRPC: use cmpxchg64() in gss_seq_send64_fetch_and_inc()
xprtrdma: Squelch a sparse warning
xprtrdma: Clean up xprt_rdma_disconnect_inject
xprtrdma: Add documenting comments
xprtrdma: Report when there were zero posted Receives
xprtrdma: Move rb_flags initialization
xprtrdma: Don't disable BH's in backchannel server
xprtrdma: Remove memory address of "ep" from an error message
xprtrdma: Rename rpcrdma_qp_async_error_upcall
xprtrdma: Simplify RPC wake-ups on connect
...
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAlvR8lcACgkQiiy9cAdy
T1FNtgv/fpnMnf/4JPE40NgJ6CUcv4xsJ3bDzmezB5ZUgoNigtVeUMSBa8qCEBcg
cdC243TOpwNGaWQ1yzRN4kyGq1cAE9B1xal4n7+xlii+ZpWXwrkOAiF27UTAIGTR
ck3IfeS529QoQt9ReI4v+pWYKZOnlbWgF7iBflg0Snsz/JvICQ05wRA9VaXBJIz8
Pwb3SDPCrON1KRJzJVDjC6AaYhZqu2VLbSV9fOhZ5WVcHb/t0EUqsFvgMzhk2+tv
Rh+9zNzcQWyYI8KtYQmWMMoSk7F8OGlARWXW0ROfOoQwC70zg35F+tGUahlWsIYD
19TLJy28g5Gqh0DZoPmtpNUdu1NCfy+vQcqaSNnAaQreMlqmV6ODxjvz3DeGL9lK
Teo0V9dwWOZNFneFTpVsrWL4KQEZfDPDt1L6e3GOL5t6QLOZa5IuPVs8A9txqFCD
kTAIQstESmXOrl+HpP64LcovV4BaD05st+fo7Cec16UDJjEqxCmHUSIYw3kFnCny
4UAITp4V
=q4Qs
-----END PGP SIGNATURE-----
Merge tag '4.20-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs updates from Steve French:
"Three smb3 fixes for stable, patches for improved debugging and perf
gathering, and much improved performance for most metadata operations
(expanded use of compounding)"
* tag '4.20-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6: (46 commits)
cifs: update internal module version number for cifs.ko to 2.14
smb3: add debug for unexpected mid cancellation
cifs: allow calling SMB2_xxx_free(NULL)
smb3 - clean up debug output displaying network interfaces
smb3: show number of current open files in /proc/fs/cifs/Stats
cifs: add support for ioctl on directories
cifs: fallback to older infolevels on findfirst queryinfo retry
smb3: do not attempt cifs operation in smb3 query info error path
smb3: send backup intent on compounded query info
cifs: track writepages in vfs operation counters
smb2: fix uninitialized variable bug in smb2_ioctl_query_info
cifs: add IOCTL for QUERY_INFO passthrough to userspace
cifs: minor clarification in comments
CIFS: Print message when attempting a mount
CIFS: Adds information-level logging function
cifs: OFD locks do not conflict with eachothers
CIFS: SMBD: Do not call ib_dereg_mr on invalidated memory registration
CIFS: pass page offsets on SMB1 read/write
fs/cifs: fix uninitialised variable warnings
smb3: add tracepoint for sending lease break responses to server
...
Trivial fix to a spelling mistake of the error access name EACCESS,
rename to EACCES
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Driver core patches for 4.20-rc1
Here is a small number of driver core patches for 4.20-rc1.
Not much happened here this merge window, only a very tiny number of
patches that do:
- add BUS_ATTR_WO() for use by drivers
- component error path fixes
- kernfs range check fix
- other tiny error path fixes and const changes
All of these have been in linux-next with no reported issues for a
while.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCW9Lhtw8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+ykHTgCguaJ3SgRefuC/WijjqboTC/SikCoAnRVTUxfU
v8BisSN22kR3jmxwsXud
=/IvY
-----END PGP SIGNATURE-----
Merge tag 'driver-core-4.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core updates from Greg KH:
"Here is a small number of driver core patches for 4.20-rc1.
Not much happened here this merge window, only a very tiny number of
patches that do:
- add BUS_ATTR_WO() for use by drivers
- component error path fixes
- kernfs range check fix
- other tiny error path fixes and const changes
All of these have been in linux-next with no reported issues for a
while"
* tag 'driver-core-4.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
devres: provide devm_kstrdup_const()
mm: move is_kernel_rodata() to asm-generic/sections.h
devres: constify p in devm_kfree()
driver core: add BUS_ATTR_WO() macro
kernfs: Fix range checks in kernfs_get_target_path
component: fix loop condition to call unbind() if bind() fails
drivers/base/devtmpfs.c: don't pretend path is const in delete_path
kernfs: update comment about kernfs_path() return value
Pull integrity updates from James Morris:
"From Mimi: This contains a couple of bug fixes, including one for a
recent problem with calculating file hashes on overlayfs, and some
code cleanup"
* 'next-integrity' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security:
MAINTAINERS: add Jarkko as maintainer for trusted keys
ima: open a new file instance if no read permissions
ima: fix showing large 'violations' or 'runtime_measurements_count'
security/integrity: remove unnecessary 'init_keyring' variable
security/integrity: constify some read-only data
vfs: require i_size <= SIZE_MAX in kernel_read_file()
Pull more ->lookup() cleanups from Al Viro:
"Some ->lookup() instances are still overcomplicating the life
for themselves, open-coding the stuff that would be handled by
d_splice_alias() just fine.
Simplify a couple of such cases caught this cycle and document
d_splice_alias() intended use"
* 'work.lookup' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
Document d_splice_alias() calling conventions for ->lookup() users.
simplify btrfs_lookup()
clean erofs_lookup()
Pull compat_ioctl fixes from Al Viro:
"A bunch of compat_ioctl fixes, mostly in bluetooth.
Hopefully, most of fs/compat_ioctl.c will get killed off over the next
few cycles; between this, tty series already merged and Arnd's work
this cycle ought to take a good chunk out of the damn thing..."
* 'work.compat' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
hidp: fix compat_ioctl
hidp: constify hidp_connection_add()
cmtp: fix compat_ioctl
bnep: fix compat_ioctl
compat_ioctl: trim the pointless includes
Pull timekeeping updates from Thomas Gleixner:
"The timers and timekeeping departement provides:
- Another large y2038 update with further preparations for providing
the y2038 safe timespecs closer to the syscalls.
- An overhaul of the SHCMT clocksource driver
- SPDX license identifier updates
- Small cleanups and fixes all over the place"
* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (31 commits)
tick/sched : Remove redundant cpu_online() check
clocksource/drivers/dw_apb: Add reset control
clocksource: Remove obsolete CLOCKSOURCE_OF_DECLARE
clocksource/drivers: Unify the names to timer-* format
clocksource/drivers/sh_cmt: Add R-Car gen3 support
dt-bindings: timer: renesas: cmt: document R-Car gen3 support
clocksource/drivers/sh_cmt: Properly line-wrap sh_cmt_of_table[] initializer
clocksource/drivers/sh_cmt: Fix clocksource width for 32-bit machines
clocksource/drivers/sh_cmt: Fixup for 64-bit machines
clocksource/drivers/sh_tmu: Convert to SPDX identifiers
clocksource/drivers/sh_mtu2: Convert to SPDX identifiers
clocksource/drivers/sh_cmt: Convert to SPDX identifiers
clocksource/drivers/renesas-ostm: Convert to SPDX identifiers
clocksource: Convert to using %pOFn instead of device_node.name
tick/broadcast: Remove redundant check
RISC-V: Request newstat syscalls
y2038: signal: Change rt_sigtimedwait to use __kernel_timespec
y2038: socket: Change recvmmsg to use __kernel_timespec
y2038: sched: Change sched_rr_get_interval to use __kernel_timespec
y2038: utimes: Rework #ifdef guards for compat syscalls
...
Detaching of mark connector from fsnotify_put_mark() can race with
unmounting of the filesystem like:
CPU1 CPU2
fsnotify_put_mark()
spin_lock(&conn->lock);
...
inode = fsnotify_detach_connector_from_object(conn)
spin_unlock(&conn->lock);
generic_shutdown_super()
fsnotify_unmount_inodes()
sees connector detached for inode
-> nothing to do
evict_inode()
barfs on pending inode reference
iput(inode);
Resulting in "Busy inodes after unmount" message and possible kernel
oops. Make fsnotify_unmount_inodes() properly wait for outstanding inode
references from detached connectors.
Note that the accounting of outstanding inode references in the
superblock can cause some cacheline contention on the counter. OTOH it
happens only during deletion of the last notification mark from an inode
(or during unlinking of watched inode) and that is not too bad. I have
measured time to create & delete inotify watch 100000 times from 64
processes in parallel (each process having its own inotify group and its
own file on a shared superblock) on a 64 CPU machine. Average and
standard deviation of 15 runs look like:
Avg Stddev
Vanilla 9.817400 0.276165
Fixed 9.710467 0.228294
So there's no statistically significant difference.
Fixes: 6b3f05d24d ("fsnotify: Detach mark from object list when last reference is dropped")
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
allocation for bigalloc file systems; fix up some syzbot-detected
races in EXT4_IOC_MOVE_EXT, EXT4_IOC_SWAP_BOOT, and ext4_remount; and
a few other miscellaneous bugs and optimizations.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAlvQYEcACgkQ8vlZVpUN
gaOPYAgAh0BF7mTRnHAp/qkR5ZhDi3ecb3TpNlnpfzoDqQhPYETFisc18DD4HwTj
wctwzSdYxYodeuPIK+R2bBzUy3FuSwtlER9cdr1ilcrUYPZHbir1rPPfTNb/oDGx
WNcd/aulLjuU1eKDODowqMOF2HDchiJHqJqMBa+LfCHck1x/bt2uqdjNA5A1p5AV
lp07DoXT54q5rWJDaXpbxTShWKhzHlRKbB9PKEvMHgPNl9sn5oRReRMKAW+WkT91
e3mfy/GhzhugdWxYUg2oAn3dbqYkkAjW96WnBhCQHioW9ASphjl7yBi1LWh2aPA4
haGxe5W3En8q678ZVtTVNJOyvbW81Q==
=VgdS
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 updates from Ted Ts'o:
- further restructure ext4 documentation
- fix up ext4's delayed allocation for bigalloc file systems
- fix up some syzbot-detected races in EXT4_IOC_MOVE_EXT,
EXT4_IOC_SWAP_BOOT, and ext4_remount
- ... and a few other miscellaneous bugs and optimizations.
* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (21 commits)
ext4: fix use-after-free race in ext4_remount()'s error path
ext4: cache NULL when both default_acl and acl are NULL
docs: promote the ext4 data structures book to top level
docs: move ext4 administrative docs to admin-guide/
jbd2: fix use after free in jbd2_log_do_checkpoint()
ext4: propagate error from dquot_initialize() in EXT4_IOC_FSSETXATTR
ext4: fix setattr project check in fssetxattr ioctl
docs: make ext4 readme tables readable
docs: fix ext4 documentation table formatting problems
docs: generate a separate ext4 pdf file from the documentation
ext4: convert fault handler to use vm_fault_t type
ext4: initialize retries variable in ext4_da_write_inline_data_begin()
ext4: fix EXT4_IOC_SWAP_BOOT
ext4: fix build error when DX_DEBUG is defined
ext4: fix argument checking in EXT4_IOC_MOVE_EXT
ext4: fix reserved cluster accounting at page invalidation time
ext4: adjust reserved cluster count when removing extents
ext4: reduce reserved cluster count by number of allocated clusters
ext4: fix reserved cluster accounting at delayed write time
ext4: add new pending reservation mechanism
...
In this round, we've added 1) superblock checksum feature, 2) implemented new
mount option which we can disable/enable checkpoint to provide atomic updates of
entire filesystem, 3) refactored quota operations to enhance its consistency
along with checkpoint, 4) fixed subtle IO hang conditions and roll-forward
recovery flow to resurrect any fsync'ed inode metadata.
Enhancement:
- add checksum to keep superblock contents more safe
- add checkpoint=disable/enable to support A/B update of entire filesystem
- use plug for readahead IO in readdir
- add more IO counts to avoid block layer hacks
Bug fix:
- prevent data corruption issue for hardware encryption
- fix IO hang issues when GC is heavily triggered
- add missing up_read in __write_node_page
- recover inode metadata during roll-forward recovery flow
- fix null pointer dereference issue in wrongly configured discard map
There are some more sanity checks and minor bug fixes as well.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE00UqedjCtOrGVvQiQBSofoJIUNIFAlvPzt4ACgkQQBSofoJI
UNJLfg/8Ch9TWfbeUEH+6ioJ4pdURHmb/gFOQRSGX8Nu+HkuPJQD8pqlheK7n5g7
Pw3K6NLXHmL2d8xSNFbBwYmUSAeoXTc0J4nqX0sUJ6m7SKsuQ45Qe3A90faKEoAA
ce7flWKVI+aJcGurBe99GOM69ptfyjb1w/8UGB0pcXUDq4oaRv5a1UtAAm92WF7H
4/7jYD3ub5aeSynwe16wWR4B5aXJT0l0FcZYicI6IRY1mOjMtXuQt72AY+ffSzBt
yQ6qb8OEl3xpfQZHHH00ZfvarkTBzXJGZwquiPX/CPzVcee8cOqPp+XZqN7CXBEr
9ItezxYiUxOkKCl12Al8DynHZa6o2kEnWxgd49WkL/cNdInnvf5MD0kdfCV3KfQa
CAR0UVe2yTg5mGLemtTSWveLdHfI7+LhDmURuXmoTUa9GWldw0413qqVVypcsizv
QOAS86hSicrVK+bDnCA70i8Xxw7YEnAyrfCcgihU84NZSi7nTPUYj4xtMd9SzRnK
JO8gA79D7lcWaxUS4r9I+JBDwWcfMQZRPS7PFbvoGWilIwsEaocCPYNgtjCTsAsK
1fePqiF/265Q4lapmEhEjhuQSNH2xfJQZ4ux1OU+eS3OTDjbEAFBeVPZImh7Mo7F
dkpXQwfcqAXPzOM4QAJ6hFX40D8SWMAlId6XGiIfJlrFmEUAxBk=
=VDw4
-----END PGP SIGNATURE-----
Merge tag 'f2fs-for-4.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
Pull f2fs updates from Jaegeuk Kim:
"In this round, we've added 1) superblock checksum feature, 2)
implemented new mount option which we can disable/enable checkpoint to
provide atomic updates of entire filesystem, 3) refactored quota
operations to enhance its consistency along with checkpoint, 4) fixed
subtle IO hang conditions and roll-forward recovery flow to resurrect
any fsync'ed inode metadata.
Enhancements:
- add checksum to keep superblock contents more safe
- add checkpoint=disable/enable to support A/B update of entire filesystem
- use plug for readahead IO in readdir
- add more IO counts to avoid block layer hacks
Bug fixes:
- prevent data corruption issue for hardware encryption
- fix IO hang issues when GC is heavily triggered
- add missing up_read in __write_node_page
- recover inode metadata during roll-forward recovery flow
- fix null pointer dereference issue in wrongly configured discard map
There are some more sanity checks and minor bug fixes as well"
* tag 'f2fs-for-4.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (62 commits)
f2fs: fix to keep project quota consistent
f2fs: guarantee journalled quota data by checkpoint
f2fs: cleanup dirty pages if recover failed
f2fs: fix data corruption issue with hardware encryption
f2fs: fix to recover inode->i_flags of inode block during POR
f2fs: spread f2fs_set_inode_flags()
f2fs: fix to spread clear_cold_data()
Revert "f2fs: fix to clear PG_checked flag in set_page_dirty()"
f2fs: account read IOs and use IO counts for is_idle
f2fs: fix to account IO correctly for cgroup writeback
f2fs: fix to account IO correctly
f2fs: remove request_list check in is_idle()
f2fs: allow to mount, if quota is failed
f2fs: update REQ_TIME in f2fs_cross_rename()
f2fs: do not update REQ_TIME in case of error conditions
f2fs: remove unneeded disable_nat_bits()
f2fs: remove unused sbi->trigger_ssr_threshold
f2fs: shrink sbi->sb_lock coverage in set_file_temperature()
f2fs: use rb_*_cached friends
f2fs: fix to recover cold bit of inode block during POR
...
- only support filesystems with unwritten extents
- add definition for statfs XFS magic number
- remove unused parameters around reflink code
- more debug for dangling delalloc extents
- cancel COW extents on extent swap targets
- fix quota stats output and clean up the code
- refactor some of the attribute code in preparation for parent pointers
- fix several buffer handling bugs
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJbz7CiAAoJEK3oKUf0dfodz3YP/jRBYAMall0h5sJZt0FHdIb1
y8bvDz3YaN3T9cFsSKC+2eRiDnxUwvX23Fm3rd3XhMZb0JkLwxX2Y4Tm+BMLeOtx
JcoFvO6xVNFugYgLHvxNh0jGfLAPRYzd1KOkoGBiDWl3GYijrXCU9wdXukHmYq9k
/JIfXjVabgmlLhOo1SnaXtqBOo140FjYAlL4h1sGBYxzeoRk/ltsXBnEzjwOJajY
EYKDjBMoGtxSh38EoXPHHP0Pf89miTE+B3Y7wkR+sURG/cptt6WN1MFCOmEOAPLH
RYpaZrFYLLTS4xvaDo0UMLuMTv72tabqEtjQ1Rj4bSPZaFb7QZVscpNkoFjuK2V7
iVJUE3bqlCAGoHdnK22a6Mq1c7inOEG/GGDv+V+xfdOZFlWtaUNyjSmDtpmwxR5D
0W8eSYaEYfvhQJ7I9066SWof8EtUdc8cc4P+hshego1DWzDF1TSpBEwwy2V+WUsC
l4NJyLwjUPjfuD/MSUly9N7bIEzgLM5sh+aSBGNY87ODnWbRTzoBbNjl2NOCWv58
2zT57WLlT7mBzQPE6yNpUGwrpubNEC5z+LzcQRfBedzx/Wh8XBnV/8Z6ETft3sO+
v63i5e11ejZBUSR1TGf8dIzQonBroJo7Zwk9ghxNs+KIXmG/8vQHN9EuC9X4IUm7
RD5X0oxgxuFBzrD3G9kD
=twNy
-----END PGP SIGNATURE-----
Merge tag 'xfs-4.20-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pul xfs updates from Dave Chinner:
"There's not a huge amount of change in this cycle - Darrick has been
out of action for a couple of months (hence me sending the last few
pull requests), so we decided a quiet cycle mainly focussed on bug
fixes was a good idea. Darrick will take the helm again at the end of
this merge window.
FYI, I may be sending another update later in the cycle - there's a
pending rework of the clone/dedupe_file_range code that fixes numerous
bugs that is spread amongst the VFS, XFS and ocfs2 code. It has been
reviewed and tested, Al and I just need to work out the details of the
merge, so it may come from him rather than me.
Summary:
- only support filesystems with unwritten extents
- add definition for statfs XFS magic number
- remove unused parameters around reflink code
- more debug for dangling delalloc extents
- cancel COW extents on extent swap targets
- fix quota stats output and clean up the code
- refactor some of the attribute code in preparation for parent
pointers
- fix several buffer handling bugs"
* tag 'xfs-4.20-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (21 commits)
xfs: cancel COW blocks before swapext
xfs: clear ail delwri queued bufs on unmount of shutdown fs
xfs: use offsetof() in place of offset macros for __xfsstats
xfs: Fix xqmstats offsets in /proc/fs/xfs/xqmstat
xfs: fix use-after-free race in xfs_buf_rele
xfs: Add attibute remove and helper functions
xfs: Add attibute set and helper functions
xfs: Add helper function xfs_attr_try_sf_addname
xfs: Move fs/xfs/xfs_attr.h to fs/xfs/libxfs/xfs_attr.h
xfs: issue log message on user force shutdown
xfs: fix buffer state management in xrep_findroot_block
xfs: always assign buffer verifiers when one is provided
xfs: xrep_findroot_block should reject root blocks with siblings
xfs: add a define for statfs magic to uapi
xfs: print dangling delalloc extents
xfs: fix fork selection in xfs_find_trim_cow_extent
xfs: remove the unused trimmed argument from xfs_reflink_trim_around_shared
xfs: remove the unused shared argument to xfs_reflink_reserve_cow
xfs: handle zeroing in xfs_file_iomap_begin_delay
xfs: remove suport for filesystems without unwritten extent flag
...
1. Andreas Gruenbacher contributed several patches to clean up the gfs2
block allocator to prepare for future performance enhancements.
2. Andy Price contributed a patch to fix a use-after-free problem.
3. I contributed some patches that fix gfs2's broken rgrplvb mount option.
4. I contributed some cleanup patches and error message improvements.
5. Steve Whitehouse and Abhi Das sent a patch to enable getlabel support.
6. Tim Smith contributed a patch to flush the glock delete workqueue at exit.
-----BEGIN PGP SIGNATURE-----
iQEcBAABAgAGBQJbz3QlAAoJENeLYdPf93o7+VUH/0XUfbyNQsU6fyLP8NrZq05z
qNsVN3Hm+tPc0/V0C75lSp9ej7B8ogMl0RPysniziRTEWIDK6oGB/JUGvIH0O/z1
vZ/sofZEDXthV3YjiI8RXLcaJLsOavSXnGwHbNKohM2PdRObVkZbaUL+xWlL9X3q
yHgP5AHCIrpVzz5l4sLO6N0Npnl0aNRTBxPIyDTaBBmitXkvtqkCbkw185jzkDDs
fMvZ6I+3UbUxp99InFTHeUXvTr1EbvfPrhZzmppuV1N4LLSa1eRaWmsKTEDPdnsy
uhsh9ittv8EJXN2dpmZIOdGmDEK07kFoZrsbM5F78sOH/LbUyJ5YfBN02lDaaAI=
=b7zk
-----END PGP SIGNATURE-----
Merge tag 'gfs2-4.20.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2
Pull gfs2 updates from Bob Peterson:
"We've got 18 patches for this merge window, none of which are very
major:
- clean up the gfs2 block allocator to prepare for future performance
enhancements (Andreas Gruenbacher)
- fix a use-after-free problem (Andy Price)
- patches that fix gfs2's broken rgrplvb mount option (me)
- cleanup patches and error message improvements (me)
- enable getlabel support (Steve Whitehouse and Abhi Das)
- flush the glock delete workqueue at exit (Tim Smith)"
* tag 'gfs2-4.20.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
gfs2: Fix minor typo: couln't versus couldn't.
gfs2: write revokes should traverse sd_ail1_list in reverse
gfs2: Pass resource group to rgblk_free
gfs2: Remove unnecessary gfs2_rlist_alloc parameter
gfs2: Fix marking bitmaps non-full
gfs2: Fix some minor typos
gfs2: Rename bitmap.bi_{len => bytes}
gfs2: Remove unused RGRP_RSRV_MINBYTES definition
gfs2: Move rs_{sizehint, rgd_gh} fields into the inode
gfs2: Clean up out-of-bounds check in gfs2_rbm_from_block
gfs2: Always check the result of gfs2_rbm_from_block
gfs2: getlabel support
GFS2: Flush the GFS2 delete workqueue before stopping the kernel threads
gfs2: Don't leave s_fs_info pointing to freed memory in init_sbd
gfs2: Use fs_* functions instead of pr_* function where we can
gfs2: slow the deluge of io error messages
gfs2: Don't set GFS2_RDF_UPTODATE when the lvb is updated
gfs2: improve debug information when lvb mismatches are found
fixes:
+ fix superfluous service_operation return code check in orangefs_lookup
+ fix some error code paths that missed kmem_cache_free
+ don't let orangefs_iget return NULL
+ don't let orangefs_new_inode return NULL
+ cache NULL when both default_acl and acl are NULL
cleanup:
+ rate limit the client not running info message
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJbzzyfAAoJEM9EDqnrzg2+gfQP/1HC41/2Xg+8NFzxASN4Cd6f
6MJGGG3yfvq2PO+APQFG4SQNtFQO+8CDB3rIehCBQxKhHcjanr5ejzpoALaFcgEu
VC1Pw4cZcKixxJyRf9LZChI2uzfFTVn3pna5y1ABUZA7r+LTvg/oMXkJH9BS03Xk
r0onRKd0/nsde4dhnNlFizQuSmddzvObeZdAgqL4QWzh2J1Zs7ehtqcGu8JTDK2G
nVm+tXyqllQmigk/blhE6lydxrQQt0w95+DlUf6x+PmH5pp7MKqi5kzelbo1ND/9
BfSK4AvCdJhRjQNValop9Pafu55sj5RZEoG/GOSCvU3bdghoQi1mtuSVCEUblvR6
EOJWd31y1Shk+XtVOFUNwo1jk/FhZOkGZNBY4xYjMmTUA3np4rpB2HMEkt4Sy/EO
cOnS4CoB4Wc/36ZAGAfthNhMH66igMBdA7acDx91DeCFzBPn7SmstVDDVj6rdcGr
MfyzvQaYooqHdWF3PzK97EsQW7ZXk8YPpUCbrF6+cxYfZN4DT4qJzKImXRHMLTiV
qbUVtgYyzqa6caBHpVphq3evA0//vk4qK+fuIFZs/+cFZPBNtJNhye9q87DftcMx
b8SiSNjBL4Q+/DGD6kvLwuX53PfxVAgllbk0Ncql2aSvCTvOirfpYbvM55RDapuJ
dOuvR1rhzsP93t+dwRP4
=iVPX
-----END PGP SIGNATURE-----
Merge tag 'for-linus-4.20-ofs1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux
Pull orangefs updates from Mike Marshall:
"Fixes and a cleanup.
Fixes:
- fix superfluous service_operation return code check in
orangefs_lookup
- fix some error code paths that missed kmem_cache_free
- don't let orangefs_iget return NULL
- don't let orangefs_new_inode return NULL
- cache NULL when both default_acl and acl are NULL
Cleanup:
- rate limit the client not running info message"
* tag 'for-linus-4.20-ofs1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux:
orangefs: no need to check for service_operation returns > 0
orangefs: some error code paths missed kmem_cache_free
orangefs: don't let orangefs_iget return NULL.
orangefs: don't let orangefs_new_inode return NULL
orangefs: rate limit the client not running info message
orangefs: cache NULL when both default_acl and acl are NULL
Pull vfs fixes from Al Viro.
* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
gfs2_meta: ->mount() can get NULL dev_name
ecryptfs_rename(): verify that lower dentries are still OK after lock_rename()
cachefiles: fix the race between cachefiles_bury_object() and rmdir(2)
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEIodevzQLVs53l6BhNqiEXrVAjGQFAlvPJVQACgkQNqiEXrVA
jGQgUQ//QEMIcY3LZoHSUGfcI2Il4oBwLBnE1604FFNuhbBPiceVXcLIgPMeAXDt
fMFaujgM/KEa+uj1tWdCD7kGiqlJo6SWzHVaF1+jwD+p2TZda0QxqAEa98lC07wb
QrS+VGF0EFxYWmuJryB25KhhIDQjy0ePK78rXieNGcb9MaKsXxPeT2zgRFMv550v
+XnbPzf2vpuNc5xh0l8ceNRETfa/SJDupWVqt7s7P3UrARiFPsVNSBqiRnHRLhS6
HKQcuhVZHgphS5b3f8F75Opie2r/TZxFs7AaPHc3/W95bDVSHjXzojzn8/NVdN0e
aLDEhhlqgx0fAC46TanxqQmnTYyD67LdMLQokU99ia60WdPdVhq3FvX9jX6EwRgj
pQg1CE3RD3QCUqvMm0h/a/JmlegecrwFHeoE77DFj151s5/ceuT7yYrWHWUonfVs
FwbuX1bWWd5zOKLifpf6tkXtTitdacUAxxBXHIFgaHL5HNS4vNtf9uTn3QKO2FHN
ZVzAbsbLyShGRIdP0Z9Jf98m6HIFwKx4qQEDHgMbM9h+oVROiJIada1OWYxAK4yH
nx9B8+mya0ZU4ozfDCL1JTdGMKD/rfjonDve1XCO5AVn8L+lxvU/SUbFZQr/UjeI
Iw/xSOtFq/1Ma/K//gGNSFMhukDxtOD3PP8lJNSVXet4/wkag30=
=hDy2
-----END PGP SIGNATURE-----
Merge tag 'jfs-for-4.20' of git://github.com/kleikamp/linux-shaggy
Pull jfs updates from David Kleikamp:
"Just a few small fixes"
* tag 'jfs-for-4.20' of git://github.com/kleikamp/linux-shaggy:
jfs: remove redundant dquot_initialize() in jfs_evict_inode()
jfs: remove quota option from ignore list
jfs: cache NULL when both default_acl and acl are NULL
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAlvN/4kACgkQxWXV+ddt
WDtA0Q//fIKXjjxDLa/EM4fz63urfczoxq/nVRVXveoTtAjT+QOTIY0iRblLhKQg
ehy86ygYmwfnBjCzZRTIjtvFJtt93kWH3VtqGAOMcvA/sMGP68CDn8STyIaCsw2o
UN0qEIYcQ5wj5aJlJhlWZgx9bvjK3xkMuNO/uPWz0M0OL+KTk6O265FjIot0M6Pa
XsihJCn9ZrWL9peevGDUXdoPuXQKaU8aW47qe+989Ya0oPD11Knpn75J+t3U07/h
sty6pFIQMCBSKOjXnCKhysv3wSyewzRyMznVPz6EhRfzn7od5GLmvG7cANiWcV6K
jep5Hd7cJq/BeIwpTSAxh+ygxbce8EGIm9NyUPAPXDPtfgMv1Zf2QLEPZhi729xk
OpSF+eOGuRdSCur7ng629LqOANjb+D1939QKrzIwO5SC3xSZ5Ht258IWcJ+FlDNz
Cfxk8b3rWhsS9qSSmiq3kdQ3ECEOzFBYx7d8m2FZ2z3mViA1VdIROHhX97VCAcVu
X1dq5kUycyioGak86Ce/cexpmkwcwf12ypCZWz7tL+pfgDKcAKFHjA0vNLzYjZEy
mxHZijjJXrg8vWTqMCqgDhIDQVzmaOjFo7upXc+5MsL8sbC71V38QvTZ3F6jcnAa
kmiE8lDRQmnM38/5U1EEjROHpBaRLbpSfZvQcfSOGNkaycb2cFI=
=g5+b
-----END PGP SIGNATURE-----
Merge tag 'for-4.20-part1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"This is the first batch with fixes and some nice performance
improvements.
Preliminary results show eg. more files/sec in fsmark, better perf on
multi-threaded workloads (filebench, dbench), fewer context switches
and overall better memory allocation characteristics (multiple
benchmarks).
Apart from general performance, there's an improvement for qgroups +
balance workload that's been troubling our users.
Note for stable: there are 20+ patches tagged for stable, out of 90.
Not all of them apply cleanly on all stable versions but the conflicts
are mostly due to simple cleanups and resolving should be obvious. The
fixes are otherwise independent.
Performance improvements:
- transition between blocking and spinning modes of path is gone,
which originally resulted to more unnecessary wakeups and updates
to the path locks, the effects are measurable and improve latency
and scalability
- qgroups: first batch of changes that should speedup balancing with
qgroups on, skip quota accounting on unchanged subtrees, overall
gain is about 30+% in runtime
- use rb-tree with cached first node for several structures, small
improvement to avoid pointer chasing
Fixes:
- trim
- fix: some blockgroups could have been missed if their logical
address was past the total filesystem size (ie. after a lot of
balancing)
- better error reporting, after processing blockgroups and whole
device
- fix: continue trimming block groups after an error is
encountered
- check for trim support of the device earlier and avoid some
unnecessary work
- less interaction with transaction commit that improves latency
on slower storage (eg. image files over NFS)
- fsync
- fix warning when replaying log after fsync of a O_TMPFILE
- fix wrong dentries after fsync of file that got its parent
replaced
- qgroups: fix rescan that might misc some dirty groups
- don't clean dirty pages during buffered writes, this could lead to
lost updates in some corner cases
- some block groups could have been delayed in creation, if the
allocation triggered another one
- error handling improvements
Cleanups:
- removed unused struct members and variables
- function return type cleanups
- delayed refs code refactoring
- protect against deadlock that could be caused by crafted image that
tries to allocate from a tree that's locked already"
* tag 'for-4.20-part1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (93 commits)
btrfs: switch return_bigger to bool in find_ref_head
btrfs: remove fs_info from btrfs_should_throttle_delayed_refs
btrfs: remove fs_info from btrfs_check_space_for_delayed_refs
btrfs: delayed-ref: pass delayed_refs directly to btrfs_delayed_ref_lock
btrfs: delayed-ref: pass delayed_refs directly to btrfs_select_ref_head
btrfs: qgroup: move the qgroup->members check out from (!qgroup)'s else branch
btrfs: relocation: Remove redundant tree level check
btrfs: relocation: Cleanup while loop using rbtree_postorder_for_each_entry_safe
btrfs: qgroup: Avoid calling qgroup functions if qgroup is not enabled
Btrfs: fix wrong dentries after fsync of file that got its parent replaced
Btrfs: fix warning when replaying log after fsync of a tmpfile
btrfs: drop min_size from evict_refill_and_join
btrfs: assert on non-empty delayed iputs
btrfs: make sure we create all new block groups
btrfs: reset max_extent_size on clear in a bitmap
btrfs: protect space cache inode alloc with GFP_NOFS
btrfs: release metadata before running delayed refs
Btrfs: kill btrfs_clear_path_blocking
btrfs: dev-replace: remove pointless assert in write unlock
btrfs: dev-replace: move replace members out of fs_info
...
Pull tty ioctl updates from Al Viro:
"This is the compat_ioctl work related to tty ioctls.
Quite a bit of dead code taken out, all tty-related stuff gone from
fs/compat_ioctl.c. A bunch of compat bugs fixed - some still remain,
but all more or less generic tty-related ioctls should be covered
(remaining issues are in things like driver-private ioctls in a pcmcia
serial card driver not getting properly handled in 32bit processes on
64bit host, etc)"
* 'work.tty-ioctl' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (53 commits)
kill TIOCSERGSTRUCT
change semantics of ldisc ->compat_ioctl()
kill TIOCSER[SG]WILD
synclink_gt(): fix compat_ioctl()
pty: fix compat ioctls
compat_ioctl - kill keyboard ioctl handling
gigaset: add ->compat_ioctl()
vt_compat_ioctl(): clean up, use compat_ptr() properly
gigaset: don't try to printk userland buffer contents
dgnc: don't bother with (empty) stub for TCXONC
dgnc: leave TIOC[GS]SOFTCAR to ldisc
remove fallback to drivers for TIOCGICOUNT
dgnc: break-related ioctls won't reach ->ioctl()
kill the rest of tty COMPAT_IOCTL() entries
dgnc: TIOCM... won't reach ->ioctl()
isdn_tty: TCSBRK{,P} won't reach ->ioctl()
kill capinc_tty_ioctl()
take compat TIOC[SG]SERIAL treatment into tty_compat_ioctl()
synclink: reduce pointless checks in ->ioctl()
complete ->[sg]et_serial() switchover
...
We have hit this intermittently, increase the verbosity of
warning message on unexpected mid cancellation.
Signed-off-by: Steve French <stfrench@microsoft.com>
Change these free functions to allow passing NULL as the argument and
treat it as a no-op just like free(NULL) would.
Or, if rqst->rq_iov is NULL.
The second scenario could happen for smb2_queryfs() if the call
to SMB2_query_info_init() fails and we go to qfs_exit to clean up
and free all resources.
In that case we have not yet assigned rqst[2].rq_iov and thus
the rq_iov dereference in SMB2_close_free() will cause a NULL pointer
dereference.
Fixes: 1eb9fb5204 ("cifs: create SMB2_open_init()/SMB2_open_free() helpers")
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
CC: Stable <stable@vger.kernel.org>
Pull siginfo updates from Eric Biederman:
"I have been slowly sorting out siginfo and this is the culmination of
that work.
The primary result is in several ways the signal infrastructure has
been made less error prone. The code has been updated so that manually
specifying SEND_SIG_FORCED is never necessary. The conversion to the
new siginfo sending functions is now complete, which makes it
difficult to send a signal without filling in the proper siginfo
fields.
At the tail end of the patchset comes the optimization of decreasing
the size of struct siginfo in the kernel from 128 bytes to about 48
bytes on 64bit. The fundamental observation that enables this is by
definition none of the known ways to use struct siginfo uses the extra
bytes.
This comes at the cost of a small user space observable difference.
For the rare case of siginfo being injected into the kernel only what
can be copied into kernel_siginfo is delivered to the destination, the
rest of the bytes are set to 0. For cases where the signal and the
si_code are known this is safe, because we know those bytes are not
used. For cases where the signal and si_code combination is unknown
the bits that won't fit into struct kernel_siginfo are tested to
verify they are zero, and the send fails if they are not.
I made an extensive search through userspace code and I could not find
anything that would break because of the above change. If it turns out
I did break something it will take just the revert of a single change
to restore kernel_siginfo to the same size as userspace siginfo.
Testing did reveal dependencies on preferring the signo passed to
sigqueueinfo over si->signo, so bit the bullet and added the
complexity necessary to handle that case.
Testing also revealed bad things can happen if a negative signal
number is passed into the system calls. Something no sane application
will do but something a malicious program or a fuzzer might do. So I
have fixed the code that performs the bounds checks to ensure negative
signal numbers are handled"
* 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (80 commits)
signal: Guard against negative signal numbers in copy_siginfo_from_user32
signal: Guard against negative signal numbers in copy_siginfo_from_user
signal: In sigqueueinfo prefer sig not si_signo
signal: Use a smaller struct siginfo in the kernel
signal: Distinguish between kernel_siginfo and siginfo
signal: Introduce copy_siginfo_from_user and use it's return value
signal: Remove the need for __ARCH_SI_PREABLE_SIZE and SI_PAD_SIZE
signal: Fail sigqueueinfo if si_signo != sig
signal/sparc: Move EMT_TAGOVF into the generic siginfo.h
signal/unicore32: Use force_sig_fault where appropriate
signal/unicore32: Generate siginfo in ucs32_notify_die
signal/unicore32: Use send_sig_fault where appropriate
signal/arc: Use force_sig_fault where appropriate
signal/arc: Push siginfo generation into unhandled_exception
signal/ia64: Use force_sig_fault where appropriate
signal/ia64: Use the force_sig(SIGSEGV,...) in ia64_rt_sigreturn
signal/ia64: Use the generic force_sigsegv in setup_frame
signal/arm/kvm: Use send_sig_mceerr
signal/arm: Use send_sig_fault where appropriate
signal/arm: Use force_sig_fault where appropriate
...
Pull networking updates from David Miller:
1) Add VF IPSEC offload support in ixgbe, from Shannon Nelson.
2) Add zero-copy AF_XDP support to i40e, from Björn Töpel.
3) All in-tree drivers are converted to {g,s}et_link_ksettings() so we
can get rid of the {g,s}et_settings ethtool callbacks, from Michal
Kubecek.
4) Add software timestamping to veth driver, from Michael Walle.
5) More work to make packet classifiers and actions lockless, from Vlad
Buslov.
6) Support sticky FDB entries in bridge, from Nikolay Aleksandrov.
7) Add ipv6 version of IP_MULTICAST_ALL sockopt, from Andre Naujoks.
8) Support batching of XDP buffers in vhost_net, from Jason Wang.
9) Add flow dissector BPF hook, from Petar Penkov.
10) i40e vf --> generic iavf conversion, from Jesse Brandeburg.
11) Add NLA_REJECT netlink attribute policy type, to signal when users
provide attributes in situations which don't make sense. From
Johannes Berg.
12) Switch TCP and fair-queue scheduler over to earliest departure time
model. From Eric Dumazet.
13) Improve guest receive performance by doing rx busy polling in tx
path of vhost networking driver, from Tonghao Zhang.
14) Add per-cgroup local storage to bpf
15) Add reference tracking to BPF, from Joe Stringer. The verifier can
now make sure that references taken to objects are properly released
by the program.
16) Support in-place encryption in TLS, from Vakul Garg.
17) Add new taprio packet scheduler, from Vinicius Costa Gomes.
18) Lots of selftests additions, too numerous to mention one by one here
but all of which are very much appreciated.
19) Support offloading of eBPF programs containing BPF to BPF calls in
nfp driver, frm Quentin Monnet.
20) Move dpaa2_ptp driver out of staging, from Yangbo Lu.
21) Lots of u32 classifier cleanups and simplifications, from Al Viro.
22) Add new strict versions of netlink message parsers, and enable them
for some situations. From David Ahern.
23) Evict neighbour entries on carrier down, also from David Ahern.
24) Support BPF sk_msg verdict programs with kTLS, from Daniel Borkmann
and John Fastabend.
25) Add support for filtering route dumps, from David Ahern.
26) New igc Intel driver for 2.5G parts, from Sasha Neftin et al.
27) Allow vxlan enslavement to bridges in mlxsw driver, from Ido
Schimmel.
28) Add queue and stack map types to eBPF, from Mauricio Vasquez B.
29) Add back byte-queue-limit support to r8169, with all the bug fixes
in other areas of the driver it works now! From Florian Westphal and
Heiner Kallweit.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (2147 commits)
tcp: add tcp_reset_xmit_timer() helper
qed: Fix static checker warning
Revert "be2net: remove desc field from be_eq_obj"
Revert "net: simplify sock_poll_wait"
net: socionext: Reset tx queue in ndo_stop
net: socionext: Add dummy PHY register read in phy_write()
net: socionext: Stop PHY before resetting netsec
net: stmmac: Set OWN bit for jumbo frames
arm64: dts: stratix10: Support Ethernet Jumbo frame
tls: Add maintainers
net: ethernet: ti: cpsw: unsync mcast entries while switch promisc mode
octeontx2-af: Support for NIXLF's UCAST/PROMISC/ALLMULTI modes
octeontx2-af: Support for setting MAC address
octeontx2-af: Support for changing RSS algorithm
octeontx2-af: NIX Rx flowkey configuration for RSS
octeontx2-af: Install ucast and bcast pkt forwarding rules
octeontx2-af: Add LMAC channel info to NIXLF_ALLOC response
octeontx2-af: NPC MCAM and LDATA extract minimal configuration
octeontx2-af: Enable packet length and csum validation
octeontx2-af: Support for VTAG strip and capture
...
Make the output of /proc/fs/cifs/DebugData a little easier to
read by cleaning up the listing of network interfaces removing
a wasted line break.
Here is a comparison of the network interface information
that from be viewed at the end of output from
"cat /proc/fs/cifs/DebugData"
Before:
Server interfaces: 8
0)
Speed: 10000000000 bps
Capabilities: rss
IPv6: fe80:0000:0000:0000:2cf5:407e:84b0:21dd
1)
Speed: 1000000000 bps
Capabilities:
IPv6: fe80:0000:0000:0000:61cd:6147:3d0c:f484
vs. after:
Server interfaces: 11
0) Speed: 10000000000 bps
Capabilities: rss
IPv6: fe80:0000:0000:0000:2cf5:407e:84b0:21dd
1) Speed: 2000000000 bps
Capabilities:
IPv6: fe80:0000:0000:0000:3d76:2d05:dcf8:ed10
Signed-off-by: Steve French <stfrench@microsoft.com>
To allow better debugging (for example applications with
handle leaks, or complex reconnect scenarios) display the
number of open files (on the client) and number of open
server file handles for each tcon in /proc/fs/cifs/Stats.
Note that open files on server is one larger than local
due to handle caching (in this case of the root of
the share). In this example there are two local
open files, and three (two file and one directory handle)
open on the server.
Sample output:
$ cat /proc/fs/cifs/Stats
Resources in use
CIFS Session: 1
Share (unique mount targets): 2
SMB Request/Response Buffer: 1 Pool size: 5
SMB Small Req/Resp Buffer: 1 Pool size: 30
Operations (MIDs): 0
0 session 0 share reconnects
Total vfs operations: 36 maximum at one time: 2
1) \\localhost\test
SMBs: 69
Bytes read: 27 Bytes written: 0
Open files: 2 total (local), 3 open on server
TreeConnects: 1 total 0 failed
TreeDisconnects: 0 total 0 failed
Creates: 19 total 0 failed
Closes: 16 total 0 failed
...
Signed-off-by: Steve French <stfrench@microsoft.com>
We do not call cifs_open_file() for directories and thus we do not have a
pSMBFile we can extract the FIDs from.
Solve this by instead always using a compounded open/query/close for
the passthrough ioctl.
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
In cases where queryinfo fails, we have cases in cifs (vers=1.0)
where with backupuid mounts we retry the query info with findfirst.
This doesn't work to some NetApp servers which don't support
WindowsXP (and later) infolevel 261 (SMB_FIND_FILE_ID_FULL_DIR_INFO)
so in this case use other info levels (in this case it will usually
be level 257, SMB_FIND_FILE_DIRECTORY_INFO).
(Also fixes some indentation)
See kernel bugzilla 201435
Signed-off-by: Steve French <stfrench@microsoft.com>
If backupuid mount option is sent, we can incorrectly retry
(on access denied on query info) with a cifs (FindFirst) operation
on an smb3 mount which causes the server to force the session close.
We set backup intent on open so no need for this fallback.
See kernel bugzilla 201435
Signed-off-by: Steve French <stfrench@microsoft.com>
CC: Stable <stable@vger.kernel.org>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
When mounting with backupuid set, we should be setting
CREATE_OPEN_BACKUP_INTENT flag on compounded opens as well,
especially the case of compounded smb2_query_path_info.
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
writepages and readpages operations did not call get/free_xid
so the statistics for file copy could get confusing with "vfs operations"
not increasing. Add get_xid and free_xid to cifs readpages and
writepages functions.
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
There is a potential execution path in which variable *resp_buftype*
is passed as an argument to function free_rsp_buf(), in which it is
used in a comparison without being properly initialized previously.
Fix this by initializing variable *resp_buftype* to CIFS_NO_BUFFER
in order to avoid unpredictable or unintended results.
Addresses-Coverity-ID: 1473971 ("Uninitialized scalar variable")
Fixes: c5d25bdb2967 ("cifs: add IOCTL for QUERY_INFO passthrough to userspace")
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
This allows userspace tools to query the raw info levels for cifs files
and process the response in userspace.
In particular this is useful for many of those data where there is no
corresponding native data structure in linux.
For example querying the security descriptor for a file and extract the
SIDs.
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Clarify meaning (in comments) meaning of various
options for debug messages in cifs.ko. Also fixed
trivial formatting/style issue with previous patch.
Signed-off-by: Steve French <stfrench@microsoft.com>
Currently, no messages are printed when mounting a CIFS filesystem and
no debug configuration is enabled.
However, a CIFS mount information is valuable when troubleshooting
and/or forensic analyzing a system and finding out if was a CIFS
endpoint mount attempted.
Other filesystems such as XFS, EXT* does issue a printk() when mounting
their filesystems.
A terse log message is printed only if cifsFYI is not enabled. Otherwise,
the default full debug message is printed.
In order to not clutter and classify correctly the event messages, these
are logged as KERN_INFO level.
Sample mount operations:
[root@corinthians ~]# mount -o user=administrator //172.25.250.18/c$ /mnt
(non-existent system)
[root@corinthians ~]# mount -o user=administrator //172.25.250.19/c$ /mnt
(Valid system)
Kernel message log for the mount operations:
[ 450.464543] CIFS: Attempting to mount //172.25.250.18/c$
[ 456.478186] CIFS VFS: Error connecting to socket. Aborting operation.
[ 456.478381] CIFS VFS: cifs_mount failed w/return code = -113
[ 467.688866] CIFS: Attempting to mount //172.25.250.19/c$
Signed-off-by: Rodrigo Freire <rfreire@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Currently, CIFS lacks a internal logging function that prints out data
when CIFS_DEBUG=n. When CIFS_DEBUG=y, the only message level for CIFS
events are KERN_ERR or KERN_DEBUG.
This patch creates cifs_info(), which is useful for printing
non-critical event messges, at either CIFS_DEBUG state.
Signed-off-by: Rodrigo Freire <rfreire@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
RHBZ 1484130
Update cifs_find_fid_lock_conflict() to recognize that
ODF locks do not conflict with eachother.
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
It is not necessary to deregister a memory registration after it has been
successfully invalidated.
Signed-off-by: Long Li <longli@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
When issuing SMB1 read/write, pass the page offset to transport.
Signed-off-by: Long Li <longli@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
In some error conditions, resp_buftype can be passed uninitialised to
free_rsp_buf(), potentially resulting in a spurious debug message.
If resp_buftype randomly had the value 1 (CIFS_SMALL_BUFFER) then this
would log a debug message.
The rsp pointer is initialised to NULL so there is no other side-effect.
Detected by CoverityScan, CID 1438585 ("Uninitialized scalar variable")
Detected by CoverityScan, CID 1438667 ("Uninitialized scalar variable")
Detected by CoverityScan, CID 1438764 ("Uninitialized scalar variable")
Signed-off-by: Garry McNulty <garrmcnu@gmail.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Be able to log a ftrace message on success and/or failure of
sending a lease break response to the server.
Example output:
TASK-PID CPU# |||| TIMESTAMP FUNCTION
| | | |||| | |
kworker/1:1-5681 [001] .... 11123.530457: smb3_lease_done: sid=0x291e3e0f tid=0x8ba43071 lease_key=0x1852ca0d3ecd9b55847750a86716fde lease_state=0x0
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
In network file system it is fairly easy for server and client
atime vs. mtime to get confused (and atime updated less frequently)
which we noticed broke some apps which expect atime >= mtime
Also ignore relatime mount option (rather than error on it) since
relatime is basically what some network server fs are doing
(relatime).
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Modern servers often support 8MB as maximum i/o size, and we see some
performance benefits (my testing showed 1 to 13% on write paths,
and 1 to 3% on read paths for increasing the default to 4MB). If server
doesn't support larger i/o size, during negotiate protocol it is already
set correctly to the server's maximum if lower than 4MB.
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
As we reset credits later in the reconnect path, useful
to have optional (cifsFYI) debug message.
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
tcon->Flags is only used by SMB1 code and changing it is not permanent
(you lose the setting on tcon reconnect).
* Move the setting to superblock flags (per mount-points).
* Make automount callback exit early when flag present
* Make dfs resolving happening in mount syscall exit early if flag present
Signed-off-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Acked-by: Pavel Shilovsky <pshilov@microsoft.com>
Each time we reconnect to the same server, bump an instance
counter (and display in /proc/fs/cifs/DebugData) to make it
easier to debug.
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Previously reserved dpen response field changed in smb3
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
/proc/fs/cifs/Stats when CONFIG_CIFS_STATS2 is enabled logs 'slow'
responses, but depending on the server you are debugging a
one second timeout may be too fast, so allow setting it to
a larger number of seconds via new module parameter
/sys/module/cifs/parameters/slow_rsp_threshold
or via modprobe:
slow_rsp_threshold:Amount of time (in seconds) to wait before
logging that a response is delayed.
Default: 1 (if set to 0 disables msg). (uint)
Recommended values are 0 (disabled) to 32767 (9 hours) with
the default remaining as 1 second.
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
note smb3 (and common more modern servers) in the module description
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
For a network file system we generally prefer large i/o, but
if the server returns invalid file system block/sector sizes
in cifs (vers=1.0) QFSInfo then set block size to a default
of a reasonable minimum (4K).
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Currently, "echo 0 > /proc/fs/cifs/Stats" resets all of the stats
except the session and share reconnect counts. Fix it to
reset those as well.
CC: Stable <stable@vger.kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
When "backup intent" is requested on the mount (e.g. backupuid or
backupgid mount options), the corresponding flag was missing from
some of the new compounding operations as well (now that
open_query_close is gone).
Related to kernel bugzilla #200953
Reported-and-tested-by: <whh@rubrik.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
So we don't overflow the io vector arrays accidentally
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Get rid of smb2_open_op_close() as all operations are now migrated
to smb2_compound_op().
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
We never pass is_falloc==true here anyway and if we ever need to support
is_falloc in the future, SMB2_set_eof is such a trivial wrapper around
send_set_info() that we can/should just create a differently named wrapper
for that new functionality.
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Cuts number of network roundtrips significantly for some common syscalls
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
This changes SMB2_OP_SET_EOF to use compounding in some situations.
This is part of the path based API to truncate a file.
Most of the time this will however not be invoked for SMB2 since
cifs_set_file_size() will as far as I can tell almost always just
open the file synchronously and switch to the handle based truncate
code path, thus bypassing the compounding we add here.
Rewriting cifs_set_file_size() and make that whole pile of code more
compounding friendly, and also easier to read and understand, is a
different project though and not for this patch.
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
This and previous patches drop the number of roundtrips we need for rmdir()
from 6 to 2.
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
so that we can use these later for compounded set-info calls.
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
This,and previous patches, drops the number of roundtrips from five to two
for unlink()
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
This with the previous patch changes mkdir() from needing 6 roundtrips to
just 3.
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
This turns most open/query-info/close patterns in cifs.ko
to become compounds.
This changes stat from using 3 roundtrips to just a single one.
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
When processing the mids for compounds we would only add credits based on
the last successful mid in the compound which would leak credits and
eventually triggering a re-connect.
Fix this by splitting the mid processing part into two loops instead of one
where the first loop just waits for all mids and then counts how many
credits we were granted for the whole compound.
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Add tracepoint to catch potential cases where a pending operation overlapping a
reconnect could fail and incorrectly refund its credits causing the client
to think it has more credits available than the server thinks it does.
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Fixes gcc '-Wunused-but-set-variable' warning:
fs/cifs/ioctl.c: In function 'cifs_ioctl':
fs/cifs/ioctl.c:164:23: warning:
variable 'cifs_sb' set but not used [-Wunused-but-set-variable]
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Use kmemdup rather than duplicating its implementation
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Some servers (e.g. Azure) return "STATUS_NOT_IMPLEMENTED" rather
than "STATUS_INVALID_DEVICE_REQUEST" on query network interface
info at mount. This shouldn't cause us to log a warning message
automatically. Don't log this unless noisier cifsFYI is enabled.
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Send probes to all the unprobed fileservers in a fileserver list on all
addresses simultaneously in an attempt to find out the fastest route whilst
not getting stuck for 20s on any server or address that we don't get a
reply from.
This alleviates the problem whereby attempting to access a new server can
take a long time because the rotation algorithm ends up rotating through
all servers and addresses until it finds one that responds.
Signed-off-by: David Howells <dhowells@redhat.com>
In some circumstances, the callback interest pointer is NULL, so in such a
case we can't dereference it when checking to see if the callback is
broken. This causes an oops in some circumstances.
Fix this by replacing the function that worked out the aggregate break
counter with one that actually does the comparison, and then make that
return true (ie. broken) if there is no callback interest as yet (ie. the
pointer is NULL).
Fixes: 68251f0a68 ("afs: Fix whole-volume callback handling")
Signed-off-by: David Howells <dhowells@redhat.com>
Eliminate the address pointer from the address list cursor as it's
redundant (ac->addrs[ac->index] can be used to find the same address) and
address lists must be replaced rather than being rearranged, so is of
limited value.
Signed-off-by: David Howells <dhowells@redhat.com>
Provide an option to allow the file or volume location server cursor to be
dumped if the rotation routine falls off the end without managing to
contact a server.
Signed-off-by: David Howells <dhowells@redhat.com>
Implement support for talking to YFS-variant fileservers in the cache
manager and the filesystem client. These implement upgraded services on
the same port as their AFS services.
YFS fileservers provide expanded capabilities over AFS.
Signed-off-by: David Howells <dhowells@redhat.com>
Expand fields in various data structures to support the expanded
information that YFS is capable of returning.
Signed-off-by: David Howells <dhowells@redhat.com>
Get the target vnode in afs_rmdir() and validate it before we attempt the
deletion, The vnode pointer will be passed through to the delivery function
in a later patch so that the delivery function can mark it deleted.
Signed-off-by: David Howells <dhowells@redhat.com>
Calculate the callback expiration time at the point of operation reply
delivery, using the reply time queried from AF_RXRPC on that call as a
base.
Signed-off-by: David Howells <dhowells@redhat.com>
The FS.FetchStatus reply delivery function was updating inode of the
directory in which a lookup had been done with the status of the looked up
file. This corrupts some of the directory state.
Fixes: 5cf9dd55a0 ("afs: Prospectively look up extra files when doing a single lookup")
Signed-off-by: David Howells <dhowells@redhat.com>
Implement the YFS cache manager service which gives extra capabilities on
top of AFS. This is done by listening for an additional service on the
same port and indicating that anyone requesting an upgrade should be
upgraded to the YFS port.
Signed-off-by: David Howells <dhowells@redhat.com>
Remove unnecessary details of a broken callback, such as version, expiry
and type, from the afs_callback_break struct as they're not actually used
and make the list take more memory.
Signed-off-by: David Howells <dhowells@redhat.com>
Call the function to commit the status on a new file, dir or symlink so
that the access rights for the caller's key are cached for that object.
Without this, the next access to the file will cause a FetchStatus
operation to be emitted to retrieve the access rights.
Signed-off-by: David Howells <dhowells@redhat.com>
Increase the sizes of the volume ID to 64 bits and the vnode ID (inode
number equivalent) to 96 bits to allow the support of YFS.
This requires the iget comparator to check the vnode->fid rather than i_ino
and i_generation as i_ino is not sufficiently capacious. It also requires
this data to be placed into the vnode cache key for fscache.
For the moment, just discard the top 32 bits of the vnode ID when returning
it though stat.
Signed-off-by: David Howells <dhowells@redhat.com>
When writing a new page, clear space in the page rather than attempting to
load it from the server if the space is beyond the EOF.
Signed-off-by: David Howells <dhowells@redhat.com>
Fix afs_deliver_to_call() to handle -EIO being returned by the operation
delivery function, indicating that the call found itself in the wrong
state, by printing an error and aborting the call.
Currently, an assertion failure will occur. This can happen, say, if the
delivery function falls off the end without calling afs_extract_data() with
the want_more parameter set to false to collect the end of the Rx phase of
a call.
The assertion failure looks like:
AFS: Assertion failed
4 == 7 is false
0x4 == 0x7 is false
------------[ cut here ]------------
kernel BUG at fs/afs/rxrpc.c:462!
and is matched in the trace buffer by a line like:
kworker/7:3-3226 [007] ...1 85158.030203: afs_io_error: c=0003be0c r=-5 CM_REPLY
Fixes: 98bf40cd99 ("afs: Protect call->state changes against signals")
Reported-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Currently the TTL on VL server and address lists isn't set in all
circumstances and may be set to poor choices in others, since the TTL is
derived from the SRV/AFSDB DNS record if and when available.
Fix the TTL by limiting the range to a minimum and maximum from the current
time. At some point these can be made into sysctl knobs. Further, use the
TTL we obtained from the upcall to set the expiry on negative results too;
in future a mechanism can be added to force reloading of such data.
Signed-off-by: David Howells <dhowells@redhat.com>
Track VL servers as independent entities rather than lumping all their
addresses together into one set and implement server-level rotation by:
(1) Add the concept of a VL server list, where each server has its own
separate address list. This code is similar to the FS server list.
(2) Use the DNS resolver to retrieve a set of servers and their associated
addresses, ports, preference and weight ratings.
(3) In the case of a legacy DNS resolver or an address list given directly
through /proc/net/afs/cells, create a list containing just a dummy
server record and attach all the addresses to that.
(4) Implement a simple rotation policy, for the moment ignoring the
priorities and weights assigned to the servers.
(5) Show the address list through /proc/net/afs/<cell>/vlservers. This
also displays the source and status of the data as indicated by the
upcall.
Signed-off-by: David Howells <dhowells@redhat.com>
Improve the error handling in FS server rotation by:
(1) Cache the latest useful error value for the fs operation as a whole in
struct afs_fs_cursor separately from the error cached in the
afs_addr_cursor struct. The one in the address cursor gets clobbered
occasionally. Copy over the error to the fs operation only when it's
something we'd be interested in passing to userspace.
(2) Make it so that EDESTADDRREQ is the default that is seen only if no
addresses are available to be accessed.
(3) When calling utility functions, such as checking a volume status or
probing a fileserver, don't let a successful result clobber the cached
error in the cursor; instead, stash the result in a temporary variable
until it has been assessed.
(4) Don't return ETIMEDOUT or ETIME if a better error, such as
ENETUNREACH, is already cached.
(5) On leaving the rotation loop, turn any remote abort code into a more
useful error than ECONNABORTED.
Fixes: d2ddc776a4 ("afs: Overhaul volume and server record caching and fileserver rotation")
Signed-off-by: David Howells <dhowells@redhat.com>
afs_extract_data sets up a temporary iov_iter and passes it to AF_RXRPC
each time it is called to describe the remaining buffer to be filled.
Instead:
(1) Put an iterator in the afs_call struct.
(2) Set the iterator for each marshalling stage to load data into the
appropriate places. A number of convenience functions are provided to
this end (eg. afs_extract_to_buf()).
This iterator is then passed to afs_extract_data().
(3) Use the new ITER_DISCARD iterator to discard any excess data provided
by FetchData.
Signed-off-by: David Howells <dhowells@redhat.com>
Include the site of detection of AFS protocol errors in trace lines to
better be able to determine what went wrong.
Signed-off-by: David Howells <dhowells@redhat.com>
In the iov_iter struct, separate the iterator type from the iterator
direction and use accessor functions to access them in most places.
Convert a bunch of places to use switch-statements to access them rather
then chains of bitwise-AND statements. This makes it easier to add further
iterator types. Also, this can be more efficient as to implement a switch
of small contiguous integers, the compiler can use ~50% fewer compare
instructions than it has to use bitwise-and instructions.
Further, cease passing the iterator type into the iterator setup function.
The iterator function can set that itself. Only the direction is required.
Signed-off-by: David Howells <dhowells@redhat.com>
Use accessor functions to access an iterator's type and direction. This
allows for the possibility of using some other method of determining the
type of iterator than if-chains with bitwise-AND conditions.
Signed-off-by: David Howells <dhowells@redhat.com>
The filehandle has a length which is defined as a 32-bit
"unsigned integer". Change sign of the length appropriately.
Signed-off-by: Frank Sorenson <sorenson@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Pull x86 mm updates from Ingo Molnar:
"Lots of changes in this cycle:
- Lots of CPA (change page attribute) optimizations and related
cleanups (Thomas Gleixner, Peter Zijstra)
- Make lazy TLB mode even lazier (Rik van Riel)
- Fault handler cleanups and improvements (Dave Hansen)
- kdump, vmcore: Enable kdumping encrypted memory with AMD SME
enabled (Lianbo Jiang)
- Clean up VM layout documentation (Baoquan He, Ingo Molnar)
- ... plus misc other fixes and enhancements"
* 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (51 commits)
x86/stackprotector: Remove the call to boot_init_stack_canary() from cpu_startup_entry()
x86/mm: Kill stray kernel fault handling comment
x86/mm: Do not warn about PCI BIOS W+X mappings
resource: Clean it up a bit
resource: Fix find_next_iomem_res() iteration issue
resource: Include resource end in walk_*() interfaces
x86/kexec: Correct KEXEC_BACKUP_SRC_END off-by-one error
x86/mm: Remove spurious fault pkey check
x86/mm/vsyscall: Consider vsyscall page part of user address space
x86/mm: Add vsyscall address helper
x86/mm: Fix exception table comments
x86/mm: Add clarifying comments for user addr space
x86/mm: Break out user address space handling
x86/mm: Break out kernel address space handling
x86/mm: Clarify hardware vs. software "error_code"
x86/mm/tlb: Make lazy TLB mode lazier
x86/mm/tlb: Add freed_tables element to flush_tlb_info
x86/mm/tlb: Add freed_tables argument to flush_tlb_mm_range
smp,cpumask: introduce on_each_cpu_cond_mask
smp: use __cpumask_set_cpu in on_each_cpu_cond
...
Pull locking and misc x86 updates from Ingo Molnar:
"Lots of changes in this cycle - in part because locking/core attracted
a number of related x86 low level work which was easier to handle in a
single tree:
- Linux Kernel Memory Consistency Model updates (Alan Stern, Paul E.
McKenney, Andrea Parri)
- lockdep scalability improvements and micro-optimizations (Waiman
Long)
- rwsem improvements (Waiman Long)
- spinlock micro-optimization (Matthew Wilcox)
- qspinlocks: Provide a liveness guarantee (more fairness) on x86.
(Peter Zijlstra)
- Add support for relative references in jump tables on arm64, x86
and s390 to optimize jump labels (Ard Biesheuvel, Heiko Carstens)
- Be a lot less permissive on weird (kernel address) uaccess faults
on x86: BUG() when uaccess helpers fault on kernel addresses (Jann
Horn)
- macrofy x86 asm statements to un-confuse the GCC inliner. (Nadav
Amit)
- ... and a handful of other smaller changes as well"
* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (57 commits)
locking/lockdep: Make global debug_locks* variables read-mostly
locking/lockdep: Fix debug_locks off performance problem
locking/pvqspinlock: Extend node size when pvqspinlock is configured
locking/qspinlock_stat: Count instances of nested lock slowpaths
locking/qspinlock, x86: Provide liveness guarantee
x86/asm: 'Simplify' GEN_*_RMWcc() macros
locking/qspinlock: Rework some comments
locking/qspinlock: Re-order code
locking/lockdep: Remove duplicated 'lock_class_ops' percpu array
x86/defconfig: Enable CONFIG_USB_XHCI_HCD=y
futex: Replace spin_is_locked() with lockdep
locking/lockdep: Make class->ops a percpu counter and move it under CONFIG_DEBUG_LOCKDEP=y
x86/jump-labels: Macrofy inline assembly code to work around GCC inlining bugs
x86/cpufeature: Macrofy inline assembly code to work around GCC inlining bugs
x86/extable: Macrofy inline assembly code to work around GCC inlining bugs
x86/paravirt: Work around GCC inlining bugs when compiling paravirt ops
x86/bug: Macrofy the BUG table section handling, to work around GCC inlining bugs
x86/alternatives: Macrofy lock prefixes to work around GCC inlining bugs
x86/refcount: Work around GCC inlining bug
x86/objtool: Use asm macros to work around GCC inlining bugs
...
With the preparations all being done this patch now enables authentication
support for UBIFS. Authentication is enabled when the newly introduced
auth_key and auth_hash_name mount options are passed. auth_key provides
the key which is used for authentication whereas auth_hash_name provides
the hashing algorithm used for this FS. Passing these options make
authentication mandatory and only UBIFS images that can be authenticated
with the given key are allowed.
Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>
Signed-off-by: Richard Weinberger <richard@nod.at>
In authenticated mode we cannot fixup the inode sizes in-place
during recovery as this would invalidate the hashes and HMACs
we stored for this inode.
Instead, we just write the updated inodes to the journal. We can
only do this after ubifs_rcvry_gc_commit() is done though, so for
authenticated mode call ubifs_recover_size() after
ubifs_rcvry_gc_commit() and not vice versa as normally done.
Calling ubifs_recover_size() after ubifs_rcvry_gc_commit() has the
drawback that after a commit the size fixup information is gone, so
when a powercut happens while recovering from another powercut
we may lose some data written right before the first powercut.
This is why we only do this in authenticated mode and leave the
behaviour for unauthenticated mode untouched.
Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>
Signed-off-by: Richard Weinberger <richard@nod.at>
This patch calculates the necessary hashes and HMACs for the default
filesystem so that the dynamically created default fs can be
authenticated.
Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>
Signed-off-by: Richard Weinberger <richard@nod.at>
This adds a HMAC covering the super block node and adds the logic that
decides if a filesystem shall be mounted unauthenticated or
authenticated.
Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>
Signed-off-by: Richard Weinberger <richard@nod.at>
During creation of the default filesystem on an empty flash the default
LPT is created. With this patch a hash over the default LPT is
calculated which can be added to the default filesystems master node.
Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>
Signed-off-by: Richard Weinberger <richard@nod.at>
The master node contains hashes over the root index node and the LPT.
This patch adds a HMAC to authenticate the master node itself.
Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>
Signed-off-by: Richard Weinberger <richard@nod.at>
The LPT needs to be authenticated aswell. Since the LPT is only written
during commit it is enough to authenticate the whole LPT with a single
hash which is stored in the master node. Only the leaf nodes (pnodes)
are hashed which makes the implementation much simpler than it would be
to hash the complete LPT.
Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>
Signed-off-by: Richard Weinberger <richard@nod.at>
Make sure that during replay all buds can be authenticated. To do
this we calculate the hash chain until we find an authentication
node and check the HMAC in that node against the current status
of the hash chain.
After a power cut it can happen that some nodes have been written, but
not yet the authentication node for them. These nodes have to be
discarded during replay.
Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>
Signed-off-by: Richard Weinberger <richard@nod.at>
To be able to authenticate the garbage collector journal head add
authentication nodes to the buds the garbage collector creates.
Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>
Signed-off-by: Richard Weinberger <richard@nod.at>
Nodes that are written to flash can only be authenticated through the
index after the next commit. When a journal replay is necessary the
nodes are not yet referenced by the index and thus can't be
authenticated.
This patch overcomes this situation by creating a hash over all nodes
beginning from the commit start node over the reference node(s) and
the buds themselves. From
time to time we insert authentication nodes. Authentication nodes
contain a HMAC from the current hash state, so that they can be
used to authenticate a journal replay up to the point where the
authentication node is. The hash is continued afterwards
so that theoretically we would only have to check the HMAC of
the last authentication node we find.
Overall we get this picture:
,,,,,,,,
,......,...........................................
,. CS , hash1.----. hash2.----.
,. | , . |hmac . |hmac
,. v , . v . v
,.REF#0,-> bud -> bud -> bud.-> auth -> bud -> bud.-> auth ...
,..|...,...........................................
, | ,
, | ,,,,,,,,,,,,,,,
. | hash3,----.
, | , |hmac
, v , v
, REF#1 -> bud -> bud,-> auth ...
,,,|,,,,,,,,,,,,,,,,,,
v
REF#2 -> ...
|
V
...
Note how hash3 covers CS, REF#0 and REF#1 so that it is not possible to
exchange or skip any reference nodes. Unlike the picture suggests the
auth nodes themselves are not hashed.
With this it is possible for an offline attacker to cut each journal
head or to drop the last reference node(s), but not to skip any journal
heads or to reorder any operations.
Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>
Signed-off-by: Richard Weinberger <richard@nod.at>
With this patch the hashes over the index nodes stored in the tree node
cache are written to flash and are checked when read back from flash.
The hash of the root index node is stored in the master node.
During journal replay the hashes are regenerated from the read nodes
and stored in the tree node cache. This means the nodes must previously
be authenticated by other means. This is done in a later patch.
Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>
Signed-off-by: Richard Weinberger <richard@nod.at>
As part of the UBIFS authentication support every branch in the index
gets a hash covering the referenced node. To make that happen the tree
node cache needs hashes over the nodes. This patch adds a hash argument
to ubifs_tnc_add() and ubifs_tnc_add_nm(). The hashes are calculated
from the callers of these functions which actually prepare the nodes.
With this patch all the leaf nodes of the index tree get hashes, but
currently nothing is done with these hashes, this is left for a later
patch.
Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>
Signed-off-by: Richard Weinberger <richard@nod.at>
With authentication support some nodes (master node, super block node)
get a HMAC embedded into them. This patch adds functions to prepare and
write such a node.
The difficulty is that besides the HMAC the nodes also have a CRC which
must stay valid. This means we first have to initialize all fields in
the node, then calculate the HMAC (not covering the CRC) and finally
calculate the CRC.
Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>
Signed-off-by: Richard Weinberger <richard@nod.at>
This patch adds the various helper functions needed for authentication
support. We need functions to hash nodes, to embed HMACs into a node and
to compare hashes and HMACs. Most functions first check if this
filesystem is authenticated and bail out early if not, which makes the
functions safe to be called with disabled authentication.
Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>
Signed-off-by: Richard Weinberger <richard@nod.at>
When adding authentication support we will embed a HMAC into some
nodes. To prepare these nodes we have to first initialize the nodes,
then add a HMAC and finally add a CRC. To accomplish this add separate
ubifs_init_node/ubifs_crc_node functions.
Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>
Signed-off-by: Richard Weinberger <richard@nod.at>
This patch adds the changes to the on disk format needed for
authentication support. We'll add:
* a HMAC covering super block node
* a HMAC covering the master node
* a hash over the root index node to the master node
* a hash over the LPT to the master node
* a flag to the filesystem flag indicating the filesystem is
authenticated
* an authentication node necessary to authenticate the nodes written
to the journal heads while they are written.
* a HMAC of a well known message to the super block node to be able
to check if the correct key is provided
And finally, not visible in this patch, nevertheless explained here:
* hashes over the referenced child nodes in each branch of a index node
Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>
Signed-off-by: Richard Weinberger <richard@nod.at>
The superblock node is read/modified/written several times throughout
the UBIFS code. Instead of reading it from the device each time just
keep a copy in memory and write back the modified copy when necessary.
This patch helps for authentication support, here we not only have to
read the superblock node, but also have to authenticate it, which
is easier if we do it once during initialization.
Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>
Signed-off-by: Richard Weinberger <richard@nod.at>
write_node() is used only once and can easily be replaced with calls
to ubifs_prepare_node()/write_head() which makes the code a bit shorter.
Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>
Signed-off-by: Richard Weinberger <richard@nod.at>
ubifs_lpt_lookup() starts by looking up the nth pnode in the LPT. We
already have this functionality in ubifs_pnode_lookup(). Use this
function rather than open coding its functionality.
Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>
Signed-off-by: Richard Weinberger <richard@nod.at>
ubifs_lpt_lookup could be implemented using pnode_lookup. To make that
possible move pnode_lookup from lpt.c to lpt_commit.c. Rename it to
ubifs_pnode_lookup since it's now exported.
Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>
Signed-off-by: Richard Weinberger <richard@nod.at>
read_znode() takes len, lnum and offs arguments which the caller all
extracts from the same struct ubifs_zbranch *. When adding authentication
support we would have to add a pointer to a hash to the arguments which
is also part of struct ubifs_zbranch. Pass the ubifs_zbranch * instead
so that we do not have to add another argument.
Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>
Signed-off-by: Richard Weinberger <richard@nod.at>
try_read_node() takes len, lnum and offs arguments which the caller all
extracts from the same struct ubifs_zbranch *. When adding authentication
support we would have to add a pointer to a hash to the arguments which
is also part of struct ubifs_zbranch. Pass the ubifs_zbranch * instead
so that we do not have to add another argument.
Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>
Signed-off-by: Richard Weinberger <richard@nod.at>
create_default_filesystem() allocates memory for a node, writes that
node and frees the memory directly afterwards. With this patch we
allocate memory for all nodes at the beginning of the function and
free the memory at the end. This makes it easier to implement
authentication support since with authentication support we'll need
the contents of some nodes when creating other nodes.
Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>
Signed-off-by: Richard Weinberger <richard@nod.at>
This patch does below changes to keep consistence of project quota data
in sudden power-cut case:
- update inode.i_projid and project quota atomically under lock_op() in
f2fs_ioc_setproject()
- recover inode.i_projid and project quota in recover_inode()
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
For journalled quota mode, let checkpoint to flush dquot dirty data
and quota file data to guarntee persistence of all quota sysfile in
last checkpoint, by this way, we can avoid corrupting quota sysfile
when encountering SPO.
The implementation is as below:
1. add a global state SBI_QUOTA_NEED_FLUSH to indicate that there is
cached dquot metadata changes in quota subsystem, and later checkpoint
should:
a) flush dquot metadata into quota file.
b) flush quota file to storage to keep file usage be consistent.
2. add a global state SBI_QUOTA_NEED_REPAIR to indicate that quota
operation failed due to -EIO or -ENOSPC, so later,
a) checkpoint will skip syncing dquot metadata.
b) CP_QUOTA_NEED_FSCK_FLAG will be set in last cp pack to give a
hint for fsck repairing.
3. add a global state SBI_QUOTA_SKIP_FLUSH, in checkpoint, if quota
data updating is very heavy, it may cause hungtask in block_operation().
To avoid this, if our retry time exceed threshold, let's just skip
flushing and retry in next checkpoint().
Signed-off-by: Weichao Guo <guoweichao@huawei.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
[Jaegeuk Kim: avoid warnings and set fsck flag]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Direct IO can be used in case of hardware encryption. The following
scenario results into data corruption issue in this path -
Thread A - Thread B-
-> write file#1 in direct IO
-> GC gets kicked in
-> GC submitted bio on meta mapping
for file#1, but pending completion
-> write file#1 again with new data
in direct IO
-> GC bio gets completed now
-> GC writes old data to the new
location and thus file#1 is
corrupted.
Fix this by submitting and waiting for pending io on meta mapping
for direct IO case in f2fs_map_blocks().
Signed-off-by: Sahitya Tummala <stummala@codeaurora.org>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Testcase to reproduce this bug:
1. mkfs.f2fs /dev/sdd
2. mount -t f2fs /dev/sdd /mnt/f2fs
3. touch /mnt/f2fs/file
4. sync
5. chattr +a /mnt/f2fs/file
6. xfs_io -a /mnt/f2fs/file -c "fsync"
7. godown /mnt/f2fs
8. umount /mnt/f2fs
9. mount -t f2fs /dev/sdd /mnt/f2fs
10. xfs_io /mnt/f2fs/file
There is no error when opening this file w/o O_APPEND, but actually,
we expect the correct result should be:
/mnt/f2fs/file: Operation not permitted
The root cause is, in recover_inode(), we recover inode->i_flags more
than F2FS_I(inode)->i_flags, so fix it.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch changes codes as below:
- use f2fs_set_inode_flags() to update i_flags atomically to avoid
potential race.
- synchronize F2FS_I(inode)->i_flags to inode->i_flags in
f2fs_new_inode().
- use f2fs_set_inode_flags() to simply codes in f2fs_quota_{on,off}.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
We need to drop PG_checked flag on page as well when we clear PG_uptodate
flag, in order to avoid treating the page as GCing one later.
Signed-off-by: Weichao Guo <guoweichao@huawei.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This reverts commit 66110abc4c.
If we clear the cold data flag out of the writeback flow, we can miscount
-1 by end_io, which incurs a deadlock caused by all I/Os being blocked during
heavy GC.
Balancing F2FS Async:
- IO (CP: 1, Data: -1, Flush: ( 0 0 1), Discard: ( ...
GC thread: IRQ
- move_data_page()
- set_page_dirty()
- clear_cold_data()
- f2fs_write_end_io()
- type = WB_DATA_TYPE(page);
here, we get wrong type
- dec_page_count(sbi, type);
- f2fs_wait_on_page_writeback()
Cc: <stable@vger.kernel.org>
Reported-and-Tested-by: Park Ju Hyung <qkrwngud825@gmail.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch adds issued read IO counts which is under block layer.
Chao modified a bit, since:
Below race can cause reversed reference on F2FS_RD_DATA, there is
the same issue in f2fs_submit_page_bio(), fix them by relocate
__submit_bio() and inc_page_count.
Thread A Thread B
- f2fs_write_begin
- f2fs_submit_page_read
- __submit_bio
- f2fs_read_end_io
- __read_end_io
- dec_page_count(, F2FS_RD_DATA)
- inc_page_count(, F2FS_RD_DATA)
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Now, we have supported cgroup writeback, it depends on correctly IO
account of specified filesystem.
But in commit d1b3e72d54 ("f2fs: submit bio of in-place-update pages"),
we split write paths from f2fs_submit_page_mbio() to two:
- f2fs_submit_page_bio() for IPU path
- f2fs_submit_page_bio() for OPU path
But still we account write IO only in f2fs_submit_page_mbio(), result in
incorrect IO account, fix it by adding missing IO account in IPU path.
Fixes: d1b3e72d54 ("f2fs: submit bio of in-place-update pages")
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Below race can cause reversed reference on dirty count, fix it by
relocating __submit_bio() and inc_page_count().
Thread A Thread B
- f2fs_inplace_write_data
- f2fs_submit_page_bio
- __submit_bio
- f2fs_write_end_io
- dec_page_count
- inc_page_count
Cc: <stable@vger.kernel.org>
Fixes: d1b3e72d54 ("f2fs: submit bio of in-place-update pages")
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Core changes:
* Support non-uniform erase size
* Support controllers with limited TX fifo size
Driver changes:
* m25p80: Re-issue a WREN command after each write access
* cadence: Pass a proper dir value to dma_[un]map_single()
* fsl-qspi: Check fsl_qspi_get_seqid() return val make sure 4B
addressing opcodes are properly handled
* intel-spi: Add a new PCI entry for Ice Lake
NAND changes:
Raw NAND core changes:
- Two batchs of cleanups of the NAND API, including:
* Deprecating a lot of interfaces (now replaced by ->exec_op()).
* Moving code in separate drivers (JEDEC, ONFI), in private files
(internals), in platform drivers, etc.
* Functions/structures reordering.
* Exclusive use of the nand_chip structure instead of the MTD one
all across the subsystem.
- Addition of the nand_wait_readrdy/rdy_op() helpers.
Raw NAND controllers drivers changes:
- Various coccinelle patches.
- Marvell:
* Use regmap_update_bits() for syscon access.
* More documentation.
* BCH failure path rework.
* More layouts to be supported.
* IRQ handler complete() condition fixed.
- Fsl_ifc:
* SRAM initialization fixed for newer controller versions.
- Denali:
* Fix licenses mismatch and use a SPDX tag.
* Set SPARE_AREA_SKIP_BYTES register to 8 if unset.
- Qualcomm:
* Do not include dma-direct.h.
- Docg4:
* Removed.
- Ams-delta:
* Use of a GPIO lookup table
* Internal machinery changes.
Raw NAND chip drivers changes:
- Toshiba:
* Add support for Toshiba memory BENAND
* Pass a single nand_chip object to the status helper.
- ESMT:
* New driver to retrieve the ECC requirements from the 5th ID byte.
MTD changes:
* physmap cleanups/fixe
* gpio-addr-flash cleanups/fixes
-----BEGIN PGP SIGNATURE-----
iQJQBAABCgA6FiEEKmCqpbOU668PNA69Ze02AX4ItwAFAlvNmJocHGJvcmlzLmJy
ZXppbGxvbkBib290bGluLmNvbQAKCRBl7TYBfgi3AKr+D/4pH0gYSGDUstZsqUHW
gZsPRU4jmOA+OCRbSxE03bOZOYR0UPgdoUYLfAKhZ5qxc7wHd3b47wykP0kUEviu
lC8QhSs4DUA+EuMVPDVS4FlXRT0e3dMV7jh/IK6nInshD2YkaovyCqa6GvgwFEcM
7BCizxRhtHV8+fo7GVQWuMLb9ZfbEvz42D0sYOu6hIsF1SnRDvHOvfdDUFEXpJoF
a2mC9ove65ChEzc2iZ/r260iZ4aoJYb9XJRJKWxmYeITZSmLNmcrUyGqAaNQ7NRc
AIuPXASbeHGjrIuEfXpKYW07AE5MV1nJFSk3v4u5FjgoohOoobPp7npk+Lz/qwe3
y8/uW9qOQ/iEOsRnkvMGNu4Yjhw41L1a+J5wVvUvzmwHy1xMCrRHYB/8gXoZzemR
A7XmCPwjAFVWv1WeKV2cs5MLW9yZq9QNMtGLlNs5OgFR6CccLk67/tHIkYLsJ/l9
IDXFhd/YKhTeF361u6Iimmgb427TwM71P9N+OMpv4nk46DxurpxkgGle3nslzKNU
DOFNFlMGiPIx3h4X96AKER6u7cxiOXnLGq+XeHa/y8tIrziy3jg03YoTh0RSwKxV
J3EXwh1sFLaeebwWEgpE3Ag1LOxpRCqJ2ED71SPzS/DR/938HBVsEoPqUEHNf1in
jwsUB3cRzNwf+XX68DRCVT5GFA==
=7w4w
-----END PGP SIGNATURE-----
Merge tag 'mtd/for-4.20' of git://git.infradead.org/linux-mtd
Pull mtd updates from Boris Brezillon:
"SPI NOR core changes:
- Support non-uniform erase size
- Support controllers with limited TX fifo size
Driver changes:
- m25p80: Re-issue a WREN command after each write access
- cadence: Pass a proper dir value to dma_[un]map_single()
- fsl-qspi: Check fsl_qspi_get_seqid() return val make sure 4B
addressing opcodes are properly handled
- intel-spi: Add a new PCI entry for Ice Lake
Raw NAND core changes:
- Two batchs of cleanups of the NAND API, including:
* Deprecating a lot of interfaces (now replaced by ->exec_op()).
* Moving code in separate drivers (JEDEC, ONFI), in private files
(internals), in platform drivers, etc.
* Functions/structures reordering.
* Exclusive use of the nand_chip structure instead of the MTD one
all across the subsystem.
- Addition of the nand_wait_readrdy/rdy_op() helpers.
Raw NAND controllers drivers changes:
- Various coccinelle patches.
- Marvell:
* Use regmap_update_bits() for syscon access.
* More documentation.
* BCH failure path rework.
* More layouts to be supported.
* IRQ handler complete() condition fixed.
- Fsl_ifc:
* SRAM initialization fixed for newer controller versions.
- Denali:
* Fix licenses mismatch and use a SPDX tag.
* Set SPARE_AREA_SKIP_BYTES register to 8 if unset.
- Qualcomm:
* Do not include dma-direct.h.
- Docg4:
* Removed.
- Ams-delta:
* Use of a GPIO lookup table
* Internal machinery changes.
Raw NAND chip drivers changes:
- Toshiba:
* Add support for Toshiba memory BENAND
* Pass a single nand_chip object to the status helper.
- ESMT:
* New driver to retrieve the ECC requirements from the 5th ID
byte.
MTD changes:
- physmap cleanups/fixe
- gpio-addr-flash cleanups/fixes"
* tag 'mtd/for-4.20' of git://git.infradead.org/linux-mtd: (93 commits)
jffs2: free jffs2_sb_info through jffs2_kill_sb()
mtd: spi-nor: fsl-quadspi: fix read error for flash size larger than 16MB
mtd: spi-nor: intel-spi: Add support for Intel Ice Lake SPI serial flash
mtd: maps: gpio-addr-flash: Convert to gpiod
mtd: maps: gpio-addr-flash: Replace array with an integer
mtd: maps: gpio-addr-flash: Use order instead of size
mtd: spi-nor: fsl-quadspi: Don't let -EINVAL on the bus
mtd: devices: m25p80: Make sure WRITE_EN is issued before each write
mtd: spi-nor: Support controllers with limited TX FIFO size
mtd: spi-nor: cadence-quadspi: Use proper enum for dma_[un]map_single
mtd: spi-nor: parse SFDP Sector Map Parameter Table
mtd: spi-nor: add support to non-uniform SFDP SPI NOR flash memories
mtd: rawnand: marvell: fix the IRQ handler complete() condition
mtd: rawnand: denali: set SPARE_AREA_SKIP_BYTES register to 8 if unset
mtd: rawnand: r852: fix spelling mistake "card_registred" -> "card_registered"
mtd: rawnand: toshiba: Pass a single nand_chip object to the status helper
mtd: maps: gpio-addr-flash: Use devm_* functions
mtd: maps: gpio-addr-flash: Fix ioremapped size
mtd: maps: gpio-addr-flash: Replace custom printk
mtd: physmap_of: Release resources on error
...
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAlvNQKgQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgps+8D/9Iy6YIeoPwN10gYsqIh0P2fS3wKzL3kiww
3vFsWO78PzgLxUlNmB7teLtNFc/R5mi8becZmAdvs9za5YFZk56o3Ifv1x+e+z00
VY1/gxhiJD6suLeJ6lECnERGDaiWOZVRMo2TE17vxYGW6GGaa0Ts6PUUXmpla1u5
WKctgt0Qv9WVNyiIdLdeHqzKJwsSSwNTt8fK7eFhy3x8e0CwJr+GtXckbbW3LFkY
lug0npsTli3EmEPMovZhd25SjZmTk5GTM+ADZQ7Tnv5KXoDWB9jn6TcCSAi3G+5d
5WUVwfnDyYJiH8qvlg5tRJ690muIy3xMOmpr7QBQ0YnR/LQ3EW+1CVfqD+qimgLH
TXzlREXQpBP3YlxSDS5nddz4o5z84GZmC9B/43ujPaZKIQ6eBXYdkmQH7tPtSugm
C6VGomR5tHotjxIiAsexh/5hAus+wW8bObKGTPTyINT0ub3XNclwCKLh26CgI9ie
WvbS9g3j/KPvu/7s6weZpgD+cks0YdWe/XdXXxiHwsGI9h3J2aJna5RQt1rKWDm5
wGCgbc/B8eSwiWx+GXlqdB9/Dy/bGXOnSTDnKpEVl1f5zNjeLwUKXbjvkMefWs4m
jEIcquuDETORY+ZYEfa5YbmS4Lhskr0kzMVTVkZ++81tAWpSCU9Xh3IHrR8TNpt+
J0oh0FHBDg==
=LRTT
-----END PGP SIGNATURE-----
Merge tag 'for-4.20/block-20181021' of git://git.kernel.dk/linux-block
Pull block layer updates from Jens Axboe:
"This is the main pull request for block changes for 4.20. This
contains:
- Series enabling runtime PM for blk-mq (Bart).
- Two pull requests from Christoph for NVMe, with items such as;
- Better AEN tracking
- Multipath improvements
- RDMA fixes
- Rework of FC for target removal
- Fixes for issues identified by static checkers
- Fabric cleanups, as prep for TCP transport
- Various cleanups and bug fixes
- Block merging cleanups (Christoph)
- Conversion of drivers to generic DMA mapping API (Christoph)
- Series fixing ref count issues with blkcg (Dennis)
- Series improving BFQ heuristics (Paolo, et al)
- Series improving heuristics for the Kyber IO scheduler (Omar)
- Removal of dangerous bio_rewind_iter() API (Ming)
- Apply single queue IPI redirection logic to blk-mq (Ming)
- Set of fixes and improvements for bcache (Coly et al)
- Series closing a hotplug race with sysfs group attributes (Hannes)
- Set of patches for lightnvm:
- pblk trace support (Hans)
- SPDX license header update (Javier)
- Tons of refactoring patches to cleanly abstract the 1.2 and 2.0
specs behind a common core interface. (Javier, Matias)
- Enable pblk to use a common interface to retrieve chunk metadata
(Matias)
- Bug fixes (Various)
- Set of fixes and updates to the blk IO latency target (Josef)
- blk-mq queue number updates fixes (Jianchao)
- Convert a bunch of drivers from the old legacy IO interface to
blk-mq. This will conclude with the removal of the legacy IO
interface itself in 4.21, with the rest of the drivers (me, Omar)
- Removal of the DAC960 driver. The SCSI tree will introduce two
replacement drivers for this (Hannes)"
* tag 'for-4.20/block-20181021' of git://git.kernel.dk/linux-block: (204 commits)
block: setup bounce bio_sets properly
blkcg: reassociate bios when make_request() is called recursively
blkcg: fix edge case for blk_get_rl() under memory pressure
nvme-fabrics: move controller options matching to fabrics
nvme-rdma: always have a valid trsvcid
mtip32xx: fully switch to the generic DMA API
rsxx: switch to the generic DMA API
umem: switch to the generic DMA API
sx8: switch to the generic DMA API
sx8: remove dead IF_64BIT_DMA_IS_POSSIBLE code
skd: switch to the generic DMA API
ubd: remove use of blk_rq_map_sg
nvme-pci: remove duplicate check
drivers/block: Remove DAC960 driver
nvme-pci: fix hot removal during error handling
nvmet-fcloop: suppress a compiler warning
nvme-core: make implicit seed truncation explicit
nvmet-fc: fix kernel-doc headers
nvme-fc: rework the request initialization code
nvme-fc: introduce struct nvme_fcp_op_w_sgl
...
When ramoops reserved a memory region in the kernel, it had an unhelpful
label of "persistent_memory". When reading /proc/iomem, it would be
repeated many times, did not hint that it was ramoops in particular,
and didn't clarify very much about what each was used for:
400000000-407ffffff : Persistent Memory (legacy)
400000000-400000fff : persistent_memory
400001000-400001fff : persistent_memory
...
4000ff000-4000fffff : persistent_memory
Instead, this adds meaningful labels for how the various regions are
being used:
400000000-407ffffff : Persistent Memory (legacy)
400000000-400000fff : ramoops:dump(0/252)
400001000-400001fff : ramoops:dump(1/252)
...
4000fc000-4000fcfff : ramoops:dump(252/252)
4000fd000-4000fdfff : ramoops:console
4000fe000-4000fe3ff : ramoops:ftrace(0/3)
4000fe400-4000fe7ff : ramoops:ftrace(1/3)
4000fe800-4000febff : ramoops:ftrace(2/3)
4000fec00-4000fefff : ramoops:ftrace(3/3)
4000ff000-4000fffff : ramoops:pmsg
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Tested-by: Sai Prakash Ranjan <saiprakash.ranjan@codeaurora.org>
Tested-by: Guenter Roeck <groeck@chromium.org>
This refactors compression initialization slightly to better handle
getting potentially called twice (via early pstore_register() calls
and later pstore_init()) and improves the comments and reporting to be
more verbose.
Signed-off-by: Kees Cook <keescook@chromium.org>
Tested-by: Guenter Roeck <groeck@chromium.org>
ramoops's call of pstore_register() was recently moved to run during
late_initcall() because the crypto backend may not have been ready during
postcore_initcall(). This meant early-boot crash dumps were not getting
caught by pstore any more.
Instead, lets allow calls to pstore_register() earlier, and once crypto
is ready we can initialize the compression.
Reported-by: Sai Prakash Ranjan <saiprakash.ranjan@codeaurora.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Tested-by: Sai Prakash Ranjan <saiprakash.ranjan@codeaurora.org>
Fixes: cb3bee0369 ("pstore: Use crypto compress API")
[kees: trivial rebase]
Signed-off-by: Kees Cook <keescook@chromium.org>
Tested-by: Guenter Roeck <groeck@chromium.org>
In preparation for having additional actions during init/exit, this moves
the init/exit into platform.c, centralizing the logic to make call outs
to the fs init/exit.
Signed-off-by: Kees Cook <keescook@chromium.org>
Tested-by: Guenter Roeck <groeck@chromium.org>
Add a new mount option 'nocopyfrom' that will prevent the usage of the
RADOS 'copy-from' operation in cephfs. This could be useful, for example,
for an administrator to temporarily mitigate any possible bugs in the
'copy-from' implementation.
Currently, only copy_file_range uses this RADOS operation. Setting this
mount option will result in this syscall reverting to the default VFS
implementation, i.e. to perform the copies locally instead of doing remote
object copies.
Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
This commit implements support for the copy_file_range syscall in cephfs.
It is implemented using the RADOS 'copy-from' operation, which allows to
do a remote object copy, without the need to download/upload data from/to
the OSDs.
Some manual copy may however be required if the source/destination file
offsets aren't object aligned or if the copy length is smaller than the
object size.
Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
ceph_try_get_caps currently calls try_get_cap_refs with the nonblock
parameter always set to 'true'. This change adds a new parameter that
allows to set it's value. This will be useful for a follow-up patch that
will need to get two sets of capabilities for two different inodes without
risking a deadlock.
Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Currently message data items are allocated with ceph_msg_data_create()
in setup_request_data() inside send_request(). send_request() has never
been allowed to fail, so each allocation is followed by a BUG_ON:
data = ceph_msg_data_create(...);
BUG_ON(!data);
It's been this way since support for multiple message data items was
added in commit 6644ed7b7e ("libceph: make message data be a pointer")
in 3.10.
There is no reason to delay the allocation of message data items until
the last possible moment and we certainly don't need a linked list of
them as they are only ever appended to the end and never erased. Make
ceph_msg_new2() take max_data_items and adapt the rest of the code.
Reported-by: Jerry Lee <leisurelysw24@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
The current requirement is that ceph_osdc_alloc_messages() should be
called after oid and oloc are known. In preparation for preallocating
message data items, move ceph_osdc_alloc_messages() further down, so
that it is called when OSD op codes are known.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
__cap_delay_requeue could be invoked through ceph_check_caps when there
exists caps that needs to be sent and are delayed by "i_hold_caps_min"
or "i_hold_caps_max". If __cap_delay_requeue sets timeout unconditionally,
there could be a chance that some "wanted" caps can not be release for a
long since their timeouts are reset every time they get delayed.
Fixes: http://tracker.ceph.com/issues/36369
Signed-off-by: Xuehan Xu <xuxuehan@360.cn>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Because send_mds_reconnect() wants to send a message with a pagelist
and pass the ownership to the messenger, ceph_msg_data_add_pagelist()
consumes a ref which is then put in ceph_msg_data_destroy(). This
makes managing pagelists in the OSD client (where they are wrapped in
ceph_osd_data) unnecessarily hard because the handoff only happens in
ceph_osdc_start_request() instead of when the pagelist is passed to
ceph_osd_data_pagelist_init(). I counted several memory leaks on
various error paths.
Fix up ceph_msg_data_add_pagelist() and carry a pagelist ref in
ceph_osd_data.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
struct ceph_pagelist cannot be embedded into anything else because it
has its own refcount. Merge allocation and initialization together.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Current implementation of cephfs fallocate isn't correct as it doesn't
really reserve the space in the cluster, which means that a subsequent
call to a write may actually fail due to lack of space. In fact, it is
currently possible to fallocate an amount space that is larger than the
free space in the cluster. It has behaved this way since the initial
commit ad7a60de88 ("ceph: punch hole support").
Since there's no easy solution to fix this at the moment, this patch
simply removes support for all fallocate operations but
FALLOC_FL_PUNCH_HOLE (which implies FALLOC_FL_KEEP_SIZE).
Link: https://tracker.ceph.com/issues/36317
Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Avoid allocating memory for the entire user request: striped_read()
does a synchronous OSD request per object, so it doesn't need more than
object size worth of pages at a time.
[ Preserve the comment, changelog. ]
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
This reverts commit 8b8f53af1e.
splice_dentry() is used by three places. For two places, req->r_dentry
is passed to splice_dentry(). In the case of error, req->r_dentry does
not get updated. So splice_dentry() should not drop reference.
Cc: stable@vger.kernel.org # 4.18+
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Do the snap check first in ceph_set_acl(), so we can avoid
unnecessary operations when the inode has snap.
Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
__cap_delay_requeue() only requeue inode which does not
have CEPH_I_FLUSH flag, so avoid reset cap hold timeout
for that inode.
Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
With no more radix tree API users left, we can drop the GFP flags
and use xa_init() instead of INIT_RADIX_TREE().
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Instead of using a pagevec, just use the XArray iterators. Add a
conditional rescheduling point which probably should have been there in
the original.
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Add some XArray-based helper functions to replace the radix tree based
metaphors currently in use. The biggest change is that converted code
doesn't see its own lock bit; get_unlocked_entry() always returns an
entry with the lock bit clear. So we don't have to mess around loading
the current entry and clearing the lock bit; we can just store the
unlocked entry that we already have.
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Since the XArray is embedded in the struct address_space, its address
contains exactly as much entropy as the address of the mapping. This
patch is purely preparatory for later patches which will simplify the
wait/wake interfaces.
Signed-off-by: Matthew Wilcox <willy@infradead.org>
This is close to a 1:1 replacement of radix tree APIs with their XArray
equivalents. It would be possible to optimise nilfs_copy_back_pages(),
but that doesn't seem to be in the performance path. Also, I think
it has a pre-existing bug, and I've added a note to that effect in the
source code.
Signed-off-by: Matthew Wilcox <willy@infradead.org>
The page cache offers the ability to search for a miss in the previous or
next N locations. Rather than teach the XArray about the page cache's
definition of a miss, use xas_prev() and xas_next() to search the page
array. This should be more efficient as it does not have to start the
lookup from the top for each index.
Signed-off-by: Matthew Wilcox <willy@infradead.org>
net/sched/cls_api.c has overlapping changes to a call to
nlmsg_parse(), one (from 'net') added rtm_tca_policy instead of NULL
to the 5th argument, and another (from 'net-next') added cb->extack
instead of NULL to the 6th argument.
net/ipv4/ipmr_base.c is a case of a bug fix in 'net' being done to
code which moved (to mr_table_dump)) in 'net-next'. Thanks to David
Ahern for the heads up.
Signed-off-by: David S. Miller <davem@davemloft.net>
At inode.c:evict_inode_truncate_pages(), when we iterate over the
inode's extent states, we access an extent state record's "state" field
after we unlocked the inode's io tree lock. This can lead to a
use-after-free issue because after we unlock the io tree that extent
state record might have been freed due to being merged into another
adjacent extent state record (a previous inflight bio for a read
operation finished in the meanwhile which unlocked a range in the io
tree and cause a merge of extent state records, as explained in the
comment before the while loop added in commit 6ca0709756 ("Btrfs: fix
hang during inode eviction due to concurrent readahead")).
Fix this by keeping a copy of the extent state's flags in a local
variable and using it after unlocking the io tree.
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=201189
Fixes: b9d0b38928 ("btrfs: Add handler for invalidate page")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This could result in a really bad case where we do something like
evict
evict_refill_and_join
btrfs_commit_transaction
btrfs_run_delayed_iputs
evict
evict_refill_and_join
btrfs_commit_transaction
... forever
We have plenty of other places where we run delayed iputs that are much
safer, let those do the work.
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We were not handling the reserved byte accounting properly for data
references. Metadata was fine, if it errored out the error paths would
free the bytes_reserved count and pin the extent, but it even missed one
of the error cases. So instead move this handling up into
run_one_delayed_ref so we are sure that both cases are properly cleaned
up in case of a transaction abort.
CC: stable@vger.kernel.org # 4.18+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When we insert the file extent once the ordered extent completes we free
the reserved extent reservation as it'll have been migrated to the
bytes_used counter. However if we error out after this step we'll still
clear the reserved extent reservation, resulting in a negative
accounting of the reserved bytes for the block group and space info.
Fix this by only doing the free if we didn't successfully insert a file
extent for this extent.
CC: stable@vger.kernel.org # 4.14+
Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
max_extent_size is supposed to be the largest contiguous range for the
space info, and ctl->free_space is the total free space in the block
group. We need to keep track of these separately and _only_ use the
max_free_space if we don't have a max_extent_size, as that means our
original request was too large to search any of the block groups for and
therefore wouldn't have a max_extent_size set.
CC: stable@vger.kernel.org # 4.14+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We can't use entry->bytes if our entry is a bitmap entry, we need to use
entry->max_extent_size in that case. Fix up all the logic to make this
consistent.
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we use up our block group before allocating a new one we'll easily
get a max_extent_size that's set really really low, which will result in
a lot of fragmentation. We need to make sure we're resetting the
max_extent_size when we add a new chunk or add new space.
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
- Two batchs of cleanups of the NAND API, including:
* Deprecating a lot of interfaces (now replaced by ->exec_op()).
* Moving code in separate drivers (JEDEC, ONFI), in private files
(internals), in platform drivers, etc.
* Functions/structures reordering.
* Exclusive use of the nand_chip structure instead of the MTD one
all across the subsystem.
- Addition of the nand_wait_readrdy/rdy_op() helpers.
Raw NAND controllers drivers changes:
- Various coccinelle patches.
- Marvell:
* Use regmap_update_bits() for syscon access.
* More documentation.
* BCH failure path rework.
* More layouts to be supported.
* IRQ handler complete() condition fixed.
- Fsl_ifc:
* SRAM initialization fixed for newer controller versions.
- Denali:
* Fix licenses mismatch and use a SPDX tag.
* Set SPARE_AREA_SKIP_BYTES register to 8 if unset.
- Qualcomm:
* Do not include dma-direct.h.
- Docg4:
* Removed.
- Ams-delta:
* Use of a GPIO lookup table
* Internal machinery changes.
Raw NAND chip drivers changes:
- Toshiba:
* Add support for Toshiba memory BENAND
* Pass a single nand_chip object to the status helper.
- ESMT:
* New driver to retrieve the ECC requirements from the 5th ID byte.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCgAdFiEE9HuaYnbmDhq/XIDIJWrqGEe9VoQFAlvDdvgACgkQJWrqGEe9
VoQ0zQf/QfXYjLDbC/1FXckEkEeVZPRbXQMf+ptdTo0inJefy9LMOvhRKmuABCxn
wwPa99aqQ+aHESnU/s2d8FUrhNRUf35lmRIAP512lJk79tBfgo73l5WSc6UTg6hO
w3KbigOvTi7wCKQYGDFOBZuoFFCvpNsIEtWFuodCbYG1GmeAaL7L2fhbri9JbcE2
zPqmCv0w5BUrPR7i/taI5gD+KDtRdmCEyfb7wQoJy2XZVO0SKqGsBhDhYWqwCpck
eP2UzQ4cFT8X1S0/AfUpbowc6cftdqIwW9VRY4inx3ALQ+sKxrkJ+0E3gL9lnZ+F
p/is0gTdvDmnOjXfwGcL7GXGy0nN8Q==
=EKv1
-----END PGP SIGNATURE-----
Merge tag 'nand/for-4.20' of git://git.infradead.org/linux-mtd into mtd/next
NAND core changes:
- Two batchs of cleanups of the NAND API, including:
* Deprecating a lot of interfaces (now replaced by ->exec_op()).
* Moving code in separate drivers (JEDEC, ONFI), in private files
(internals), in platform drivers, etc.
* Functions/structures reordering.
* Exclusive use of the nand_chip structure instead of the MTD one
all across the subsystem.
- Addition of the nand_wait_readrdy/rdy_op() helpers.
Raw NAND controllers drivers changes:
- Various coccinelle patches.
- Marvell:
* Use regmap_update_bits() for syscon access.
* More documentation.
* BCH failure path rework.
* More layouts to be supported.
* IRQ handler complete() condition fixed.
- Fsl_ifc:
* SRAM initialization fixed for newer controller versions.
- Denali:
* Fix licenses mismatch and use a SPDX tag.
* Set SPARE_AREA_SKIP_BYTES register to 8 if unset.
- Qualcomm:
* Do not include dma-direct.h.
- Docg4:
* Removed.
- Ams-delta:
* Use of a GPIO lookup table
* Internal machinery changes.
Raw NAND chip drivers changes:
- Toshiba:
* Add support for Toshiba memory BENAND
* Pass a single nand_chip object to the status helper.
- ESMT:
* New driver to retrieve the ECC requirements from the 5th ID byte.
We don't need to call this in the direct, read, or pnfs resend paths and
the only other caller is the write path in nfs_page_async_flush() which
already checks and sets the pg_error on the context.
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
We must check pg_error and call error_cleanup after any call to pg_doio.
Currently, we are skipping the unlock of a page if we encounter an error in
nfs_pageio_complete() before handing off the work to the RPC layer.
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
If a slab cache object is allocated, it needs to be freed eventually,
certainly before anyone unloads the module that allocated it.
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
fscache_set_key() can incur an out-of-bounds read, reported by KASAN:
BUG: KASAN: slab-out-of-bounds in fscache_alloc_cookie+0x5b3/0x680 [fscache]
Read of size 4 at addr ffff88084ff056d4 by task mount.nfs/32615
and also reported by syzbot at https://lkml.org/lkml/2018/7/8/236
BUG: KASAN: slab-out-of-bounds in fscache_set_key fs/fscache/cookie.c:120 [inline]
BUG: KASAN: slab-out-of-bounds in fscache_alloc_cookie+0x7a9/0x880 fs/fscache/cookie.c:171
Read of size 4 at addr ffff8801d3cc8bb4 by task syz-executor907/4466
This happens for any index_key_len which is not divisible by 4 and is
larger than the size of the inline key, because the code allocates exactly
index_key_len for the key buffer, but the hashing loop is stepping through
it 4 bytes (u32) at a time in the buf[] array.
Fix this by calculating how many u32 buffers we'll need by using
DIV_ROUND_UP, and then using kcalloc() to allocate a precleared allocation
buffer to hold the index_key, then using that same count as the hashing
index limit.
Fixes: ec0328e46d ("fscache: Maintain a catalogue of allocated cookies")
Reported-by: syzbot+a95b989b2dde8e806af8@syzkaller.appspotmail.com
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Cc: stable <stable@vger.kernel.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The inline key in struct rxrpc_cookie is insufficiently initialized,
zeroing only 3 of the 4 slots, therefore an index_key_len between 13 and 15
bytes will end up hashing uninitialized memory because the memcpy only
partially fills the last buf[] element.
Fix this by clearing fscache_cookie objects on allocation rather than using
the slab constructor to initialise them. We're going to pretty much fill
in the entire struct anyway, so bringing it into our dcache writably
shouldn't incur much overhead.
This removes the need to do clearance in fscache_set_key() (where we aren't
doing it correctly anyway).
Also, we don't need to set cookie->key_len in fscache_set_key() as we
already did it in the only caller, so remove that.
Fixes: ec0328e46d ("fscache: Maintain a catalogue of allocated cookies")
Reported-by: syzbot+a95b989b2dde8e806af8@syzkaller.appspotmail.com
Reported-by: Eric Sandeen <sandeen@redhat.com>
Cc: stable <stable@vger.kernel.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
the victim might've been rmdir'ed just before the lock_rename();
unlike the normal callers, we do not look the source up after the
parents are locked - we know it beforehand and just recheck that it's
still the child of what used to be its parent. Unfortunately,
the check is too weak - we don't spot a dead directory since its
->d_parent is unchanged, dentry is positive, etc. So we sail all
the way to ->rename(), with hosting filesystems _not_ expecting
to be asked renaming an rmdir'ed subdirectory.
The fix is easy, fortunately - the lock on parent is sufficient for
making IS_DEADDIR() on child safe.
Cc: stable@vger.kernel.org
Fixes: 9ae326a690 (CacheFiles: A cache that backs onto a mounted filesystem)
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
In the typical unmount case, the AIL is forced out by the unmount
sequence before the xfsaild task is stopped. Since AIL items are
removed on writeback completion, this means that the AIL
->ail_buf_list delwri queue has been drained. This is not always
true in the shutdown case, however.
It's possible for buffers to sit on a delwri queue for a period of
time across submission attempts if said items are locked or have
been relogged and pinned since first added to the queue. If the
attempt to log such an item results in a log I/O error, the error
processing can shutdown the fs, remove the item from the AIL, stale
the buffer (dropping the LRU reference) and clear its delwri queue
state. The latter bit means the buffer will be released from a
delwri queue on the next submission attempt, but this might never
occur if the filesystem has shutdown and the AIL is empty.
This means that such buffers are held indefinitely by the AIL delwri
queue across destruction of the AIL. Aside from being a memory leak,
these buffers can also hold references to in-core perag structures.
The latter problem manifests as a generic/475 failure, reproducing
the following asserts at unmount time:
XFS: Assertion failed: atomic_read(&pag->pag_ref) == 0,
file: fs/xfs/xfs_mount.c, line: 151
XFS: Assertion failed: atomic_read(&pag->pag_ref) == 0,
file: fs/xfs/xfs_mount.c, line: 132
To prevent this problem, clear the AIL delwri queue as a final step
before xfsaild() exit. The !empty state should never occur in the
normal case, so add an assert to catch unexpected problems going
forward.
[dgc: add comment explaining need for xfs_buf_delwri_cancel() after
calling xfs_buf_delwri_submit_nowait().]
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Most offset macro mess is used in xfs_stats_format() only, and we can
simply get the right offsets using offsetof(), instead of several macros
to mark the offsets inside __xfsstats structure.
Replace all XFSSTAT_END_* macros by a single helper macro to get the
right offset into __xfsstats, and use this helper in xfs_stats_format()
directly.
The quota stats code, still looks a bit cleaner when using XFSSTAT_*
macros, so, this patch also defines XFSSTAT_START_XQMSTAT and
XFSSTAT_END_XQMSTAT locally to that code. This also should prevent
offset mistakes when updates are done into __xfsstats.
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
The addition of FIBT, RMAP and REFCOUNT changed the offsets into
__xfssats structure.
This caused xqmstat_proc_show() to display garbage data via
/proc/fs/xfs/xqmstat, once it relies on the offsets marked via macros.
Fix it.
Fixes: 00f4e4f9 xfs: add rmap btree stats infrastructure
Fixes: aafc3c24 xfs: support the XFS_BTNUM_FINOBT free inode btree type
Fixes: 46eeb521 xfs: introduce refcount btree definitions
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
When looking at a 4.18 based KASAN use after free report, I noticed
that racing xfs_buf_rele() may race on dropping the last reference
to the buffer and taking the buffer lock. This was the symptom
displayed by the KASAN report, but the actual issue that was
reported had already been fixed in 4.19-rc1 by commit e339dd8d8b
("xfs: use sync buffer I/O for sync delwri queue submission").
Despite this, I think there is still an issue with xfs_buf_rele()
in this code:
release = atomic_dec_and_lock(&bp->b_hold, &pag->pag_buf_lock);
spin_lock(&bp->b_lock);
if (!release) {
.....
If two threads race on the b_lock after both dropping a reference
and one getting dropping the last reference so release = true, we
end up with:
CPU 0 CPU 1
atomic_dec_and_lock()
atomic_dec_and_lock()
spin_lock(&bp->b_lock)
spin_lock(&bp->b_lock)
<spins>
<release = true bp->b_lru_ref = 0>
<remove from lists>
freebuf = true
spin_unlock(&bp->b_lock)
xfs_buf_free(bp)
<gets lock, reading and writing freed memory>
<accesses freed memory>
spin_unlock(&bp->b_lock) <reads/writes freed memory>
IOWs, we can't safely take bp->b_lock after dropping the hold
reference because the buffer may go away at any time after we
drop that reference. However, this can be fixed simply by taking the
bp->b_lock before we drop the reference.
It is safe to nest the pag_buf_lock inside bp->b_lock as the
pag_buf_lock is only used to serialise against lookup in
xfs_buf_find() and no other locks are held over or under the
pag_buf_lock there. Make this clear by documenting the buffer lock
orders at the top of the file.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com
Signed-off-by: Dave Chinner <david@fromorbit.com>
This patch adds xfs_attr_remove_args. These sub-routines remove
the attributes specified in @args. We will use this later for setting
parent pointers as a deferred attribute operation.
Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
This patch adds xfs_attr_set_args and xfs_bmap_set_attrforkoff.
These sub-routines set the attributes specified in @args.
We will use this later for setting parent pointers as a deferred
attribute operation.
[dgc: remove attr fork init code from xfs_attr_set_args().]
[dgc: xfs_attr_try_sf_addname() NULLs args.trans after commit.]
[dgc: correct sf add error handling.]
Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
This patch adds a subroutine xfs_attr_try_sf_addname
used by xfs_attr_set. This subrotine will attempt to
add the attribute name specified in args in shortform,
as well and perform error handling previously done in
xfs_attr_set.
This patch helps to pre-simplify xfs_attr_set for reviewing
purposes and reduce indentation. New function will be added
in the next patch.
[dgc: moved commit to helper function, too.]
Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
This patch moves fs/xfs/xfs_attr.h to fs/xfs/libxfs/xfs_attr.h
since xfs_attr.c is in libxfs. We will need these later in
xfsprogs.
Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
The kernel only issues a log message that it's been shut down when
the filesystem triggers a shutdown itself. Hence there is no trace
in the log when a shutdown is triggered manually from userspace.
This can make it hard to see sequence of events in the log when
things go wrong, so make sure we always log a message when a
shutdown is run.
While there, clean up the logic flow so we don't have to continually
check if the shutdown trigger was user initiated before logging
shutdown messages.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
We don't handle buffer state properly in online repair's findroot
routine. If a buffer already has b_ops set, we don't ever want to touch
that, and we don't want to call the read verifiers on a buffer that
could be dirty (CRCs are only recomputed during log checkpoints).
Therefore, be more careful about what we do with a buffer -- if someone
else already attached ops that are not the ones for this btree type,
just ignore the buffer. We only attach our btree type's buf ops if it
matches the magic/uuid and structure checks.
We also modify xfs_buf_read_map to allow callers to set buffer ops on a
DONE buffer with NULL ops so that repair doesn't leave behind buffers
which won't have buffers attached to them.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
If a caller supplies buffer ops when trying to read a buffer and the
buffer doesn't already have buf ops assigned, ensure that the ops are
assigned to the buffer and the verifier is run on that buffer.
Note that current XFS code is careful to assign buffer ops after a
xfs_{trans_,}buf_read call in which ops were not supplied. However, we
should apply ops defensively in case there is ever a coding mistake; and
an upcoming repair patch will need to be able to read a buffer without
assigning buf ops.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
In xrep_findroot_block, if we find a candidate root block with sibling
pointers or sibling blocks on the same tree level, we should not return
that block as a tree root because root blocks cannot have siblings.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Needed by userspace programs that call fstatfs().
It'd be natural to publish XFS_SB_MAGIC in uapi, but while these two
have identical values, they have different semantic meaning: one is
an enum cookie meant for statfs, the other a signature of the
on-disk format.
Signed-off-by: Adam Borowski <kilobyte@angband.pl>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Instead of just asserting that we have no delalloc space dangling
in an inode that gets freed print the actual offenders for debug
mode.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
We should want to write directly into the data fork for blocks that don't
have an extent in the COW fork covering them yet.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
We only need to allocate blocks for zeroing for reflink inodes,
and for we currently have a special case for reflink files in
the otherwise direct I/O path that I'd like to get rid of.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
The option to enable unwritten extents was made default in 2003,
removed from mkfs in 2007, and cannot be disabled in v5. We also
rely on it for a lot of common functionality, so filesystems without
it will run a completely untested and buggy code path. Enabling the
support also is a simple bit flip using xfs_db, so legacy file
systems can still be brought forward.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
The invalid state isn't any different from a hole, so merge the two
states. Use the more descriptive hole name, but keep it as the first
value of the enum to catch uninitialized fields.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
There are some cases can cause memory leak when parsing
option 'osdname'.
Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Assigning value -EINVAL to "retval" here, but that stored value is
overwritten before it can be used.
retval = -EINVAL;
....
retval = rw_verify_area(WRITE, out.file, &out_pos, count);
value_overwrite: Overwriting previous write to "retval" with value
from rw_verify_area
delete invalid assignment statements
Signed-off-by: n00202754 <nixiaoming@huawei.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Right now we return EINVAL if a process does not have permission to dedupe a
file. This was an oversight on my part. EPERM gives a true description of
the nature of our error, and EINVAL is already used for the case that the
filesystem does not support dedupe.
Signed-off-by: Mark Fasheh <mfasheh@suse.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Acked-by: David Sterba <dsterba@suse.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The permission check in vfs_dedupe_file_range_one() is too coarse - We only
allow dedupe of the destination file if the user is root, or they have the
file open for write.
This effectively limits a non-root user from deduping their own read-only
files. In addition, the write file descriptor that the user is forced to
hold open can prevent execution of files. As file data during a dedupe
does not change, the behavior is unexpected and this has caused a number of
issue reports. For an example, see:
https://github.com/markfasheh/duperemove/issues/129
So change the check so we allow dedupe on the target if:
- the root or admin is asking for it
- the process has write access
- the owner of the file is asking for the dedupe
- the process could get write access
That way users can open read-only and still get dedupe.
Signed-off-by: Mark Fasheh <mfasheh@suse.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The find_ref_head shouldn't return the first entry even if no exact match
is found. So move the hidden behavior to higher level.
Besides, remove the useless local variables in the btrfs_select_ref_head.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
[ reformat comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
When writing out a block group free space cache we can end deadlocking
with ourselves on an extent buffer lock resulting in a warning like the
following:
[245043.379979] WARNING: CPU: 4 PID: 2608 at fs/btrfs/locking.c:251 btrfs_tree_lock+0x1be/0x1d0 [btrfs]
[245043.392792] CPU: 4 PID: 2608 Comm: btrfs-transacti Tainted: G
W I 4.16.8 #1
[245043.395489] RIP: 0010:btrfs_tree_lock+0x1be/0x1d0 [btrfs]
[245043.396791] RSP: 0018:ffffc9000424b840 EFLAGS: 00010246
[245043.398093] RAX: 0000000000000a30 RBX: ffff8807e20a3d20 RCX: 0000000000000001
[245043.399414] RDX: 0000000000000001 RSI: 0000000000000002 RDI: ffff8807e20a3d20
[245043.400732] RBP: 0000000000000001 R08: ffff88041f39a700 R09: ffff880000000000
[245043.402021] R10: 0000000000000040 R11: ffff8807e20a3d20 R12: ffff8807cb220630
[245043.403296] R13: 0000000000000001 R14: ffff8807cb220628 R15: ffff88041fbdf000
[245043.404780] FS: 0000000000000000(0000) GS:ffff88082fc80000(0000) knlGS:0000000000000000
[245043.406050] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[245043.407321] CR2: 00007fffdbdb9f10 CR3: 0000000001c09005 CR4: 00000000000206e0
[245043.408670] Call Trace:
[245043.409977] btrfs_search_slot+0x761/0xa60 [btrfs]
[245043.411278] btrfs_insert_empty_items+0x62/0xb0 [btrfs]
[245043.412572] btrfs_insert_item+0x5b/0xc0 [btrfs]
[245043.413922] btrfs_create_pending_block_groups+0xfb/0x1e0 [btrfs]
[245043.415216] do_chunk_alloc+0x1e5/0x2a0 [btrfs]
[245043.416487] find_free_extent+0xcd0/0xf60 [btrfs]
[245043.417813] btrfs_reserve_extent+0x96/0x1e0 [btrfs]
[245043.419105] btrfs_alloc_tree_block+0xfb/0x4a0 [btrfs]
[245043.420378] __btrfs_cow_block+0x127/0x550 [btrfs]
[245043.421652] btrfs_cow_block+0xee/0x190 [btrfs]
[245043.422979] btrfs_search_slot+0x227/0xa60 [btrfs]
[245043.424279] ? btrfs_update_inode_item+0x59/0x100 [btrfs]
[245043.425538] ? iput+0x72/0x1e0
[245043.426798] write_one_cache_group.isra.49+0x20/0x90 [btrfs]
[245043.428131] btrfs_start_dirty_block_groups+0x102/0x420 [btrfs]
[245043.429419] btrfs_commit_transaction+0x11b/0x880 [btrfs]
[245043.430712] ? start_transaction+0x8e/0x410 [btrfs]
[245043.432006] transaction_kthread+0x184/0x1a0 [btrfs]
[245043.433341] kthread+0xf0/0x130
[245043.434628] ? btrfs_cleanup_transaction+0x4e0/0x4e0 [btrfs]
[245043.435928] ? kthread_create_worker_on_cpu+0x40/0x40
[245043.437236] ret_from_fork+0x1f/0x30
[245043.441054] ---[ end trace 15abaa2aaf36827f ]---
This is because at write_one_cache_group() when we are COWing a leaf from
the extent tree we end up allocating a new block group (chunk) and,
because we have hit a threshold on the number of bytes reserved for system
chunks, we attempt to finalize the creation of new block groups from the
current transaction, by calling btrfs_create_pending_block_groups().
However here we also need to modify the extent tree in order to insert
a block group item, and if the location for this new block group item
happens to be in the same leaf that we were COWing earlier, we deadlock
since btrfs_search_slot() tries to write lock the extent buffer that we
locked before at write_one_cache_group().
We have already hit similar cases in the past and commit d9a0540a79
("Btrfs: fix deadlock when finalizing block group creation") fixed some
of those cases by delaying the creation of pending block groups at the
known specific spots that could lead to a deadlock. This change reworks
that commit to be more generic so that we don't have to add similar logic
to every possible path that can lead to a deadlock. This is done by
making __btrfs_cow_block() disallowing the creation of new block groups
(setting the transaction's can_flush_pending_bgs to false) before it
attempts to allocate a new extent buffer for either the extent, chunk or
device trees, since those are the trees that pending block creation
modifies. Once the new extent buffer is allocated, it allows creation of
pending block groups to happen again.
This change depends on a recent patch from Josef which is not yet in
Linus' tree, named "btrfs: make sure we create all new block groups" in
order to avoid occasional warnings at btrfs_trans_release_chunk_metadata().
Fixes: d9a0540a79 ("Btrfs: fix deadlock when finalizing block group creation")
CC: stable@vger.kernel.org # 4.4+
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=199753
Link: https://lore.kernel.org/linux-btrfs/CAJtFHUTHna09ST-_EEiyWmDH6gAqS6wa=zMNMBsifj8ABu99cw@mail.gmail.com/
Reported-by: E V <eliventer@gmail.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When using the NO_HOLES feature and logging a regular file, we were
expecting that if we find an inline extent, that either its size in RAM
(uncompressed and unenconded) matches the size of the file or if it does
not, that it matches the sector size and it represents compressed data.
This assertion does not cover a case where the length of the inline extent
is smaller than the sector size and also smaller the file's size, such
case is possible through fallocate. Example:
$ mkfs.btrfs -f -O no-holes /dev/sdb
$ mount /dev/sdb /mnt
$ xfs_io -f -c "pwrite -S 0xb60 0 21" /mnt/foobar
$ xfs_io -c "falloc 40 40" /mnt/foobar
$ xfs_io -c "fsync" /mnt/foobar
In the above example we trigger the assertion because the inline extent's
length is 21 bytes while the file size is 80 bytes. The fallocate() call
merely updated the file's size and did not touch the existing inline
extent, as expected.
So fix this by adjusting the assertion so that an inline extent length
smaller than the file size is valid if the file size is smaller than the
filesystem's sector size.
A test case for fstests follows soon.
Reported-by: Anatoly Trosinenko <anatoly.trosinenko@gmail.com>
Fixes: a89ca6f24f ("Btrfs: fix fsync after truncate when no_holes feature is enabled")
CC: stable@vger.kernel.org # 4.14+
Link: https://lore.kernel.org/linux-btrfs/CAE5jQCfRSBC7n4pUTFJcmHh109=gwyT9mFkCOL+NKfzswmR=_Q@mail.gmail.com/
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At inode.c:compress_file_range(), under the "free_pages_out" label, we can
end up dereferencing the "pages" pointer when it has a NULL value. This
case happens when "start" has a value of 0 and we fail to allocate memory
for the "pages" pointer. When that happens we jump to the "cont" label and
then enter the "if (start == 0)" branch where we immediately call the
cow_file_range_inline() function. If that function returns 0 (success
creating an inline extent) or an error (like -ENOMEM for example) we jump
to the "free_pages_out" label and then access "pages[i]" leading to a NULL
pointer dereference, since "nr_pages" has a value greater than zero at
that point.
Fix this by setting "nr_pages" to 0 when we fail to allocate memory for
the "pages" pointer.
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=201119
Fixes: 771ed689d2 ("Btrfs: Optimize compressed writeback and reads")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This doesn't work on stacked devices, and it doesn't work on
blk-mq devices. The request_list is only used on legacy, which
we don't have much of anymore, and soon won't have any of.
Kill the check.
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: linux-f2fs-devel@lists.sourceforge.net
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Since we can use the filesystem without quotas till next boot.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The REQ_TIME should be updated only in case of success cases
as followed at all other places in the file system.
Signed-off-by: Sahitya Tummala <stummala@codeaurora.org>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Commit 7735730d39 ("f2fs: fix to propagate error from __get_meta_page()")
added disable_nat_bits() in error path of __get_nat_bitmaps(), but it's
unneeded, beause we will fail mount, we won't have chance to change nid
usage status w/o nat full/empty bitmaps.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Commit a2a12b679f ("f2fs: export SSR allocation threshold") introduced
two threshold .min_ssr_sections and .trigger_ssr_threshold, but only
.min_ssr_sections is used, so just remove redundant one for cleanup.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
file_set_{cold,hot} doesn't need holding sbi->sb_lock, so moving them
out of the lock.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
As rbtree supports caching leftmost node natively, update f2fs codes
to use rb_*_cached helpers to speed up leftmost node visiting.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Testcase to reproduce this bug:
1. mkfs.f2fs /dev/sdd
2. mount -t f2fs /dev/sdd /mnt/f2fs
3. touch /mnt/f2fs/file
4. sync
5. chattr +A /mnt/f2fs/file
6. xfs_io -f /mnt/f2fs/file -c "fsync"
7. godown /mnt/f2fs
8. umount /mnt/f2fs
9. mount -t f2fs /dev/sdd /mnt/f2fs
10. chattr -A /mnt/f2fs/file
11. xfs_io -f /mnt/f2fs/file -c "fsync"
12. umount /mnt/f2fs
13. mount -t f2fs /dev/sdd /mnt/f2fs
14. lsattr /mnt/f2fs/file
-----------------N- /mnt/f2fs/file
But actually, we expect the corrct result is:
-------A---------N- /mnt/f2fs/file
The reason is in step 9) we missed to recover cold bit flag in inode
block, so later, in fsync, we will skip write inode block due to below
condition check, result in lossing data in another SPOR.
f2fs_fsync_node_pages()
if (!IS_DNODE(page) || !is_cold_node(page))
continue;
Note that, I guess that some non-dir inode has already lost cold bit
during POR, so in order to reenable recovery for those inode, let's
try to recover cold bit in f2fs_iget() to save more fsynced data.
Fixes: c56675750d ("f2fs: remove unneeded set_cold_node()")
Cc: <stable@vger.kernel.org> 4.17+
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
When migrating encrypted block from background GC thread, we only add
them into f2fs inner bio cache, but forget to submit the cached bio, it
may cause potential deadlock when we are waiting page writebacked, fix
it.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Note that, it requires "f2fs: return correct errno in f2fs_gc".
This adds a lightweight non-persistent snapshotting scheme to f2fs.
To use, mount with the option checkpoint=disable, and to return to
normal operation, remount with checkpoint=enable. If the filesystem
is shut down before remounting with checkpoint=enable, it will revert
back to its apparent state when it was first mounted with
checkpoint=disable. This is useful for situations where you wish to be
able to roll back the state of the disk in case of some critical
failure.
Signed-off-by: Daniel Rosenberg <drosen@google.com>
[Jaegeuk Kim: use SB_RDONLY instead of MS_RDONLY]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
When an invalid mount option is passed to jffs2, jffs2_parse_options()
will fail and jffs2_sb_info will be freed, but then jffs2_sb_info will
be used (use-after-free) and freeed (double-free) in jffs2_kill_sb().
Fix it by removing the buggy invocation of kfree() when getting invalid
mount options.
Fixes: 92abc475d8 ("jffs2: implement mount option parsing and compression overriding")
Cc: stable@kernel.org
Signed-off-by: Hou Tao <houtao1@huawei.com>
Reviewed-by: Richard Weinberger <richard@nod.at>
Signed-off-by: Boris Brezillon <boris.brezillon@bootlin.com>
All the other functions that deal with the sd_ail_list run the list
from the tail back to the head, iow, in reverse. We should do the
same while writing revokes, otherwise we might miss removing entries
properly from the list when we hit the limit of how many revokes we
can write at one time (based on block size, which determines how
many block pointers will fit in the revoke block).
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Using bool is more suitable than int here, and add the comment about the
return_bigger.
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The avg_delayed_ref_runtime can be referenced from the transaction
handle.
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It can be referenced from the transaction handle.
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since trans is only used for referring to delayed_refs, there is no need
to pass it instead of delayed_refs to btrfs_delayed_ref_lock().
No functional change.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since trans is only used for referring to delayed_refs, there is no need
to pass it instead of delayed_refs to btrfs_select_ref_head(). No
functional change.
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is no reason to put this check in (!qgroup)'s else branch because
if qgroup is null, it will goto out directly. So move it out to reduce
indentation level. No functional change.
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit 581c176041 ("btrfs: Validate child tree block's level and first
key") has made tree block level check mandatory.
So if tree block level doesn't match, we won't get a valid extent
buffer. The extra WARN_ON() check can be removed completely.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
And add one line comment explaining what we're doing for each loop.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Some qgroup trace events like btrfs_qgroup_release_data() and
btrfs_qgroup_free_delayed_ref() can still be triggered even if qgroup is
not enabled.
This is caused by the lack of qgroup status check before calling some
qgroup functions. Thankfully the functions can handle quota disabled
case well and just do nothing for qgroup disabled case.
This patch will do earlier check before triggering related trace events.
And for enabled <-> disabled race case:
1) For enabled->disabled case
Disable will wipe out all qgroups data including reservation and
excl/rfer. Even if we leak some reservation or numbers, it will
still be cleared, so nothing will go wrong.
2) For disabled -> enabled case
Current btrfs_qgroup_release_data() will use extent_io tree to ensure
we won't underflow reservation. And for delayed_ref we use
head->qgroup_reserved to record the reserved space, so in that case
head->qgroup_reserved should be 0 and we won't underflow.
CC: stable@vger.kernel.org # 4.14+
Reported-by: Chris Murphy <lists@colorremedies.com>
Link: https://lore.kernel.org/linux-btrfs/CAJCQCtQau7DtuUUeycCkZ36qjbKuxNzsgqJ7+sJ6W0dK_NLE3w@mail.gmail.com/
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In a scenario like the following:
mkdir /mnt/A # inode 258
mkdir /mnt/B # inode 259
touch /mnt/B/bar # inode 260
sync
mv /mnt/B/bar /mnt/A/bar
mv -T /mnt/A /mnt/B
fsync /mnt/B/bar
<power fail>
After replaying the log we end up with file bar having 2 hard links, both
with the name 'bar' and one in the directory with inode number 258 and the
other in the directory with inode number 259. Also, we end up with the
directory inode 259 still existing and with the directory inode 258 still
named as 'A', instead of 'B'. In this scenario, file 'bar' should only
have one hard link, located at directory inode 258, the directory inode
259 should not exist anymore and the name for directory inode 258 should
be 'B'.
This incorrect behaviour happens because when attempting to log the old
parents of an inode, we skip any parents that no longer exist. Fix this
by forcing a full commit if an old parent no longer exists.
A test case for fstests follows soon.
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When replaying a log which contains a tmpfile (which necessarily has a
link count of 0) we end up calling inc_nlink(), at
fs/btrfs/tree-log.c:replay_one_buffer(), which produces a warning like
the following:
[195191.943673] WARNING: CPU: 0 PID: 6924 at fs/inode.c:342 inc_nlink+0x33/0x40
[195191.943723] CPU: 0 PID: 6924 Comm: mount Not tainted 4.19.0-rc6-btrfs-next-38 #1
[195191.943724] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
[195191.943726] RIP: 0010:inc_nlink+0x33/0x40
[195191.943728] RSP: 0018:ffffb96e425e3870 EFLAGS: 00010246
[195191.943730] RAX: 0000000000000000 RBX: ffff8c0d1e6af4f0 RCX: 0000000000000006
[195191.943731] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8c0d1e6af4f0
[195191.943731] RBP: 0000000000000097 R08: 0000000000000001 R09: 0000000000000000
[195191.943732] R10: 0000000000000000 R11: 0000000000000000 R12: ffffb96e425e3a60
[195191.943733] R13: ffff8c0d10cff0c8 R14: ffff8c0d0d515348 R15: ffff8c0d78a1b3f8
[195191.943735] FS: 00007f570ee24480(0000) GS:ffff8c0dfb200000(0000) knlGS:0000000000000000
[195191.943736] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[195191.943737] CR2: 00005593286277c8 CR3: 00000000bb8f2006 CR4: 00000000003606f0
[195191.943739] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[195191.943740] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[195191.943741] Call Trace:
[195191.943778] replay_one_buffer+0x797/0x7d0 [btrfs]
[195191.943802] walk_up_log_tree+0x1c1/0x250 [btrfs]
[195191.943809] ? rcu_read_lock_sched_held+0x3f/0x70
[195191.943825] walk_log_tree+0xae/0x1d0 [btrfs]
[195191.943840] btrfs_recover_log_trees+0x1d7/0x4d0 [btrfs]
[195191.943856] ? replay_dir_deletes+0x280/0x280 [btrfs]
[195191.943870] open_ctree+0x1c3b/0x22a0 [btrfs]
[195191.943887] btrfs_mount_root+0x6b4/0x800 [btrfs]
[195191.943894] ? rcu_read_lock_sched_held+0x3f/0x70
[195191.943899] ? pcpu_alloc+0x55b/0x7c0
[195191.943906] ? mount_fs+0x3b/0x140
[195191.943908] mount_fs+0x3b/0x140
[195191.943912] ? __init_waitqueue_head+0x36/0x50
[195191.943916] vfs_kern_mount+0x62/0x160
[195191.943927] btrfs_mount+0x134/0x890 [btrfs]
[195191.943936] ? rcu_read_lock_sched_held+0x3f/0x70
[195191.943938] ? pcpu_alloc+0x55b/0x7c0
[195191.943943] ? mount_fs+0x3b/0x140
[195191.943952] ? btrfs_remount+0x570/0x570 [btrfs]
[195191.943954] mount_fs+0x3b/0x140
[195191.943956] ? __init_waitqueue_head+0x36/0x50
[195191.943960] vfs_kern_mount+0x62/0x160
[195191.943963] do_mount+0x1f9/0xd40
[195191.943967] ? memdup_user+0x4b/0x70
[195191.943971] ksys_mount+0x7e/0xd0
[195191.943974] __x64_sys_mount+0x21/0x30
[195191.943977] do_syscall_64+0x60/0x1b0
[195191.943980] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[195191.943983] RIP: 0033:0x7f570e4e524a
[195191.943986] RSP: 002b:00007ffd83589478 EFLAGS: 00000206 ORIG_RAX: 00000000000000a5
[195191.943989] RAX: ffffffffffffffda RBX: 0000563f335b2060 RCX: 00007f570e4e524a
[195191.943990] RDX: 0000563f335b2240 RSI: 0000563f335b2280 RDI: 0000563f335b2260
[195191.943992] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000020
[195191.943993] R10: 00000000c0ed0000 R11: 0000000000000206 R12: 0000563f335b2260
[195191.943994] R13: 0000563f335b2240 R14: 0000000000000000 R15: 00000000ffffffff
[195191.944002] irq event stamp: 8688
[195191.944010] hardirqs last enabled at (8687): [<ffffffff9cb004c3>] console_unlock+0x503/0x640
[195191.944012] hardirqs last disabled at (8688): [<ffffffff9ca037dd>] trace_hardirqs_off_thunk+0x1a/0x1c
[195191.944018] softirqs last enabled at (8638): [<ffffffff9cc0a5d1>] __set_page_dirty_nobuffers+0x101/0x150
[195191.944020] softirqs last disabled at (8634): [<ffffffff9cc26bbe>] wb_wakeup_delayed+0x2e/0x60
[195191.944022] ---[ end trace 5d6e873a9a0b811a ]---
This happens because the inode does not have the flag I_LINKABLE set,
which is a runtime only flag, not meant to be persisted, set when the
inode is created through open(2) if the flag O_EXCL is not passed to it.
Except for the warning, there are no other consequences (like corruptions
or metadata inconsistencies).
Since it's pointless to replay a tmpfile as it would be deleted in a
later phase of the log replay procedure (it has a link count of 0), fix
this by not logging tmpfiles and if a tmpfile is found in a log (created
by a kernel without this change), skip the replay of the inode.
A test case for fstests follows soon.
Fixes: 471d557afe ("Btrfs: fix loss of prealloc extents past i_size after fsync log replay")
CC: stable@vger.kernel.org # 4.18+
Reported-by: Martin Steigerwald <martin@lichtvoll.de>
Link: https://lore.kernel.org/linux-btrfs/3666619.NTnn27ZJZE@merkaba/
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We don't need it, rsv->size is set once and never changes throughout
its lifetime, so just use that for the reserve size.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
I ran into an issue where there was some reference being held on an
inode that I couldn't track. This assert wasn't triggered, but it at
least rules out we're doing something stupid.
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Allocating new chunks modifies both the extent and chunk tree, which can
trigger new chunk allocations. So instead of doing list_for_each_safe,
just do while (!list_empty()) so we make sure we don't exit with other
pending bg's still on our list.
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We need to clear the max_extent_size when we clear bits from a bitmap
since it could have been from the range that contains the
max_extent_size.
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we're allocating a new space cache inode it's likely going to be
under a transaction handle, so we need to use memalloc_nofs_save() in
order to avoid deadlocks, and more importantly lockdep messages that
make xfstests fail.
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We want to release the unused reservation we have since it refills the
delayed refs reserve, which will make everything go smoother when
running the delayed refs if we're short on our reservation.
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Btrfs's btree locking has two modes, spinning mode and blocking mode,
while searching btree, locking is always acquired in spinning mode and
then converted to blocking mode if necessary, and in some hot paths we may
switch the locking back to spinning mode by btrfs_clear_path_blocking().
When acquiring locks, both of reader and writer need to wait for blocking
readers and writers to complete before doing read_lock()/write_lock().
The problem is that btrfs_clear_path_blocking() needs to switch nodes
in the path to blocking mode at first (by btrfs_set_path_blocking) to
make lockdep happy before doing its actual clearing blocking job.
When switching to blocking mode from spinning mode, it consists of
step 1) bumping up blocking readers counter and
step 2) read_unlock()/write_unlock(),
this has caused serious ping-pong effect if there're a great amount of
concurrent readers/writers, as waiters will be woken up and go to
sleep immediately.
1) Killing this kind of ping-pong results in a big improvement in my 1600k
files creation script,
MNT=/mnt/btrfs
mkfs.btrfs -f /dev/sdf
mount /dev/def $MNT
time fsmark -D 10000 -S0 -n 100000 -s 0 -L 1 -l /tmp/fs_log.txt \
-d $MNT/0 -d $MNT/1 \
-d $MNT/2 -d $MNT/3 \
-d $MNT/4 -d $MNT/5 \
-d $MNT/6 -d $MNT/7 \
-d $MNT/8 -d $MNT/9 \
-d $MNT/10 -d $MNT/11 \
-d $MNT/12 -d $MNT/13 \
-d $MNT/14 -d $MNT/15
w/o patch:
real 2m27.307s
user 0m12.839s
sys 13m42.831s
w/ patch:
real 1m2.273s
user 0m15.802s
sys 8m16.495s
1.1) latency histogram from funclatency[1]
Overall with the patch, there're ~50% less write lock acquisition and
the 95% max latency that write lock takes also reduces to ~100ms from
>500ms.
--------------------------------------------
w/o patch:
--------------------------------------------
Function = btrfs_tree_lock
msecs : count distribution
0 -> 1 : 2385222 |****************************************|
2 -> 3 : 37147 | |
4 -> 7 : 20452 | |
8 -> 15 : 13131 | |
16 -> 31 : 3877 | |
32 -> 63 : 3900 | |
64 -> 127 : 2612 | |
128 -> 255 : 974 | |
256 -> 511 : 165 | |
512 -> 1023 : 13 | |
Function = btrfs_tree_read_lock
msecs : count distribution
0 -> 1 : 6743860 |****************************************|
2 -> 3 : 2146 | |
4 -> 7 : 190 | |
8 -> 15 : 38 | |
16 -> 31 : 4 | |
--------------------------------------------
w/ patch:
--------------------------------------------
Function = btrfs_tree_lock
msecs : count distribution
0 -> 1 : 1318454 |****************************************|
2 -> 3 : 6800 | |
4 -> 7 : 3664 | |
8 -> 15 : 2145 | |
16 -> 31 : 809 | |
32 -> 63 : 219 | |
64 -> 127 : 10 | |
Function = btrfs_tree_read_lock
msecs : count distribution
0 -> 1 : 6854317 |****************************************|
2 -> 3 : 2383 | |
4 -> 7 : 601 | |
8 -> 15 : 92 | |
2) dbench also proves the improvement,
dbench -t 120 -D /mnt/btrfs 16
w/o patch:
Throughput 158.363 MB/sec
w/ patch:
Throughput 449.52 MB/sec
3) xfstests didn't show any additional failures.
One thing to note is that callers may set path->leave_spinning to have
all nodes in the path stay in spinning mode, which means callers are
ready to not sleep before releasing the path, but it won't cause
problems if they don't want to sleep in blocking mode.
[1]: https://github.com/iovisor/bcc/blob/master/tools/funclatency.py
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The value of blocking_readers is increased only when the lock is taken
for read, no way we can fail the condition with the write lock.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The replace_wait and bio_counter were mistakenly added to fs_info in
commit c404e0dc2c ("Btrfs: fix use-after-free in the finishing
procedure of the device replace"), but they logically belong to
fs_info::dev_replace. Besides, bio_counter is a very generic name and is
confusing in bare fs_info context.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The exit sequence in btrfs_dev_replace_start does not allow to simply
add a label to the right place so the error handling after starting
transaction failure jumps there. Currently there's a lock that pairs
with the unlock in the section, which is unnecessary and only raises
questions. Add a variable to track the locking status and avoid the
extra locking.
Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Too trivial, the purpose can be simply documented in a comment.
Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The wrapper is too trivial, open coding does not make it less readable.
Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's a single caller and the function name does not say it's actually
taking the lock, so open coding makes it more explicit.
For now, btrfs_dev_replace_read_lock is used instead of read_lock so
it's paired with the unlocking wrapper in the same block.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This member seems to be copied from the extent_buffer locking scheme and
is at least used to assert that the read lock/unlock is properly nested.
In some way. While the _inc/_dec are called inside the read lock
section, the asserts are both inside and outside, so the ordering is not
guaranteed and we can see read/inc/dec ordered in any way
(theoretically).
A missing call of btrfs_dev_replace_clear_lock_blocking could cause
unexpected read_locks count, so this at least looks like a valid
assertion, but this will become unnecessary with later updates.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Although we have tree level check at tree read runtime, it's completely
based on its parent level.
We still need to do accurate level check to avoid invalid tree blocks
sneak into kernel space.
The check itself is simple, for leaf its level should always be 0.
For nodes its level should be in range [1, BTRFS_MAX_LEVEL - 1].
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Su Yue <suy.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For qgroup_trace_extent_swap(), if we find one leaf that needs to be
traced, we will also iterate all file extents and trace them.
This is OK if we're relocating data block groups, but if we're
relocating metadata block groups, balance code itself has ensured that
both subtree of file tree and reloc tree contain the same contents.
That's to say, if we're relocating metadata block groups, all file
extents in reloc and file tree should match, thus no need to trace them.
This should reduce the total number of dirty extents processed in metadata
block group balance.
[[Benchmark]] (with all previous enhancement)
Hardware:
VM 4G vRAM, 8 vCPUs,
disk is using 'unsafe' cache mode,
backing device is SAMSUNG 850 evo SSD.
Host has 16G ram.
Mkfs parameter:
--nodesize 4K (To bump up tree size)
Initial subvolume contents:
4G data copied from /usr and /lib.
(With enough regular small files)
Snapshots:
16 snapshots of the original subvolume.
each snapshot has 3 random files modified.
balance parameter:
-m
So the content should be pretty similar to a real world root fs layout.
| v4.19-rc1 | w/ patchset | diff (*)
---------------------------------------------------------------
relocated extents | 22929 | 22851 | -0.3%
qgroup dirty extents | 227757 | 140886 | -38.1%
time (sys) | 65.253s | 37.464s | -42.6%
time (real) | 74.032s | 44.722s | -39.6%
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Reloc tree doesn't contribute to qgroup numbers, as we have accounted
them at balance time (see replace_path()).
Skipping the unneeded subtree tracing should reduce the overhead.
[[Benchmark]]
Hardware:
VM 4G vRAM, 8 vCPUs,
disk is using 'unsafe' cache mode,
backing device is SAMSUNG 850 evo SSD.
Host has 16G ram.
Mkfs parameter:
--nodesize 4K (To bump up tree size)
Initial subvolume contents:
4G data copied from /usr and /lib.
(With enough regular small files)
Snapshots:
16 snapshots of the original subvolume.
each snapshot has 3 random files modified.
balance parameter:
-m
So the content should be pretty similar to a real world root fs layout.
| v4.19-rc1 | w/ patchset | diff (*)
---------------------------------------------------------------
relocated extents | 22929 | 22900 | -0.1%
qgroup dirty extents | 227757 | 167139 | -26.6%
time (sys) | 65.253s | 50.123s | -23.2%
time (real) | 74.032s | 52.551s | -29.0%
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Before this patch, with quota enabled during balance, we need to mark
the whole subtree dirty for quota.
E.g.
OO = Old tree blocks (from file tree)
NN = New tree blocks (from reloc tree)
File tree (src) Reloc tree (dst)
OO (a) NN (a)
/ \ / \
(b) OO OO (c) (b) NN NN (c)
/ \ / \ / \ / \
OO OO OO OO (d) OO OO OO NN (d)
For old balance + quota case, quota will mark the whole src and dst tree
dirty, including all the 3 old tree blocks in reloc tree.
It's doable for small file tree or new tree blocks are all located at
lower level.
But for large file tree or new tree blocks are all located at higher
level, this will lead to mark the whole tree dirty, and be unbelievably
slow.
This patch will change how we handle such balance with quota enabled
case.
Now we will search from (b) and (c) for any new tree blocks whose
generation is equal to @last_snapshot, and only mark them dirty.
In above case, we only need to trace tree blocks NN(b), NN(c) and NN(d).
(NN(a) will be traced when COW happens for nodeptr modification). And
also for tree blocks OO(b), OO(c), OO(d). (OO(a) will be traced when COW
happens for nodeptr modification.)
For above case, we could skip 3 tree blocks, but for larger tree, we can
skip tons of unmodified tree blocks, and hugely speed up balance.
This patch will introduce a new function,
btrfs_qgroup_trace_subtree_swap(), which will do the following main
work:
1) Read out real root eb
And setup basic dst_path for later calls
2) Call qgroup_trace_new_subtree_blocks()
To trace all new tree blocks in reloc tree and their counter
parts in the file tree.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Introduce new function, qgroup_trace_new_subtree_blocks(), to iterate
all new tree blocks in a reloc tree.
So that qgroup could skip unrelated tree blocks during balance, which
should hugely speedup balance speed when quota is enabled.
The function qgroup_trace_new_subtree_blocks() itself only cares about
new tree blocks in reloc tree.
All its main works are:
1) Read out tree blocks according to parent pointers
2) Do recursive depth-first search
Will call the same function on all its children tree blocks, with
search level set to current level -1.
And will also skip all children whose generation is smaller than
@last_snapshot.
3) Call qgroup_trace_extent_swap() to trace tree blocks
So although we have parameter list related to source file tree, it's not
used at all, but only passed to qgroup_trace_extent_swap().
Thus despite the tree read code, the core should be pretty short and all
about recursive depth-first search.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Introduce a new function, qgroup_trace_extent_swap(), which will be used
later for balance qgroup speedup.
The basis idea of balance is swapping tree blocks between reloc tree and
the real file tree.
The swap will happen in highest tree block, but there may be a lot of
tree blocks involved.
For example:
OO = Old tree blocks
NN = New tree blocks allocated during balance
File tree (257) Reloc tree for 257
L2 OO NN
/ \ / \
L1 OO OO (a) OO NN (a)
/ \ / \ / \ / \
L0 OO OO OO OO OO OO NN NN
(b) (c) (b) (c)
When calling qgroup_trace_extent_swap(), we will pass:
@src_eb = OO(a)
@dst_path = [ nodes[1] = NN(a), nodes[0] = NN(c) ]
@dst_level = 0
@root_level = 1
In that case, qgroup_trace_extent_swap() will search from OO(a) to
reach OO(c), then mark both OO(c) and NN(c) as qgroup dirty.
The main work of qgroup_trace_extent_swap() can be split into 3 parts:
1) Tree search from @src_eb
It should acts as a simplified btrfs_search_slot().
The key for search can be extracted from @dst_path->nodes[dst_level]
(first key).
2) Mark the final tree blocks in @src_path and @dst_path qgroup dirty
NOTE: In above case, OO(a) and NN(a) won't be marked qgroup dirty.
They should be marked during preivous (@dst_level = 1) iteration.
3) Mark file extents in leaves dirty
We don't have good way to pick out new file extents only.
So we still follow the old method by scanning all file extents in
the leave.
This function can free us from keeping two pathes, thus later we only need
to care about how to iterate all new tree blocks in reloc tree.
Signed-off-by: Qu Wenruo <wqu@suse.com>
[ copy changelog to function comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
Number of qgroup dirty extents is directly linked to the performance
overhead, so add a new trace event, trace_qgroup_num_dirty_extents(), to
record how many dirty extents is processed in
btrfs_qgroup_account_extents().
This will be pretty handy to analyze later balance performance
improvement.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
fs/btrfs/relocation.c:build_backref_tree() is some code from 2009 era,
although it works pretty fine, it's not that easy to understand.
Especially combined with the complex btrfs backref format.
This patch adds some basic comment for the backref build part of the
code, making it less hard to read, at least for backref searching part.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The only aops we define for symlinks are identical to the aops for
regular files. This has been the case since symlink support was added in
commit 2b8d99a723 ("Btrfs: symlinks and hard links"). As far as I can
tell, there wasn't a good reason to have separate aops then, and there
isn't now, so let's just do what most other filesystems do and reuse the
same structure.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
During buffered writes, we follow this basic series of steps:
again:
lock all the pages
wait for writeback on all the pages
Take the extent range lock
wait for ordered extents on the whole range
clean all the pages
if (copy_from_user_in_atomic() hits a fault) {
drop our locks
goto again;
}
dirty all the pages
release all the locks
The extra waiting, cleaning and locking are there to make sure we don't
modify pages in flight to the drive, after they've been crc'd.
If some of the pages in the range were already dirty when the write
began, and we need to goto again, we create a window where a dirty page
has been cleaned and unlocked. It may be reclaimed before we're able to
lock it again, which means we'll read the old contents off the drive and
lose any modifications that had been pending writeback.
We don't actually need to clean the pages. All of the other locking in
place makes sure we don't start IO on the pages, so we can just leave
them dirty for the duration of the write.
Fixes: 73d59314e6 (the original btrfs merge)
CC: stable@vger.kernel.org # v4.4+
Signed-off-by: Chris Mason <clm@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The helper does the same math and we take care about the special case
when flags is 0 too.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Refactor the delayed refs loop by using the newly introduced
btrfs_run_delayed_refs_for_head function. This greatly simplifies
__btrfs_run_delayed_refs and makes it more obvious what is happening.
We now have 1 loop which iterates the existing delayed_heads and then
each selected ref head is processed by the new helper. All existing
semantics of the code are preserved so no functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This patch introduces a new helper encompassing the implicit inner loop
in __btrfs_run_delayed_refs which processes all the refs for a given
head. The code is mostly copy/paste, the only difference is that if we
detect a newer reference then -EAGAIN is returned so that callers can
react correctly.
Also, at the end of the loop the head is relocked and
btrfs_merge_delayed_refs is run again to retain the pre-refactoring
semantics.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is in preparation to refactor the giant loop in
__btrfs_run_delayed_refs. As a first step define a new function
which implements acquiring a reference to a btrfs_delayed_refs_head and
use it. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Avoid the inline ifdefs and use two sections for self-tests enabled and
disabled.
Though there could be no ifdef and unconditional test_bit of
BTRFS_FS_STATE_DUMMY_FS_INFO, the static inline can help to optimize out
any code that would depend on conditions using btrfs_is_testing.
As this is only for the testing code, drop unlikely().
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The data used only for tests are better placed at the end of the
structure so that they don't change the structure layout. All new
members of btrfs_root should be placed before.
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The helper find_lock_delalloc_range is now conditionally built static,
dpending on whether the self-tests are enabled or not. There's a macro
that is supposed to hide the export, used only once. To discourage
further use, drop it an add a public wrapper for the helper needed by
tests.
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
rb_first_cached() trades an extra pointer "leftmost" for doing the same
job as rb_first() but in O(1).
While resolving indirect refs and missing refs, it always looks for the
first rb entry in a while loop, it's helpful to use rb_first_cached
instead.
For more details about the optimization see patch "Btrfs: delayed-refs:
use rb_first_cached for href_root".
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
rb_first_cached() trades an extra pointer "leftmost" for doing the
same job as rb_first() but in O(1).
As evict_inode_truncate_pages() removes all extent mapping by always
looking for the first rb entry, it's helpful to use rb_first_cached
instead.
For more details about the optimization see patch "Btrfs: delayed-refs:
use rb_first_cached for href_root".
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
rb_first_cached() trades an extra pointer "leftmost" for doing the same job as
rb_first() but in O(1).
Functions manipulating delayed_item need to get the first entry, this converts
it to use rb_first_cached().
For more details about the optimization see patch "Btrfs: delayed-refs:
use rb_first_cached for href_root".
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
rb_first_cached() trades an extra pointer "leftmost" for doing the same
job as rb_first() but in O(1).
Functions manipulating href->ref_tree need to get the first entry, this
converts href->ref_tree to use rb_first_cached().
For more details about the optimization see patch "Btrfs: delayed-refs:
use rb_first_cached for href_root".
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
rb_first_cached() trades an extra pointer "leftmost" for doing the same
job as rb_first() but in O(1).
Functions manipulating href_root need to get the first entry, this
converts href_root to use rb_first_cached().
This patch is first in the sequenct of similar updates to other rbtrees
and this is analysis of the expected behaviour and improvements.
There's a common pattern:
while (node = rb_first) {
entry = rb_entry(node)
next = rb_next(node)
rb_erase(node)
cleanup(entry)
}
rb_first needs to traverse the tree up to logN depth, rb_erase can
completely reshuffle the tree. With the caching we'll skip the traversal
in rb_first. That's a cached memory access vs looped pointer
dereference trade-off that IMHO has a clear winner.
Measurements show there's not much difference in a sample tree with
10000 nodes: 4.5s / rb_first and 4.8s / rb_first_cached. Real effects of
caching and pointer chasing are unpredictable though.
Further optimzations can be done to avoid the expensive rb_erase step.
In some cases it's ok to process the nodes in any order, so the tree can
be traversed in post-order, not rebalancing the children nodes and just
calling free. Care must be taken regarding the next node.
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog from mail discussions ]
Signed-off-by: David Sterba <dsterba@suse.com>
While testing my backport I noticed there was a panic if I ran
generic/416 generic/417 generic/418 all in a row. This just happened to
uncover a race where we had outstanding IO after we destroy all of our
workqueues, and then we'd go to queue the endio work on those free'd
workqueues.
This is because we aren't waiting for the caching threads to be done
before freeing everything up, so to fix this make sure we wait on any
outstanding caching that's being done before we free up the block group,
so we're sure to be done with all IO by the time we get to
btrfs_stop_all_workers(). This fixes the panic I was seeing
consistently in testing.
------------[ cut here ]------------
kernel BUG at fs/btrfs/volumes.c:6112!
SMP PTI
Modules linked in:
CPU: 1 PID: 27165 Comm: kworker/u4:7 Not tainted 4.16.0-02155-g3553e54a578d-dirty #875
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014
Workqueue: btrfs-cache btrfs_cache_helper
RIP: 0010:btrfs_map_bio+0x346/0x370
RSP: 0000:ffffc900061e79d0 EFLAGS: 00010202
RAX: 0000000000000000 RBX: ffff880071542e00 RCX: 0000000000533000
RDX: ffff88006bb74380 RSI: 0000000000000008 RDI: ffff880078160000
RBP: 0000000000000001 R08: ffff8800781cd200 R09: 0000000000503000
R10: ffff88006cd21200 R11: 0000000000000000 R12: 0000000000000000
R13: 0000000000000000 R14: ffff8800781cd200 R15: ffff880071542e00
FS: 0000000000000000(0000) GS:ffff88007fd00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000000000817ffc4 CR3: 0000000078314000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
btree_submit_bio_hook+0x8a/0xd0
submit_one_bio+0x5d/0x80
read_extent_buffer_pages+0x18a/0x320
btree_read_extent_buffer_pages+0xbc/0x200
? alloc_extent_buffer+0x359/0x3e0
read_tree_block+0x3d/0x60
read_block_for_search.isra.30+0x1a5/0x360
btrfs_search_slot+0x41b/0xa10
btrfs_next_old_leaf+0x212/0x470
caching_thread+0x323/0x490
normal_work_helper+0xc5/0x310
process_one_work+0x141/0x340
worker_thread+0x44/0x3c0
kthread+0xf8/0x130
? process_one_work+0x340/0x340
? kthread_bind+0x10/0x10
ret_from_fork+0x35/0x40
RIP: btrfs_map_bio+0x346/0x370 RSP: ffffc900061e79d0
---[ end trace 827eb13e50846033 ]---
Kernel panic - not syncing: Fatal exception
Kernel Offset: disabled
---[ end Kernel panic - not syncing: Fatal exception
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit 499f377f49 (btrfs: iterate over unused chunk space in FITRIM)
fixed free space trimming, but introduced latency when it was running.
This is due to it pinning the transaction using both a incremented
refcount and holding the commit root sem for the duration of a single
trim operation.
This was to ensure safety but it's unnecessary. We already hold the the
chunk mutex so we know that the chunk we're using can't be allocated
while we're trimming it.
In order to check against chunks allocated already in this transaction,
we need to check the pending chunks list. To to that safely without
joining the transaction (or attaching than then having to commit it) we
need to ensure that the dev root's commit root doesn't change underneath
us and the pending chunk lists stays around until we're done with it.
We can ensure the former by holding the commit root sem and the latter
by pinning the transaction. We do this now, but the critical section
covers the trim operation itself and we don't need to do that.
This patch moves the pinning and unpinning logic into helpers and unpins
the transaction after performing the search and check for pending
chunks.
Limiting the critical section of the transaction pinning improves the
latency substantially on slower storage (e.g. image files over NFS).
Fixes: 499f377f49 ("btrfs: iterate over unused chunk space in FITRIM")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We check whether any device the file system is using supports discard in
the ioctl call, but then we attempt to trim free extents on every device
regardless of whether discard is supported. Due to the way we mask off
EOPNOTSUPP, we can end up issuing the trim operations on each free range
on devices that don't support it, just wasting time.
Fixes: 499f377f49 ("btrfs: iterate over unused chunk space in FITRIM")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_trim_fs iterates over the fs_devices->alloc_list while holding the
device_list_mutex. The problem is that ->alloc_list is protected by the
chunk mutex. We don't want to hold the chunk mutex over the trim of the
entire file system. Fortunately, the ->dev_list list is protected by
the dev_list mutex and while it will give us all devices, including
read-only devices, we already just skip the read-only devices. Then we
can continue to take and release the chunk mutex while scanning each
device.
Fixes: 499f377f49 ("btrfs: iterate over unused chunk space in FITRIM")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
fstrim on some btrfs only trims the unallocated space, not trimming any
space in existing block groups.
[CAUSE]
Before fstrim_range passed to btrfs_trim_fs(), it gets truncated to
range [0, super->total_bytes). So later btrfs_trim_fs() will only be
able to trim block groups in range [0, super->total_bytes).
While for btrfs, any bytenr aligned to sectorsize is valid, since btrfs
uses its logical address space, there is nothing limiting the location
where we put block groups.
For filesystem with frequent balance, it's quite easy to relocate all
block groups and bytenr of block groups will start beyond
super->total_bytes.
In that case, btrfs will not trim existing block groups.
[FIX]
Just remove the truncation in btrfs_ioctl_fitrim(), so btrfs_trim_fs()
can get the unmodified range, which is normally set to [0, U64_MAX].
Reported-by: Chris Murphy <lists@colorremedies.com>
Fixes: f4c697e640 ("btrfs: return EINVAL if start > total_bytes in fitrim ioctl")
CC: <stable@vger.kernel.org> # v4.4+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Function btrfs_trim_fs() doesn't handle errors in a consistent way. If
error happens when trimming existing block groups, it will skip the
remaining blocks and continue to trim unallocated space for each device.
The return value will only reflect the final error from device trimming.
This patch will fix such behavior by:
1) Recording the last error from block group or device trimming
The return value will also reflect the last error during trimming.
Make developer more aware of the problem.
2) Continuing trimming if possible
If we failed to trim one block group or device, we could still try
the next block group or device.
3) Report number of failures during block group and device trimming
It would be less noisy, but still gives user a brief summary of
what's going wrong.
Such behavior can avoid confusion for cases like failure to trim the
first block group and then only unallocated space is trimmed.
Reported-by: Chris Murphy <lists@colorremedies.com>
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add bg_ret and dev_ret to the messages ]
Signed-off-by: David Sterba <dsterba@suse.com>
When we fail to start a transaction in btrfs_dev_replace_start, we leave
dev_replace->replace_start set to STARTED but clear ->srcdev and
->tgtdev. Later, that can result in an Oops in
btrfs_dev_replace_progress when having state set to STARTED or SUSPENDED
implies that ->srcdev is valid.
Also fix error handling when the state is already STARTED or SUSPENDED
while starting. That, too, will clear ->srcdev and ->tgtdev even though
it doesn't own them. This should be an impossible case to hit since we
should be protected by the BTRFS_FS_EXCL_OP bit being set. Let's add an
ASSERT there while we're at it.
Fixes: e93c89c1aa (Btrfs: add new sources for device replace code)
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
remove_extent_mapping uses the variable "ret" for return value, but it
is not modified after initialzation. Further, I find that any of the
callers do not handle the return value and the callees are only simple
functions so the return values does not need to be passed.
Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In btrfs_search_old_slot get_old_root is always used with the assumption
it cannot fail. However, this is not true in rare circumstance it can
fail and return null. This will lead to null point dereference when the
header is read. Fix this by checking the return value and properly
handling NULL by setting ret to -EIO and returning gracefully.
Coverity-id: 1087503
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In btrfs_orphan_cleanup the final 'if (ret) goto out' cannot ever be
executed. This is due to the last assignment to 'ret' depending on the
return value of btrfs_iget. If an error other than -ENOENT is returned
then the loop is prematurely terminated by 'goto out'. On the other
hand, if the error value is ENOENT then a subsequent if branch is
executed that always re-assigns 'ret' and in case it's an error just
terminates the loop. No functional changes.
Coverity-id: 1437392
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When we delete an inode,
btrfs_evict_inode() {
truncate_inode_pages_final()
truncate_inode_pages_range()
lock_page()
truncate_cleanup_page()
btrfs_invalidatepage()
wait_on_page_writeback
btrfs_lookup_ordered_range()
cancel_dirty_page()
unlock_page()
...
btrfs_wait_ordered_range()
...
As VFS has called ->invalidatepage() to get all ordered extents done (if
there are any) and truncated all page cache pages (no dirty pages to
writeback after this step), wait_ordered_range() is just a noop.
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
As long as @eb is marked with EXTENT_BUFFER_DIRTY, all of its pages
are dirty, so no need to set pages dirty again.
Ftrace showed that the loop took 10us on my dev box, so removing this
can save us at least 10us if eb is already dirty and otherwise avoid a
potentially expensive calls to set_page_dirty.
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Just in case that someone breaks the rule that pages are dirty as long
as eb is dirty. The next patch will dirty the pages conditionally.
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In the callchain:
btrfs_search_slot()
if (level != 0)
setup_nodes_for_search()
balance_level()
It is just impossible to have level=0 in balance_level, we can drop the
check.
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Unify the error handling of directory item lookups using IS_ERR_OR_NULL.
No functional changes.
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
kcalloc is defined as:
kcalloc(size_t n, size_t size, gfp_t flags)
Although this won't cause problems in practice, btrfsic_read_block()
uses kcalloc with n and size in the opposite order.
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of returning an error value and using one of the parameters for
returning the actual object we are interested in just refactor the
function to directly return btrfs_device *. Also bubble up the error
handling for the special BTRFS_ERROR_DEV_MISSING_NOT_FOUND value into
btrfs_rm_device. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function returns a numeric error value and additionally the
device found in one of its input parameters. Simplify this by making
the function directly return a pointer to btrfs_device. Additionally
adjust the caller to handle the case when we want to remove the
'missing' device and ENOENT is returned to return the expected
positive error value, parsed by progs. Finally, unexport the function
since it's not called outside of volume.c. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently this function returns an error code as well as uses one of
its arguments as a return value for struct btrfs_device. Change the
function so that it returns btrfs_device directly and use the usual
"encode error in pointer" mechanics if something goes wrong. No
functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since commit d7df2c796d ("Btrfs attach delayed ref updates to delayed
ref heads"), check_delayed_ref() won't return -ENOENT.
In btrfs_cross_ref_exist(), two variables 'ret' and 'ret2' are
originally used to handle -ENOENT error case. Since the code is not
needed anymore, let's just remove 'ret2'.
Signed-off-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Unless it's going to read inline extents from btree leaf to page,
btrfs_get_extent won't sleep during the period of holding path lock.
This sets leave_spinning at first and sets path to blocking mode right
before reading inline extent if that's the case. The benefit is that a
path in spinning mode typically has lower impact (faster) on waiters
rather than that in the blocking mode.
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This fixes btrfs_get_extent to be consistent with our existing
declaration style.
Note: For the record, indentation styles that are accepted are both,
aligning under the opening ( and tab or double tab indentation on the
next line. Preferrably not spliting the type or long expressions in the
argument lists.
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
[ add note ]
Signed-off-by: David Sterba <dsterba@suse.com>
Pointer 'tree' is being assigned but is never used hence it is redundant
and can be removed. This is a leftover from cleanup patch
00032d38ea ("btrfs: drop extent_io_ops::merge_bio_hook
callback").
Cleans up clang warning:
warning: variable 'tree' set but not used [-Wunused-but-set-variable]
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Pointer inode is being assigned but is never used hence it is redundant
and can be removed. It's been unused since the introduction in
38c227d87c ("Btrfs: snapshot-aware defrag").
Cleans up clang warning:
variable ‘inode’ set but not used [-Wunused-but-set-variable]
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since commit 8b62f87bad ("Btrfs: rework outstanding_extents"),
manual operations of outstanding_extent in btrfs_inode are replaced by
btrfs_mod_outstanding_extents().
The one in cluster_pages_for_defrag seems to be lost, so replace it
of btrfs_mod_outstanding_extents().
Fixes: 8b62f87bad ("Btrfs: rework outstanding_extents")
Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Here we're not releasing any space, but transferring bytes from
->bytes_may_use to ->bytes_reserved. The last change to the code in
commit 18513091af ("btrfs: update btrfs_space_info's
bytes_may_use timely") removed a conditional tracepoint and the logic
changed too but the tracepiont remained.
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
trace_btrfs_get_extent() has nothing to do with path, place
btrfs_free_path ahead so that we can unlock path on error.
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
As next_state() is already defined to get the next state, use it in
find_first_extent_bit. No functional changes.
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
For certain crafted image, whose csum root leaf has missing backref, if
we try to trigger write with data csum, it could cause deadlock with the
following kernel WARN_ON():
WARNING: CPU: 1 PID: 41 at fs/btrfs/locking.c:230 btrfs_tree_lock+0x3e2/0x400
CPU: 1 PID: 41 Comm: kworker/u4:1 Not tainted 4.18.0-rc1+ #8
Workqueue: btrfs-endio-write btrfs_endio_write_helper
RIP: 0010:btrfs_tree_lock+0x3e2/0x400
Call Trace:
btrfs_alloc_tree_block+0x39f/0x770
__btrfs_cow_block+0x285/0x9e0
btrfs_cow_block+0x191/0x2e0
btrfs_search_slot+0x492/0x1160
btrfs_lookup_csum+0xec/0x280
btrfs_csum_file_blocks+0x2be/0xa60
add_pending_csums+0xaf/0xf0
btrfs_finish_ordered_io+0x74b/0xc90
finish_ordered_fn+0x15/0x20
normal_work_helper+0xf6/0x500
btrfs_endio_write_helper+0x12/0x20
process_one_work+0x302/0x770
worker_thread+0x81/0x6d0
kthread+0x180/0x1d0
ret_from_fork+0x35/0x40
[CAUSE]
That crafted image has missing backref for csum tree root leaf. And
when we try to allocate new tree block, since there is no
EXTENT/METADATA_ITEM for csum tree root, btrfs consider it's free slot
and use it.
The extent tree of the image looks like:
Normal image | This fuzzed image
----------------------------------+--------------------------------
BG 29360128 | BG 29360128
One empty slot | One empty slot
29364224: backref to UUID tree | 29364224: backref to UUID tree
Two empty slots | Two empty slots
29376512: backref to CSUM tree | One empty slot (bad type) <<<
29380608: backref to D_RELOC tree | 29380608: backref to D_RELOC tree
... | ...
Since bytenr 29376512 has no METADATA/EXTENT_ITEM, when btrfs try to
alloc tree block, it's an valid slot for btrfs.
And for finish_ordered_write, when we need to insert csum, we try to CoW
csum tree root.
By accident, empty slots at bytenr BG_OFFSET, BG_OFFSET + 8K,
BG_OFFSET + 12K is already used by tree block COW for other trees, the
next empty slot is BG_OFFSET + 16K, which should be the backref for CSUM
tree.
But due to the bad type, btrfs can recognize it and still consider it as
an empty slot, and will try to use it for csum tree CoW.
Then in the following call trace, we will try to lock the new tree
block, which turns out to be the old csum tree root which is already
locked:
btrfs_search_slot() called on csum tree root, which is at 29376512
|- btrfs_cow_block()
|- btrfs_set_lock_block()
| |- Now locks tree block 29376512 (old csum tree root)
|- __btrfs_cow_block()
|- btrfs_alloc_tree_block()
|- btrfs_reserve_extent()
| Now it returns tree block 29376512, which extent tree
| shows its empty slot, but it's already hold by csum tree
|- btrfs_init_new_buffer()
|- btrfs_tree_lock()
| Triggers WARN_ON(eb->lock_owner == current->pid)
|- wait_event()
Wait lock owner to release the lock, but it's
locked by ourself, so it will deadlock
[FIX]
This patch will do the lock_owner and current->pid check at
btrfs_init_new_buffer().
So above deadlock can be avoided.
Since such problem can only happen in crafted image, we will still
trigger kernel warning for later aborted transaction, but with a little
more meaningful warning message.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=200405
Reported-by: Xu Wen <wen.xu@gatech.edu>
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
When mounting certain crafted image, btrfs will trigger kernel BUG_ON()
when trying to recover balance:
kernel BUG at fs/btrfs/extent-tree.c:8956!
invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
CPU: 1 PID: 662 Comm: mount Not tainted 4.18.0-rc1-custom+ #10
RIP: 0010:walk_up_proc+0x336/0x480 [btrfs]
RSP: 0018:ffffb53540c9b890 EFLAGS: 00010202
Call Trace:
walk_up_tree+0x172/0x1f0 [btrfs]
btrfs_drop_snapshot+0x3a4/0x830 [btrfs]
merge_reloc_roots+0xe1/0x1d0 [btrfs]
btrfs_recover_relocation+0x3ea/0x420 [btrfs]
open_ctree+0x1af3/0x1dd0 [btrfs]
btrfs_mount_root+0x66b/0x740 [btrfs]
mount_fs+0x3b/0x16a
vfs_kern_mount.part.9+0x54/0x140
btrfs_mount+0x16d/0x890 [btrfs]
mount_fs+0x3b/0x16a
vfs_kern_mount.part.9+0x54/0x140
do_mount+0x1fd/0xda0
ksys_mount+0xba/0xd0
__x64_sys_mount+0x21/0x30
do_syscall_64+0x60/0x210
entry_SYSCALL_64_after_hwframe+0x49/0xbe
[CAUSE]
Extent tree corruption. In this particular case, reloc tree root's
owner is DATA_RELOC_TREE (should be TREE_RELOC), thus its backref is
corrupted and we failed the owner check in walk_up_tree().
[FIX]
It's pretty hard to take care of every extent tree corruption, but at
least we can remove such BUG_ON() and exit more gracefully.
And since in this particular image, DATA_RELOC_TREE and TREE_RELOC share
the same root (which is obviously invalid), we needs to make
__del_reloc_root() more robust to detect such invalid sharing to avoid
possible NULL dereference as root->node can be NULL in this case.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=200411
Reported-by: Xu Wen <wen.xu@gatech.edu>
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_pin_log_trans defines the variable "ret" for return value, but it
is not modified after initialization. Further, I find that none of the
callers do handles the return value, so it is safe to drop the unneeded
"ret" and make it return void.
Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_free_reserved_bytes uses the variable "ret" for return value,
but it is not modified after initialzation. Further, I find that any of
the callers do not handle the return value, so it is safe to drop the
unneeded "ret" and return void. There are no callees that would need the
function to handle or pass the value either.
Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
@path is always NULL when it comes to the if branch.
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
In the following case, rescan won't zero out the number of qgroup 1/0:
$ mkfs.btrfs -fq $DEV
$ mount $DEV /mnt
$ btrfs quota enable /mnt
$ btrfs qgroup create 1/0 /mnt
$ btrfs sub create /mnt/sub
$ btrfs qgroup assign 0/257 1/0 /mnt
$ dd if=/dev/urandom of=/mnt/sub/file bs=1k count=1000
$ btrfs sub snap /mnt/sub /mnt/snap
$ btrfs quota rescan -w /mnt
$ btrfs qgroup show -pcre /mnt
qgroupid rfer excl max_rfer max_excl parent child
-------- ---- ---- -------- -------- ------ -----
0/5 16.00KiB 16.00KiB none none --- ---
0/257 1016.00KiB 16.00KiB none none 1/0 ---
0/258 1016.00KiB 16.00KiB none none --- ---
1/0 1016.00KiB 16.00KiB none none --- 0/257
So far so good, but:
$ btrfs qgroup remove 0/257 1/0 /mnt
WARNING: quotas may be inconsistent, rescan needed
$ btrfs quota rescan -w /mnt
$ btrfs qgroup show -pcre /mnt
qgoupid rfer excl max_rfer max_excl parent child
-------- ---- ---- -------- -------- ------ -----
0/5 16.00KiB 16.00KiB none none --- ---
0/257 1016.00KiB 16.00KiB none none --- ---
0/258 1016.00KiB 16.00KiB none none --- ---
1/0 1016.00KiB 16.00KiB none none --- ---
^^^^^^^^^^ ^^^^^^^^ not cleared
[CAUSE]
Before rescan we call qgroup_rescan_zero_tracking() to zero out all
qgroups' accounting numbers.
However we don't mark all qgroups dirty, but rely on rescan to do so.
If we have any high level qgroup without children, it won't be marked
dirty during rescan, since we cannot reach that qgroup.
This will cause QGROUP_INFO items of childless qgroups never get updated
in the quota tree, thus their numbers will stay the same in "btrfs
qgroup show" output.
[FIX]
Just mark all qgroups dirty in qgroup_rescan_zero_tracking(), so even if
we have childless qgroups, their QGROUP_INFO items will still get
updated during rescan.
Reported-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
Tested-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
struct scrub_ctx has an ->is_dev_replace member, so there's no point in
passing around is_dev_replace where sctx is available.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When the replace is running the fs_devices::num_devices also includes
the replaced device, however in some operations like device delete and
balance it needs the actual num_devices without the repalced devices.
The function btrfs_num_devices() just provides that.
And here is a scenario how balance and repalce items could co-exist:
Consider balance is started and paused, now start the replace followed
by a unmount or system power-cycle. During following mount, the
open_ctree() first restarts the balance so it must check for the device
replace otherwise our num_devices calculation will be wrong.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
In preparation to add helper function to deduce the num_devices with
replace running, use assert instead of BUG_ON or WARN_ON. The number of
devices would not normally drop to 0 due to other checks so the assert
is sufficient.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog, adjust the assert condition ]
Signed-off-by: David Sterba <dsterba@suse.com>
Kfree has taken the NULL pointer into account. So remove the check
before kfree.
The issue is detected with the help of Coccinelle.
Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
As we're going to return right after the call, it's not necessary to get
update the new write_lock_level from unlock_up.
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There are two members in struct btrfs_root which indicate root's
objectid: objectid and root_key.objectid.
They are both set to the same value in __setup_root():
static void __setup_root(struct btrfs_root *root,
struct btrfs_fs_info *fs_info,
u64 objectid)
{
...
root->objectid = objectid;
...
root->root_key.objectid = objecitd;
...
}
and not changed to other value after initialization.
grep in btrfs directory shows both are used in many places:
$ grep -rI "root->root_key.objectid" | wc -l
133
$ grep -rI "root->objectid" | wc -l
55
(4.17, inc. some noise)
It is confusing to have two similar variable names and it seems
that there is no rule about which should be used in a certain case.
Since ->root_key itself is needed for tree reloc tree, let's remove
'objecitd' member and unify code to use ->root_key.objectid in all places.
Signed-off-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since ret must be 0 here, don't have to return. No functional change
and code readability is not hurt.
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All callers pass the root tree of dir, we can push that down to the
function itself.
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Using true and false here is closer to the expected semantic than using
0 and 1. No functional change.
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Only when send_in_progress, we have to do something different such as
btrfs_warn() and return -EPERM. Therefore, we could check
send_in_progress first and process error handling, after the
root_item_lock has been got.
Just for better readability. No functional change.
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
FUSE file reads are cached in the page cache, but symlink reads are
not. This patch enables FUSE READLINK operations to be cached which
can improve performance of some FUSE workloads.
In particular, I'm working on a FUSE filesystem for access to source
code and discovered that about a 10% improvement to build times is
achieved with this patch (there are a lot of symlinks in the source
tree).
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
After sending a synchronous READ request from __fuse_direct_read() we only
need to invalidate atime; none of the other attributes should be changed by
a read().
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
If 'auto_inval_data' mode is active, then fuse_file_read_iter() will call
fuse_update_attributes(), which will check the attribute validity and send
a GETATTR request if some of the attributes are no longer valid. The page
cache is then invalidated if the size or mtime have changed.
Then, if a READ request was sent and reply received (which is the case if
the data wasn't cached yet, or if the file is opened for O_DIRECT), the
atime attribute is invalidated.
This will result in the next read() also triggering a GETATTR, ...
This can be fixed by only sending GETATTR if the mode or size are invalid,
we don't need to do a refresh if only atime is invalid.
More generally, none of the callers of fuse_update_attributes() need an
up-to-date atime value, so for now just remove STATX_ATIME from the request
mask when attributes are updated for internal use.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
This patch adds the infrastructure for more fine grained attribute
invalidation. Currently only 'atime' is invalidated separately.
The use of this infrastructure is extended to the statx(2) interface, which
for now means that if only 'atime' is invalid and STATX_ATIME is not
specified in the mask argument, then no GETATTR request will be generated.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
The recent patch to fix the afs_server struct leak didn't actually fix the
bug, but rather fixed some of the symptoms. The problem is that an
asynchronous call that holds a resource pointed to by call->reply[0] will
find the pointer cleared in the call destructor, thereby preventing the
resource from being cleaned up.
In the case of the server record leak, the afs_fs_get_capabilities()
function in devel code sets up a call with reply[0] pointing at the server
record that should be altered when the result is obtained, but this was
being cleared before the destructor was called, so the put in the
destructor does nothing and the record is leaked.
Commit f014ffb025 removed the additional ref obtained by
afs_install_server(), but the removal of this ref is actually used by the
garbage collector to mark a server record as being defunct after the record
has expired through lack of use.
The offending clearance of call->reply[0] upon completion in
afs_process_async_call() has been there from the origin of the code, but
none of the asynchronous calls actually use that pointer currently, so it
should be safe to remove (note that synchronous calls don't involve this
function).
Fix this by the following means:
(1) Revert commit f014ffb025.
(2) Remove the clearance of reply[0] from afs_process_async_call().
Without this, afs_manage_servers() will suffer an assertion failure if it
sees a server record that didn't get used because the usage count is not 1.
Fixes: f014ffb025 ("afs: Fix afs_server struct leak")
Fixes: 08e0e7c82e ("[AF_RXRPC]: Make the in-kernel AFS filesystem use AF_RXRPC.")
Signed-off-by: David Howells <dhowells@redhat.com>
Cc: stable <stable@vger.kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* Fix a livelock in dax_layout_busy_page() present since v4.18. The
lockup triggers when truncating an actively mapped huge page out of a
mapping pinned for direct-I/O.
* Fix mprotect() clobbers of _PAGE_DEVMAP. Broken since v4.5 mprotect()
clears this flag that is needed to communicate the liveness of device
pages to the get_user_pages() path.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJbwiZhAAoJEB7SkWpmfYgCYFoQAL8ED6c1bfGUPRsWSrTRChU0
ungVZ/Vf1+2ERd3ivUXPQzahNtqH5EWvEVp0aboVpyJUoVllrztInVS2hxaGJE+e
w7WnzaXh36MY0kvLpK+Ny1Cxk7qg2rXnmzOAPRVdSUoSvh0TXOn5HFX1i/OdI7WK
wgJwXraCoyKP9aTItw7oHQy9S36bi1RJVUakOAoEpEx4Vn+fwFxLNIt34G5CRJ+k
iflicM7CPngxlFzwfoiX9v3DhV7toexk1A4LAzzwypG0Aiqd5tW2FG1lwLMPncNk
8FezBm9VjkMwzv6hj7nD9UfU2lbh3GqqGDW0cPX1DPSgDxr/4pOLtKcbYWHh6yas
NtCXk37q90ey3GtD2wYBRkBNly6UWvHJ0d3srtO6ZSl1VN6JQu8rhVhQ6KnON24B
NcWlEVf2brqf0uaW4byCVbdVfIDp96/qgEvCo1pq3olXwCdDyOBJjYxaBcnu5JDV
YsItMCJ49AxS/qoCt3vam7vC5TGhfYHL5xJPaF06cdjYvgfqOIV3VQT1ujBx4cvh
MBFRBKDc6oDiJFgkrdYqHwJfn5fCQVS180Oy5S0AFGsVAzsJalKBZBLx2f2RQn8c
r+kczvvPjpczEeEqzaqsxTgjowo/75Q8PRXc2PbwQzNxfkHuKf+xxQpnUg0mN6Hf
w8zPSaCcCs2Wo21Kd/ua
=VXnU
-----END PGP SIGNATURE-----
Merge tag 'libnvdimm-fixes-4.19-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Dan writes:
"libnvdimm/dax 4.19-rc8
* Fix a livelock in dax_layout_busy_page() present since v4.18. The
lockup triggers when truncating an actively mapped huge page out of
a mapping pinned for direct-I/O.
* Fix mprotect() clobbers of _PAGE_DEVMAP. Broken since v4.5
mprotect() clears this flag that is needed to communicate the
liveness of device pages to the get_user_pages() path."
* tag 'libnvdimm-fixes-4.19-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
mm: Preserve _PAGE_DEVMAP across mprotect() calls
filesystem-dax: Fix dax_layout_busy_page() livelock
ubifs_assert() is not WARN_ON(), so we have to invert
the checks.
Randy faced this warning with UBIFS being a module, since
most users use UBIFS as builtin because UBIFS is the rootfs
nobody noticed so far. :-(
Including me.
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Fixes: 54169ddd38 ("ubifs: Turn two ubifs_assert() into a WARN_ON()")
Signed-off-by: Richard Weinberger <richard@nod.at>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
On non-preempt kernels this loop can take a long time (more than 50 ticks)
processing through entries.
Link: http://lkml.kernel.org/r/20181010172623.57033-1-khazhy@google.com
Signed-off-by: Khazhismel Kumykov <khazhy@google.com>
Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Fix the following compile warning:
fs/ocfs2/dlmglue.c:99:30: warning: lockdep_keys defined but not used [-Wunused-variable]
static struct lock_class_key lockdep_keys[OCFS2_NUM_LOCK_TYPES];
Link: http://lkml.kernel.org/r/1536938148-32110-1-git-send-email-zhongjiang@huawei.com
Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
TIOCLINUX is handled by ->compat_ioctl() in the only place that has
native ->ioctl() recognizing it, TIOC{START,STOP} are simply useless
these days - unrecognized compat ioctl won't spew into syslog
anymore.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Conflicts were easy to resolve using immediate context mostly,
except the cls_u32.c one where I simply too the entire HEAD
chunk.
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix a leak of afs_server structs. The routine that installs them in the
various lookup lists and trees gets a ref on leaving the function, whether
it added the server or a server already exists. It shouldn't increment
the refcount if it added the server.
The effect of this that "rmmod kafs" will hang waiting for the leaked
server to become unused.
Fixes: d2ddc776a4 ("afs: Overhaul volume and server record caching and fileserver rotation")
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
It turns out that the fix in commit 6636c3cc56 is bad; the assertion
that the iomap code no longer creates buffer heads is incorrect for
filesystems that set the IOMAP_F_BUFFER_HEAD flag.
Instead, what's happening is that gfs2_iomap_begin_write treats all
files that have the jdata flag set as journaled files, which is
incorrect as long as those files are inline ("stuffed"). We're handling
stuffed files directly via the page cache, which is why we ended up with
pages without buffer heads in gfs2_page_add_databufs.
Fix this by handling stuffed journaled files correctly in
gfs2_iomap_begin_write.
This reverts commit 6636c3cc5690c11631e6366cf9a28fb99c8b25bb.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
It's possible for ext4_show_quota_options() to try reading
s_qf_names[i] while it is being modified by ext4_remount() --- most
notably, in ext4_remount's error path when the original values of the
quota file name gets restored.
Reported-by: syzbot+a2872d6feea6918008a9@syzkaller.appspotmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org # 3.2+
Function rgblk_free can only deal with one resource group at a time, so
pass that resource group is as a parameter. Several of the callers
already have the resource group at hand, so we only need additional
lookup code in a few places.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Steven Whitehouse <swhiteho@redhat.com>
The state parameter of gfs2_rlist_alloc is set to LM_ST_EXCLUSIVE in all
calls, so remove it and hardcode that state in gfs2_rlist_alloc instead.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Steven Whitehouse <swhiteho@redhat.com>
Reservations in gfs can span multiple gfs2_bitmaps (but they won't span
multiple resource groups). When removing a reservation, we want to
clear the GBF_FULL flags of all involved gfs2_bitmaps, not just that of
the first bitmap.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Steven Whitehouse <swhiteho@redhat.com>
This field indicates the size of the bitmap in bytes, similar to how the
bi_blocks field indicates the size of the bitmap in blocks.
In count_unlinked, replace an instance of bi_bytes * GFS2_NBBY by
bi_blocks.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Steven Whitehouse <swhiteho@redhat.com>
This definition is only used to define RGRP_RSRV_MINBLKS, with no
benefit over defining RGRP_RSRV_MINBLKS directly.
In addition, instead of forcing RGRP_RSRV_MINBLKS to be of type u32,
cast it to that type where that type is required.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Steven Whitehouse <swhiteho@redhat.com>
Move the rs_sizehint and rs_rgd_gh fields from struct gfs2_blkreserv
into the inode: they are more closely related to the inode than to a
particular reservation.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Steven Whitehouse <swhiteho@redhat.com>
We already have a function that checks if a block is within a resource
group, so use that in gfs2_rbm_from_block as well.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Steven Whitehouse <swhiteho@redhat.com>
When gfs2_rbm_from_block fails, the rbm it returns is undefined, so we
always want to make sure gfs2_rbm_from_block has succeeded before
looking at the rbm.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Steven Whitehouse <swhiteho@redhat.com>
Access to the list of cells by /proc/net/afs/cells has a couple of
problems:
(1) It should be checking against SEQ_START_TOKEN for the keying the
header line.
(2) It's only holding the RCU read lock, so it can't just walk over the
list without following the proper RCU methods.
Fix these by using an hlist instead of an ordinary list and using the
appropriate accessor functions to follow it with RCU.
Since the code that adds a cell to the list must also necessarily change,
sort the list on insertion whilst we're at it.
Fixes: 989782dcdc ("afs: Overhaul cell database management")
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Update for 4.19-rc7 to fix numerous file clone and deduplication issues.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJbvo1jAAoJEK3oKUf0dfodurwP/3OGvJo11XjE5Wuh5tbcSOUm
duJdJvN+b0gx8Lb1IL2lHaBEt/Pw3PKi+KPx6d+pd47aflCYNMiU7I1kzejxvq9I
TBVdhxffAmNBU02VU/qmhucMnKfpVr/UX19f0sHZcfXbZkpIuHrSpI25MNSYqbEs
bRoMDkgbuu377hCJu6tnBAUm38z4iZdfJaGPVmJmuVMn8JJE8KA3Une/daxg0/HY
zbCMWP3SGJwqx7SyyeurQSkRS/8GG3LgbnYOz0FlaHfBd4JVJscHOZU08nrYmMAN
9SOowvahaPkDcyO8c6gRkyE6NIbOsb79718g+HTm4eqzyChnh+jnFTzitcTMuK2c
vjblXZSDKnN/P4UaEEzIAnQ1Eew4WhLrKuBr+3fCr2bKTGfj0iiX36CjEgk1k+Df
t1DV/Hj6me6crZpWO7GQZVLnDsiJBzaOsIgpYnB3vcJyof/cgm+5bfGFedDZFdpV
XTh0oBN8B7oQztd4MvcfQOGXSktrMtq+6RomO8kv77mVtHFnjnroKHq/wIft2nyI
rs7u7DFmCLyB9Lbm8aoeMPo4+0zxTp9385HSJ+XznHcjGXxkxIIGhhdU3omKxANa
j3UQB1+VRC4lHSluUXOW7vH0j1tX3gaS3z/yJ+gfcAdizUmdp0h0rMx8c9VR8SEJ
xv/fpbrcKHyBwuNvKtBm
=yj1P
-----END PGP SIGNATURE-----
Merge tag 'xfs-fixes-for-4.19-rc7' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Dave writes:
"xfs: fixes for 4.19-rc7
Update for 4.19-rc7 to fix numerous file clone and deduplication issues."
* tag 'xfs-fixes-for-4.19-rc7' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: fix data corruption w/ unaligned reflink ranges
xfs: fix data corruption w/ unaligned dedupe ranges
xfs: update ctime and remove suid before cloning files
xfs: zero posteof blocks when cloning above eof
xfs: refactor clonerange preparation into a separate helper
On 32-bit systems, the buffer allocated by kernel_read_file() is too
small if the file size is > SIZE_MAX, due to truncation to size_t.
Fortunately, since the 'count' argument to kernel_read() is also
truncated to size_t, only the allocated space is filled; then, -EIO is
returned since 'pos != i_size' after the read loop.
But this is not obvious and seems incidental. We should be more
explicit about this case. So, fail early if i_size > SIZE_MAX.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Mimi Zohar <zohar@linux.ibm.com>
Currently accessing various /sys/fs/orangefs files will spam the
kernel log with the following info message when the client is not
running:
[ 491.489284] sysfs_service_op_show: Client not running :-5:
Rate limit this info message to make it less spammy.
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
default_acl and acl of newly created inode will be initiated
as ACL_NOT_CACHED in vfs function inode_init_always() and later
will be updated by calling xxx_init_acl() in specific filesystems.
Howerver, when default_acl and acl are NULL then they keep the value
of ACL_NOT_CACHED, this patch tries to cache NULL for acl/default_acl
in this case.
Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
We get lower layer dentries, find their parents, do lock_rename() and
proceed to vfs_rename(). However, we do not check that dentries still
have the same parents and are not unlinked. Need to check that...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Commit 64bc06bb32 broke buffered writes to journaled files (chattr
+j): we'll try to journal the buffer heads of the page being written to
in gfs2_iomap_journaled_page_done. However, the iomap code no longer
creates buffer heads, so we'll BUG() in gfs2_page_add_databufs. Fix
that by creating buffer heads ourself when needed.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Add support for the GETFSLABEL ioctl in gfs2.
I tested this patch and it works as expected.
Signed-off-by: Steve Whitehouse <swhiteho@redhat.com>
Tested-by: Abhi Das <adas@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Flushing the workqueue can cause operations to happen which might
call gfs2_log_reserve(), or get stuck waiting for locks taken by such
operations. gfs2_log_reserve() can io_schedule(). If this happens, it
will never wake because the only thing which can wake it is gfs2_logd()
which was already stopped.
This causes umount of a gfs2 filesystem to wedge permanently if, for
example, the umount immediately follows a large delete operation.
When this occured, the following stack trace was obtained from the
umount command
[<ffffffff81087968>] flush_workqueue+0x1c8/0x520
[<ffffffffa0666e29>] gfs2_make_fs_ro+0x69/0x160 [gfs2]
[<ffffffffa0667279>] gfs2_put_super+0xa9/0x1c0 [gfs2]
[<ffffffff811b7edf>] generic_shutdown_super+0x6f/0x100
[<ffffffff811b7ff7>] kill_block_super+0x27/0x70
[<ffffffffa0656a71>] gfs2_kill_sb+0x71/0x80 [gfs2]
[<ffffffff811b792b>] deactivate_locked_super+0x3b/0x70
[<ffffffff811b79b9>] deactivate_super+0x59/0x60
[<ffffffff811d2998>] cleanup_mnt+0x58/0x80
[<ffffffff811d2a12>] __cleanup_mnt+0x12/0x20
[<ffffffff8108c87d>] task_work_run+0x7d/0xa0
[<ffffffff8106d7d9>] exit_to_usermode_loop+0x73/0x98
[<ffffffff81003961>] syscall_return_slowpath+0x41/0x50
[<ffffffff815a594c>] int_ret_from_sys_call+0x25/0x8f
[<ffffffffffffffff>] 0xffffffffffffffff
Signed-off-by: Tim Smith <tim.smith@citrix.com>
Signed-off-by: Mark Syms <mark.syms@citrix.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Lianbo reported a build error with a particular 32-bit config, see Link
below for details.
Provide a weak copy_oldmem_page_encrypted() function which architectures
can override, in the same manner other functionality in that file is
supplied.
Reported-by: Lianbo Jiang <lijiang@redhat.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
CC: x86@kernel.org
Link: http://lkml.kernel.org/r/710b9d95-2f70-eadf-c4a1-c3dc80ee4ebb@redhat.com
In the presence of multi-order entries the typical
pagevec_lookup_entries() pattern may loop forever:
while (index < end && pagevec_lookup_entries(&pvec, mapping, index,
min(end - index, (pgoff_t)PAGEVEC_SIZE),
indices)) {
...
for (i = 0; i < pagevec_count(&pvec); i++) {
index = indices[i];
...
}
index++; /* BUG */
}
The loop updates 'index' for each index found and then increments to the
next possible page to continue the lookup. However, if the last entry in
the pagevec is multi-order then the next possible page index is more
than 1 page away. Fix this locally for the filesystem-dax case by
checking for dax-multi-order entries. Going forward new users of
multi-order entries need to be similarly careful, or we need a generic
way to report the page increment in the radix iterator.
Fixes: 5fac7408d8 ("mm, fs, dax: handle layout changes to pinned dax...")
Cc: <stable@vger.kernel.org>
Cc: Ross Zwisler <zwisler@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
When alloc_percpu() fails, sdp gets freed but sb->s_fs_info still points
to the same address. Move the assignment after that error check so that
s_fs_info can only point to a valid sdp or NULL, which is checked for
later in the error path, in gfs2_kill_super().
Reported-by: syzbot+dcb8b3587445007f5808@syzkaller.appspotmail.com
Signed-off-by: Andrew Price <anprice@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
In order to identify which thread triggered the event in a
multi-threaded program, add the FAN_REPORT_TID flag in fanotify_init to
opt-in for reporting the event creator's thread id information.
Signed-off-by: nixiaoming <nixiaoming@huawei.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
default_acl and acl of newly created inode will be initiated as
ACL_NOT_CACHED in vfs function inode_init_always() and later will be
updated by calling xxx_init_acl() in specific filesystems. However,
when default_acl and acl are NULL then they keep the value of
ACL_NOT_CACHED. This patch changes the code to cache NULL for acl /
default_acl in this case to save unnecessary ACL lookup attempt.
Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
When reflinking sub-file ranges, a data corruption can occur when
the source file range includes a partial EOF block. This shares the
unknown data beyond EOF into the second file at a position inside
EOF, exposing stale data in the second file.
XFS only supports whole block sharing, but we still need to
support whole file reflink correctly. Hence if the reflink
request includes the last block of the souce file, only proceed with
the reflink operation if it lands at or past the destination file's
current EOF. If it lands within the destination file EOF, reject the
entire request with -EINVAL and make the caller go the hard way.
This avoids the data corruption vector, but also avoids disruption
of returning EINVAL to userspace for the common case of whole file
cloning.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
A deduplication data corruption is Exposed by fstests generic/505 on
XFS. It is caused by extending the block match range to include the
partial EOF block, but then allowing unknown data beyond EOF to be
considered a "match" to data in the destination file because the
comparison is only made to the end of the source file. This corrupts
the destination file when the source extent is shared with it.
XFS only supports whole block dedupe, but we still need to appear to
support whole file dedupe correctly. Hence if the dedupe request
includes the last block of the souce file, don't include it in the
actual XFS dedupe operation. If the rest of the range dedupes
successfully, then report the partial last block as deduped, too, so
that userspace sees it as a successful dedupe rather than return
EINVAL because we can't dedupe unaligned blocks.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
In dlm_init_lockres() we access and modify res->tracking and
dlm->tracking_list without holding dlm->track_lock. This can cause list
corruptions and can end up in kernel panic.
Fix this by locking res->tracking and dlm->tracking_list with
dlm->track_lock instead of dlm->spinlock.
Link: http://lkml.kernel.org/r/1529951192-4686-1-git-send-email-ashish.samant@oracle.com
Signed-off-by: Ashish Samant <ashish.samant@oracle.com>
Reviewed-by: Changwei Ge <ge.changwei@h3c.com>
Acked-by: Joseph Qi <jiangqi903@gmail.com>
Acked-by: Jun Piao <piaojun@huawei.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <ge.changwei@h3c.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Currently, you can use /proc/self/task/*/stack to cause a stack walk on
a task you control while it is running on another CPU. That means that
the stack can change under the stack walker. The stack walker does
have guards against going completely off the rails and into random
kernel memory, but it can interpret random data from your kernel stack
as instruction pointers and stack pointers. This can cause exposure of
kernel stack contents to userspace.
Restrict the ability to inspect kernel stacks of arbitrary tasks to root
in order to prevent a local attacker from exploiting racy stack unwinding
to leak kernel task stack contents. See the added comment for a longer
rationale.
There don't seem to be any users of this userspace API that can't
gracefully bail out if reading from the file fails. Therefore, I believe
that this change is unlikely to break things. In the case that this patch
does end up needing a revert, the next-best solution might be to fake a
single-entry stack based on wchan.
Link: http://lkml.kernel.org/r/20180927153316.200286-1-jannh@google.com
Fixes: 2ec220e27f ("proc: add /proc/*/stack")
Signed-off-by: Jann Horn <jannh@google.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Ken Chen <kenchen@google.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
ocfs2_duplicate_clusters_by_page() may crash if one of the extent's pages
is dirty. When a page has not been written back, it is still in dirty
state. If ocfs2_duplicate_clusters_by_page() is called against the dirty
page, the crash happens.
To fix this bug, we can just unlock the page and wait until the page until
its not dirty.
The following is the backtrace:
kernel BUG at /root/code/ocfs2/refcounttree.c:2961!
[exception RIP: ocfs2_duplicate_clusters_by_page+822]
__ocfs2_move_extent+0x80/0x450 [ocfs2]
? __ocfs2_claim_clusters+0x130/0x250 [ocfs2]
ocfs2_defrag_extent+0x5b8/0x5e0 [ocfs2]
__ocfs2_move_extents_range+0x2a4/0x470 [ocfs2]
ocfs2_move_extents+0x180/0x3b0 [ocfs2]
? ocfs2_wait_for_recovery+0x13/0x70 [ocfs2]
ocfs2_ioctl_move_extents+0x133/0x2d0 [ocfs2]
ocfs2_ioctl+0x253/0x640 [ocfs2]
do_vfs_ioctl+0x90/0x5f0
SyS_ioctl+0x74/0x80
do_syscall_64+0x74/0x140
entry_SYSCALL_64_after_hwframe+0x3d/0xa2
Once we find the page is dirty, we do not wait until it's clean, rather we
use write_one_page() to write it back
Link: http://lkml.kernel.org/r/20180829074740.9438-1-lchen@suse.com
[lchen@suse.com: update comments]
Link: http://lkml.kernel.org/r/20180830075041.14879-1-lchen@suse.com
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Larry Chen <lchen@suse.com>
Acked-by: Changwei Ge <ge.changwei@h3c.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The code cleaning transaction's lists of checkpoint buffers has a bug
where it increases bh refcount only after releasing
journal->j_list_lock. Thus the following race is possible:
CPU0 CPU1
jbd2_log_do_checkpoint()
jbd2_journal_try_to_free_buffers()
__journal_try_to_free_buffer(bh)
...
while (transaction->t_checkpoint_io_list)
...
if (buffer_locked(bh)) {
<-- IO completes now, buffer gets unlocked -->
spin_unlock(&journal->j_list_lock);
spin_lock(&journal->j_list_lock);
__jbd2_journal_remove_checkpoint(jh);
spin_unlock(&journal->j_list_lock);
try_to_free_buffers(page);
get_bh(bh) <-- accesses freed bh
Fix the problem by grabbing bh reference before unlocking
journal->j_list_lock.
Fixes: dc6e8d669c ("jbd2: don't call get_bh() before calling __jbd2_journal_remove_checkpoint()")
Fixes: be1158cc61 ("jbd2: fold __process_buffer() into jbd2_log_do_checkpoint()")
Reported-by: syzbot+7f4a27091759e2fe7453@syzkaller.appspotmail.com
CC: stable@vger.kernel.org
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
the victim might've been rmdir'ed just before the lock_rename();
unlike the normal callers, we do not look the source up after the
parents are locked - we know it beforehand and just recheck that it's
still the child of what used to be its parent. Unfortunately,
the check is too weak - we don't spot a dead directory since its
->d_parent is unchanged, dentry is positive, etc. So we sail all
the way to ->rename(), with hosting filesystems _not_ expecting
to be asked renaming an rmdir'ed subdirectory.
The fix is easy, fortunately - the lock on parent is sufficient for
making IS_DEADDIR() on child safe.
Cc: stable@vger.kernel.org
Fixes: 9ae326a690 (CacheFiles: A cache that backs onto a mounted filesystem)
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Before this patch, various errors and messages were reported using
the pr_* functions: pr_err, pr_warn, pr_info, etc., but that does
not tell you which gfs2 mount had the problem, which is often vital
to debugging. This patch changes the calls from pr_* to fs_* in
most of the messages so that the file system id is printed along
with the message.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
When an io error is hit, it calls gfs2_io_error_bh_i for every
journal buffer it can't write. Since we changed gfs2_io_error_bh_i
recently to withdraw later in the cycle, it sends a flood of
errors to the console. This patch checks for the file system already
being withdrawn, and if so, doesn't send more messages. It doesn't
stop the flood of messages, but it slows it down and keeps it more
reasonable.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAlu2jLMACgkQiiy9cAdy
T1GLMQwAiUaUTtR0MN2ws+IHNnwFX5YnWFa1Cr/6Tgz9WBo5TrrvQcBGhTIUQok2
7w2+I0rXPO7P7Tq2odWn66BT1/l0o2U10aB4OHYlS7zqfCAp6GuwKGtATivRqjsg
ENn2mmBRuTmz+XtGmlVklmQpHjn4+MTR2GSLA57oo1fqX5YBeSxcq2UEciCGttRQ
nAvn923NrPSbGRvL/eDDWGXas8gMATXnrNlN8st2AsxnoFC6CrzUuRM0s/IFgp6r
Orh0jIlovlT2Q6LdND7n7xXnEVoyUTETG89IjRVwcBJ9LtXrJXpuWo7/sOr11cMy
sGeUkILNlmkIdLtIlUaZcGaclq/yfeKdMxziBT8tWP6wXIH4LrGCzvZ9jDtrzygU
EWL2nLsqMJNh0+wsYSZM/m9VFC1ReeB1AgIC2EHu/xzqWoTeL40i2OhKi/732FEX
H+FlTZUoorhFoI96AMJMrqXjXyqZUZSQ1b5iJ7/jq888Vdey+m7HiMUtxv0+5yzC
+lQksh8X
=4PHD
-----END PGP SIGNATURE-----
Merge tag '4.19-rc6-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6
Steve writes:
"SMB3 fixes
four small SMB3 fixes: one for stable, the others to address a more
recent regression"
* tag '4.19-rc6-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
smb3: fix lease break problem introduced by compounding
cifs: only wake the thread for the very last PDU in a compound
cifs: add a warning if we try to to dequeue a deleted mid
smb2: fix missing files in root share directory listing
Running "./nfstest_delegation --runtest recall26" uncovers that
client doesn't recover the lock when we have an appending open,
where the initial open got a write delegation.
Instead of checking for the passed in open context against
the file lock's open context. Check that the state is the same.
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Before cloning into a file, update the ctime and remove sensitive
attributes like suid, just like we'd do for a regular file write.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
When we're reflinking between two files and the destination file range
is well beyond the destination file's EOF marker, zero any posteof
speculative preallocations in the destination file so that we don't
expose stale disk contents. The previous strategy of trying to clear
the preallocations does not work if the destination file has the
PREALLOC flag set.
Uncovered by shared/010.
Reported-by: Zorro Lang <zlang@redhat.com>
Bugzilla-id: https://bugzilla.kernel.org/show_bug.cgi?id=201259
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Refactor all the reflink preparation steps into a separate helper
that we'll use to land all the upcoming fixes for insufficient input
checks.
This rework also moves the invalidation of the destination range to
the prep function so that it is done before the range is remapped.
This ensures that nobody can access the data in range being remapped
until the remap is complete.
[dgc: fix xfs_reflink_remap_prep() return value and caller check to
handle vfs_clone_file_prep_inodes() returning 0 to mean "nothing to
do". ]
[dgc: make sure length changed by vfs_clone_file_prep_inodes() gets
propagated back to XFS code that does the remapping. ]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCW7YOCQAKCRDh3BK/laaZ
PAmGAQC1/3+mkfHXqyDx+zv4f7A348ZULW26EWFWwMwlEljWowD+PlRFBgUu5v5c
194mCApZU7hl6o0u07aijjz6m81e6QE=
=dM89
-----END PGP SIGNATURE-----
Merge tag 'ovl-fixes-4.19-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs
Miklos writes:
"overlayfs fixes for 4.19-rc7
This update fixes a couple of regressions in the stacked file update
added in this cycle, as well as some older bugs uncovered by
syzkaller.
There's also one trivial naming change that touches other parts of
the fs subsystem."
* tag 'ovl-fixes-4.19-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
ovl: fix format of setxattr debug
ovl: fix access beyond unterminated strings
ovl: make symbol 'ovl_aops' static
vfs: swap names of {do,vfs}_clone_file_range()
ovl: fix freeze protection bypass in ovl_clone_file_range()
ovl: fix freeze protection bypass in ovl_write_iter()
ovl: fix memory leak on unlink of indexed file
Accumlated regression and bug fixes for 4.19-rc6, including:
o make iomap correctly mark dirty pages for sub-page block sizes
o fix regression in handling extent-to-btree format conversion errors
o fix torn log wrap detection for new logs
o various corrupt inode detection fixes
o various delalloc state fixes
o cleanup all the missed transaction cancel cases missed from changes merged
in 4.19-rc1
o fix lockdep false positive on transaction allocation
o fix locking and reference counting on buffer log items
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJbtVyHAAoJEK3oKUf0dfodc8YP/1hT+TdXZDBVPo9kXQwmrnra
8P1J8IEuGj851PwYQobUhifdDh4eQ+PpcYEIfqTSGhtjTg7cqyQOvTo4uUKHLn50
8mFXQb6JrAFYnjGjorOCde+lpoQrtj3oKASscs7RaL/W1RNffY74UGdiE0yOJnJs
9IC7TpJT4ZzAgYBgC61whfYWmszLmRuKlVZWgXrM8hkMqliHdfrXeZk7MXTeiflz
Q5+9eVnCCgTtC0TWgVAkFMvlZs7UtNMWIIn6zt+HQLB7Vms6q4ArpIT+fF45njwG
94tyTcFT0AcIJ8OIi5i85fOXcDGjTFFf554Yf80PlWGlk3SbXPHFQao/ItDA4S1Z
eCl2eurUOXlNXKuyzWoV5KjOBiiBicbOwl6eX0K996cGehhEdFOvBihvAGjJy5rm
c4plTTy6bhYKZWr3JLuaPDlNWDMr/P/aUkiq9tFXyaMmVKNeyxd6UKzIokl5uUfi
Ycik4Ik8zRgpFEUx5Clvb2W5qH9pqpR0WcFhrcj0mDnJa10TpBWesA0g2F+hM0Jc
0mSuGxfQeJk7tuVcvsBpPNzgVEPKabvnYbOlGK7HvCSOPRkg0Fhmj+iJK721ccVl
Ltiz0Y485eAGLKn9kOWKr47A5GAAFpuK29OrD2a8XizHPZ3Sr2CG5X1jI0gtMb8n
BiBAxO6dojRQATDuTI6w
=5ylA
-----END PGP SIGNATURE-----
Merge tag 'xfs-fixes-for-4.19-rc6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Dave writes:
"XFS fixes for 4.19-rc6
Accumlated regression and bug fixes for 4.19-rc6, including:
o make iomap correctly mark dirty pages for sub-page block sizes
o fix regression in handling extent-to-btree format conversion errors
o fix torn log wrap detection for new logs
o various corrupt inode detection fixes
o various delalloc state fixes
o cleanup all the missed transaction cancel cases missed from changes merged
in 4.19-rc1
o fix lockdep false positive on transaction allocation
o fix locking and reference counting on buffer log items"
* tag 'xfs-fixes-for-4.19-rc6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: fix error handling in xfs_bmap_extents_to_btree
iomap: set page dirty after partial delalloc on mkwrite
xfs: remove invalid log recovery first/last cycle check
xfs: validate inode di_forkoff
xfs: skip delalloc COW blocks in xfs_reflink_end_cow
xfs: don't treat unknown di_flags2 as corruption in scrub
xfs: remove duplicated include from alloc.c
xfs: don't bring in extents in xfs_bmap_punch_delalloc_range
xfs: fix transaction leak in xfs_reflink_allocate_cow()
xfs: avoid lockdep false positives in xfs_trans_alloc
xfs: refactor xfs_buf_log_item reference count handling
xfs: clean up xfs_trans_brelse()
xfs: don't unlock invalidated buf on aborted tx commit
xfs: remove last of unnecessary xfs_defer_cancel() callers
xfs: don't crash the vfs on a garbage inline symlink
Format has a typo: it was meant to be "%.*s", not "%*s". But at some point
callers grew nonprintable values as well, so use "%*pE" instead with a
maximized length.
Reported-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: 3a1e819b4e ("ovl: store file handle of lower inode on copy up")
Cc: <stable@vger.kernel.org> # v4.12
Also define the FANOTIFY_EVENT_FLAGS consisting of the extra flags
FAN_ONDIR and FAN_ON_CHILD.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
The BUG_ON() statements to verify number of bits in ALL_FSNOTIFY_BITS
and ALL_INOTIFY_BITS are converted to build time check of the constant.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
We do not want to add new bits to the FAN_ALL_* uapi constants
because they have been exposed to userspace. If there are programs
out there using these constants, those programs could break if
re-compiled with modified FAN_ALL_* constants and run on an old kernel.
We deprecate the uapi constants FAN_ALL_* and define new FANOTIFY_*
constants for internal use to replace them. New feature bits will be
added only to the new constants.
Cc: <linux-api@vger.kernel.org>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
fanotify mark add/remove code jumps through hoops to avoid setting the
FS_ISDIR in the commulative object mask.
That was just papering over a bug in fsnotify() handling of the FS_ISDIR
extra flag. This bug is now fixed, so all the hoops can be removed along
with the unneeded internal flag FAN_MARK_ONDIR.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
FS_EVENT_ON_CHILD gets a special treatment in fsnotify() because it is
not a flag specifying an event type, but rather an extra flags that may
be reported along with another event and control the handling of the
event by the backend.
FS_ISDIR is also an "extra flag" and not an "event type" and therefore
desrves the same treatment. With inotify/dnotify backends it was never
possible to set FS_ISDIR in mark masks, so it did not matter.
With fanotify backend, mark adding code jumps through hoops to avoid
setting the FS_ISDIR in the commulative object mask.
Separate the constant ALL_FSNOTIFY_EVENTS to ALL_FSNOTIFY_FLAGS and
ALL_FSNOTIFY_EVENTS, so the latter can be used to test for specific
event types.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
AF_RXRPC opens an IPv6 socket through which to send and receive network
packets, both IPv6 and IPv4. It currently turns AF_INET addresses into
AF_INET-as-AF_INET6 addresses based on an assumption that this was
necessary; on further inspection of the code, however, it turns out that
the IPv6 code just farms packets aimed at AF_INET addresses out to the IPv4
code.
Fix AF_RXRPC to use AF_INET addresses directly when given them.
Fixes: 7b674e390e ("rxrpc: Fix IPv6 support")
Signed-off-by: David Howells <dhowells@redhat.com>
Sort address lists so that they are in logical ascending order rather than
being partially in ascending order of the BE representations of those
values.
Signed-off-by: David Howells <dhowells@redhat.com>
Make the address list string parser use the helper functions for adding
addresses to an address list so that they end up appropriately sorted.
This will better handles overruns and make them easier to compare.
It also reduces the number of places that addresses are handled, making it
easier to fix the handling.
Signed-off-by: David Howells <dhowells@redhat.com>
Note the maximum allocated capacity in an afs_addr_list struct and discard
addresses that would exceed it in afs_merge_fs_addr{4,6}().
Also, since the current maximum capacity is less than 255, reduce the
relevant members to bytes.
Signed-off-by: David Howells <dhowells@redhat.com>
We return most failure of dquota_initialize() except
inode evict, this could make a bit sense, for example
we allow file removal even quota files are broken?
But it dosen't make sense to allow setting project
if quota files etc are broken.
Signed-off-by: Wang Shilong <wshilong@ddn.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Linus recently observed that if we did not worry about the padding
member in struct siginfo it is only about 48 bytes, and 48 bytes is
much nicer than 128 bytes for allocating on the stack and copying
around in the kernel.
The obvious thing of only adding the padding when userspace is
including siginfo.h won't work as there are sigframe definitions in
the kernel that embed struct siginfo.
So split siginfo in two; kernel_siginfo and siginfo. Keeping the
traditional name for the userspace definition. While the version that
is used internally to the kernel and ultimately will not be padded to
128 bytes is called kernel_siginfo.
The definition of struct kernel_siginfo I have put in include/signal_types.h
A set of buildtime checks has been added to verify the two structures have
the same field offsets.
To make it easy to verify the change kernel_siginfo retains the same
size as siginfo. The reduction in size comes in a following change.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Currently, project quota could be changed by fssetxattr
ioctl, and existed permission check inode_owner_or_capable()
is obviously not enough, just think that common users could
change project id of file, that could make users to
break project quota easily.
This patch try to follow same regular of xfs project
quota:
"Project Quota ID state is only allowed to change from
within the init namespace. Enforce that restriction only
if we are trying to change the quota ID state.
Everything else is allowed in user namespaces."
Besides that, check and set project id'state should
be an atomic operation, protect whole operation with
inode lock, ext4_ioctl_setproject() is only used for
ioctl EXT4_IOC_FSSETXATTR, we have held mnt_want_write_file()
before ext4_ioctl_setflags(), and ext4_ioctl_setproject()
is called after ext4_ioctl_setflags(), we could share
codes, so remove it inside ext4_ioctl_setproject().
Signed-off-by: Wang Shilong <wshilong@ddn.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Cc: stable@kernel.org
Al writes:
"xattrs regression fix from Andreas; sat in -next for quite a while."
* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
sysfs: Do not return POSIX ACL xattrs via listxattr
Return type of ext4_page_mkwrite and ext4_filemap_fault are
changed to use vm_fault_t type.
With this patch all the callers of block_page_mkwrite_return()
are changed to handle vm_fault_t. So converting the return type
of block_page_mkwrite_return() to vm_fault_t.
Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Matthew Wilcox <willy@infradead.org>
Variable retries is not initialized in ext4_da_write_inline_data_begin()
which can lead to nondeterministic number of retries in case we hit
ENOSPC. Initialize retries to zero as we do everywhere else.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Fixes: bc0ca9df3b ("ext4: retry allocation when inline->extent conversion failed")
Cc: stable@kernel.org
When running fault injection test, I hit somewhat wrong behavior in f2fs_gc ->
gc_data_segment():
0. fault injection generated some PageError'ed pages
1. gc_data_segment
-> f2fs_get_read_data_page(REQ_RAHEAD)
2. move_data_page
-> f2fs_get_lock_data_page()
-> f2f_get_read_data_page()
-> f2fs_submit_page_read()
-> submit_bio(READ)
-> return EIO due to PageError
-> fail to move data
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Fixes problem (discovered by Aurelien) introduced by recent commit:
commit b24df3e30c
("cifs: update receive_encrypted_standard to handle compounded responses")
which broke the ability to respond to some lease breaks
(lease breaks being ignored is a problem since can block
server response for duration of the lease break timeout).
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
For compounded PDUs we whould only wake the waiting thread for the
very last PDU of the compound.
We do this so that we are guaranteed that the demultiplex_thread will
not process or access any of those MIDs any more once the send/recv
thread starts processing.
Else there is a race where at the end of the send/recv processing we
will try to delete all the mids of the compound. If the multiplex
thread still has other mids to process at this point for this compound
this can lead to an oops.
Needed to fix recent commit:
commit 730928c8f4
("cifs: update smb2_queryfs() to use compounding")
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
cifs_delete_mid() is called once we are finished handling a mid and we
expect no more work done on this mid.
Needed to fix recent commit:
commit 730928c8f4
("cifs: update smb2_queryfs() to use compounding")
Add a warning if someone tries to dequeue a mid that has already been
flagged to be deleted.
Also change list_del() to list_del_init() so that if we have similar bugs
resurface in the future we will not oops.
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
When mounting a Windows share that is the root of a drive (eg. C$)
the server does not return . and .. directory entries. This results in
the smb2 code path erroneously skipping the 2 first entries.
Pseudo-code of the readdir() code path:
cifs_readdir(struct file, struct dir_context)
initiate_cifs_search <-- if no reponse cached yet
server->ops->query_dir_first
dir_emit_dots
dir_emit <-- adds "." and ".." if we're at pos=0
find_cifs_entry
initiate_cifs_search <-- if pos < start of current response
(restart search)
server->ops->query_dir_next <-- if pos > end of current response
(fetch next search res)
for(...) <-- loops over cur response entries
starting at pos
cifs_filldir <-- skip . and .., emit entry
cifs_fill_dirent
dir_emit
pos++
A) dir_emit_dots() always adds . & ..
and sets the current dir pos to 2 (0 and 1 are done).
Therefore we always want the index_to_find to be 2 regardless of if
the response has . and ..
B) smb1 code initializes index_of_last_entry with a +2 offset
in cifssmb.c CIFSFindFirst():
psrch_inf->index_of_last_entry = 2 /* skip . and .. */ +
psrch_inf->entries_in_buffer;
Later in find_cifs_entry() we want to find the next dir entry at pos=2
as a result of (A)
first_entry_in_buffer = cfile->srch_inf.index_of_last_entry -
cfile->srch_inf.entries_in_buffer;
This var is the dir pos that the first entry in the buffer will
have therefore it must be 2 in the first call.
If we don't offset index_of_last_entry by 2 (like in (B)),
first_entry_in_buffer=0 but we were instructed to get pos=2 so this
code in find_cifs_entry() skips the 2 first which is ok for non-root
shares, as it skips . and .. from the response but is not ok for root
shares where the 2 first are actual files
pos_in_buf = index_to_find - first_entry_in_buffer;
// pos_in_buf=2
// we skip 2 first response entries :(
for (i = 0; (i < (pos_in_buf)) && (cur_ent != NULL); i++) {
/* go entry by entry figuring out which is first */
cur_ent = nxt_dir_entry(cur_ent, end_of_smb,
cfile->srch_inf.info_level);
}
C) cifs_filldir() skips . and .. so we can safely ignore them for now.
Sample program:
int main(int argc, char **argv)
{
const char *path = argc >= 2 ? argv[1] : ".";
DIR *dh;
struct dirent *de;
printf("listing path <%s>\n", path);
dh = opendir(path);
if (!dh) {
printf("opendir error %d\n", errno);
return 1;
}
while (1) {
de = readdir(dh);
if (!de) {
if (errno) {
printf("readdir error %d\n", errno);
return 1;
}
printf("end of listing\n");
break;
}
printf("off=%lu <%s>\n", de->d_off, de->d_name);
}
return 0;
}
Before the fix with SMB1 on root shares:
<.> off=1
<..> off=2
<$Recycle.Bin> off=3
<bootmgr> off=4
and on non-root shares:
<.> off=1
<..> off=4 <-- after adding .., the offsets jumps to +2 because
<2536> off=5 we skipped . and .. from response buffer (C)
<411> off=6 but still incremented pos
<file> off=7
<fsx> off=8
Therefore the fix for smb2 is to mimic smb1 behaviour and offset the
index_of_last_entry by 2.
Test results comparing smb1 and smb2 before/after the fix on root
share, non-root shares and on large directories (ie. multi-response
dir listing):
PRE FIX
=======
pre-1-root VS pre-2-root:
ERR pre-2-root is missing [bootmgr, $Recycle.Bin]
pre-1-nonroot VS pre-2-nonroot:
OK~ same files, same order, different offsets
pre-1-nonroot-large VS pre-2-nonroot-large:
OK~ same files, same order, different offsets
POST FIX
========
post-1-root VS post-2-root:
OK same files, same order, same offsets
post-1-nonroot VS post-2-nonroot:
OK same files, same order, same offsets
post-1-nonroot-large VS post-2-nonroot-large:
OK same files, same order, same offsets
REGRESSION?
===========
pre-1-root VS post-1-root:
OK same files, same order, same offsets
pre-1-nonroot VS post-1-nonroot:
OK same files, same order, same offsets
BugLink: https://bugzilla.samba.org/show_bug.cgi?id=13107
Signed-off-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Paulo Alcantara <palcantara@suse.deR>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
CC: Stable <stable@vger.kernel.org>
The code EXT4_IOC_SWAP_BOOT ioctl hasn't been updated in a while, and
it's a bit broken with respect to more modern ext4 kernels, especially
metadata checksums.
Other problems fixed with this commit:
* Don't allow installing a DAX, swap file, or an encrypted file as a
boot loader.
* Respect the immutable and append-only flags.
* Wait until any DIO operations are finished *before* calling
truncate_inode_pages().
* Don't swap inode->i_flags, since these flags have nothing to do with
the inode blocks --- and it will give the IMA/audit code heartburn
when the inode is evicted.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Reported-by: syzbot+e81ccd4744c6c4f71354@syzkaller.appspotmail.com
Enabling DX_DEBUG triggers the build error below. info is an attribute
of the dxroot structure.
linux/fs/ext4/namei.c:2264:12: error: ‘info’
undeclared (first use in this function); did you mean ‘insl’?
info->indirect_levels));
Fixes: e08ac99fa2 ("ext4: add largedir feature")
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.co.uk>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
If the starting block number of either the source or destination file
exceeds the EOF, EXT4_IOC_MOVE_EXT should return EINVAL.
Also fixed the helper function mext_check_coverage() so that if the
logical block is beyond EOF, make it return immediately, instead of
looping until the block number wraps all the away around. This takes
long enough that if there are multiple threads trying to do pound on
an the same inode doing non-sensical things, it can end up triggering
the kernel's soft lockup detector.
Reported-by: syzbot+c61979f6f2cba5cb3c06@syzkaller.appspotmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Add new code to count canceled pending cluster reservations on bigalloc
file systems and to reduce the cluster reservation count on all file
systems using delayed allocation. This replaces old code in
ext4_da_page_release_reservations that was incorrect.
Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Modify ext4_ext_remove_space() and the code it calls to correct the
reserved cluster count for pending reservations (delayed allocated
clusters shared with allocated blocks) when a block range is removed
from the extent tree. Pending reservations may be found for the clusters
at the ends of written or unwritten extents when a block range is removed.
If a physical cluster at the end of an extent is freed, it's necessary
to increment the reserved cluster count to maintain correct accounting
if the corresponding logical cluster is shared with at least one
delayed and unwritten extent as found in the extents status tree.
Add a new function, ext4_rereserve_cluster(), to reapply a reservation
on a delayed allocated cluster sharing blocks with a freed allocated
cluster. To avoid ENOSPC on reservation, a flag is applied to
ext4_free_blocks() to briefly defer updating the freeclusters counter
when an allocated cluster is freed. This prevents another thread
from allocating the freed block before the reservation can be reapplied.
Redefine the partial cluster object as a struct to carry more state
information and to clarify the code using it.
Adjust the conditional code structure in ext4_ext_remove_space to
reduce the indentation level in the main body of the code to improve
readability.
Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Ext4 does not always reduce the reserved cluster count by the number
of clusters allocated when mapping a delayed extent. It sometimes
adds back one or more clusters after allocation if delalloc blocks
adjacent to the range allocated by ext4_ext_map_blocks() share the
clusters newly allocated for that range. However, this overcounts
the number of clusters needed to satisfy future mapping requests
(holding one or more reservations for clusters that have already been
allocated) and premature ENOSPC and quota failures, etc., result.
Ext4 also does not reduce the reserved cluster count when allocating
clusters for non-delayed allocated writes that have previously been
reserved for delayed writes. This also results in overcounts.
To make it possible to handle reserved cluster accounting for
fallocated regions in the same manner as used for other non-delayed
writes, do the reserved cluster accounting for them at the time of
allocation. In the current code, this is only done later when a
delayed extent sharing the fallocated region is finally mapped.
Address comment correcting handling of unsigned long long constant
from Jan Kara's review of RFC version of this patch.
Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
The code in ext4_da_map_blocks sometimes reserves space for more
delayed allocated clusters than it should, resulting in premature
ENOSPC, exceeded quota, and inaccurate free space reporting.
Fix this by checking for written and unwritten blocks shared in the
same cluster with the newly delayed allocated block. A cluster
reservation should not be made for a cluster for which physical space
has already been allocated.
Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Add new pending reservation mechanism to help manage reserved cluster
accounting. Its primary function is to avoid the need to read extents
from the disk when invalidating pages as a result of a truncate, punch
hole, or collapse range operation.
Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Ext4 contains a few functions that are used to search for delayed
extents or blocks in the extents status tree. Rather than duplicate
code to add new functions to search for extents with different status
values, such as written or a combination of delayed and unwritten,
generalize the existing code to search for caller-specified extents
status values. Also, move this code into extents_status.c where it
is better associated with the data structures it operates upon, and
where it can be more readily used to implement new extents status tree
functions that might want a broader scope for i_es_lock.
Three missing static specifiers in RFC version of patch reported and
fixed by Fengguang Wu <fengguang.wu@intel.com>.
Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAluw4MIACgkQONu9yGCS
aT7+8xAAiYnc4khUsxeInm3z44WPfRX1+UF51frTNSY5C8Nn5nvRSnTUNLuKkkrz
8RbwCL6UYyJxF9I/oZdHPsPOD4IxXkQY55tBjz7ZbSBIFEwYM6RJMm8mAGlXY7wq
VyWA5MhlpGHM9DjrguB4DMRipnrSc06CVAnC+ZyKLjzblzU1Wdf2dYu+AW9pUVXP
j4r74lFED5djPY1xfqfzEwmYRCeEGYGx7zMqT3GrrF5uFPqj1H6O5klEsAhIZvdl
IWnJTU2coC8R/Sd17g4lHWPIeQNnMUGIUbu+PhIrZ/lDwFxlocg4BvarPXEdzgYi
gdZzKBfovpEsSu5RCQsKWG4IGQxY7I1p70IOP9eqEFHZy77qT1YcHVAWrK1Y/bJd
UA08gUOSzRnhKkNR3+PsaMflUOl9WkpyHECZu394cyRGMutSS50aWkavJPJ/o1Qi
D/oGqZLLcKFyuNcchG+Met1TzY3LvYEDgSburqwqeUZWtAsGs8kmiiq7qvmXx4zV
IcgM8ERqJ8mbfhfsXQU7hwydIrPJ3JdIq19RnM5ajbv2Q4C/qJCyAKkQoacrlKR4
aiow/qvyNrP80rpXfPJB8/8PiWeDtAnnGhM+xySZNlw3t8GR6NYpUkIzf5TdkSb3
C8KuKg6FY9QAS62fv+5KK3LB/wbQanxaPNruQFGe5K1iDQ5Fvzw=
=dMl4
-----END PGP SIGNATURE-----
Merge tag 'v4.19-rc6' into for-4.20/block
Merge -rc6 in, for two reasons:
1) Resolve a trivial conflict in the blk-mq-tag.c documentation
2) A few important regression fixes went into upstream directly, so
they aren't in the 4.20 branch.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
* tag 'v4.19-rc6': (780 commits)
Linux 4.19-rc6
MAINTAINERS: fix reference to moved drivers/{misc => auxdisplay}/panel.c
cpufreq: qcom-kryo: Fix section annotations
perf/core: Add sanity check to deal with pinned event failure
xen/blkfront: correct purging of persistent grants
Revert "xen/blkfront: When purging persistent grants, keep them in the buffer"
selftests/powerpc: Fix Makefiles for headers_install change
blk-mq: I/O and timer unplugs are inverted in blktrace
dax: Fix deadlock in dax_lock_mapping_entry()
x86/boot: Fix kexec booting failure in the SEV bit detection code
bcache: add separate workqueue for journal_write to avoid deadlock
drm/amd/display: Fix Edid emulation for linux
drm/amd/display: Fix Vega10 lightup on S3 resume
drm/amdgpu: Fix vce work queue was not cancelled when suspend
Revert "drm/panel: Add device_link from panel device to DRM device"
xen/blkfront: When purging persistent grants, keep them in the buffer
clocksource/drivers/timer-atmel-pit: Properly handle error cases
block: fix deadline elevator drain for zoned block devices
ACPI / hotplug / PCI: Don't scan for non-hotplug bridges if slot is not bridge
drm/syncobj: Don't leak fences when WAIT_FOR_SUBMIT is set
...
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Writeback caching currently allocates requests with the maximum number of
possible pages, while the actual number of pages per request depends on a
couple of factors that cannot be determined when the request is allocated
(whether page is already under writeback, whether page is contiguous with
previous pages already added to a request).
This patch allows such requests to start with no page allocation (all pages
inline) and grow the page array on demand.
If the max_pages tunable remains the default value, then this will mean
just one allocation that is the same size as before. If the tunable is
larger, then this adds at most 3 additional memory allocations (which is
generously compensated by the improved performance from the larger
request).
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Replace FUSE_MAX_PAGES_PER_REQ with the configurable parameter max_pages to
improve performance.
Old RFC with detailed description of the problem and many fixes by Mitsuo
Hayasaka (mitsuo.hayasaka.hu@hitachi.com):
- https://lkml.org/lkml/2012/7/5/136
We've encountered performance degradation and fixed it on a big and complex
virtual environment.
Environment to reproduce degradation and improvement:
1. Add lag to user mode FUSE
Add nanosleep(&(struct timespec){ 0, 1000 }, NULL); to xmp_write_buf in
passthrough_fh.c
2. patch UM fuse with configurable max_pages parameter. The patch will be
provided latter.
3. run test script and perform test on tmpfs
fuse_test()
{
cd /tmp
mkdir -p fusemnt
passthrough_fh -o max_pages=$1 /tmp/fusemnt
grep fuse /proc/self/mounts
dd conv=fdatasync oflag=dsync if=/dev/zero of=fusemnt/tmp/tmp \
count=1K bs=1M 2>&1 | grep -v records
rm fusemnt/tmp/tmp
killall passthrough_fh
}
Test results:
passthrough_fh /tmp/fusemnt fuse.passthrough_fh \
rw,nosuid,nodev,relatime,user_id=0,group_id=0 0 0
1073741824 bytes (1.1 GB) copied, 1.73867 s, 618 MB/s
passthrough_fh /tmp/fusemnt fuse.passthrough_fh \
rw,nosuid,nodev,relatime,user_id=0,group_id=0,max_pages=256 0 0
1073741824 bytes (1.1 GB) copied, 1.15643 s, 928 MB/s
Obviously with bigger lag the difference between 'before' and 'after'
will be more significant.
Mitsuo Hayasaka, in 2012 (https://lkml.org/lkml/2012/7/5/136),
observed improvement from 400-550 to 520-740.
Signed-off-by: Constantine Shulyupin <const@MakeLinux.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
When allocating page array for a request the array for the page pointers
and the array for page descriptors are allocated by two separate kmalloc()
calls. Merge these into one allocation.
Also instead of initializing the request and the page arrays with memset(),
use the zeroing allocation variants.
Reserved requests never carry pages (page array size is zero). Make that
explicit by initializing the page array pointers to NULL and make sure the
assumption remains true by adding a WARN_ON().
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Do this by grouping fields used for cached writes and putting them into a
union with fileds used for cached readdir (with obviously no overlap, since
we don't have hybrid objects).
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Use the internal iversion counter to make sure modifications of the
directory through this filesystem are not missed by the mtime check (due to
mtime granularity).
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Store the modification time of the directory in the cache, obtained before
starting to fill the cache.
When reading the cache, verify that the directory hasn't changed, by
checking if current modification time is the same as the one stored in the
cache.
This only needs to be done when the current file position is at the
beginning of the directory, as mandated by POSIX.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Allow the cache to be invalidated when page(s) have gone missing. In this
case increment the version of the cache and reset to an empty state.
Add a version number to the directory stream in struct fuse_file as well,
indicating the version of the cache it's supposed to be reading. If the
cache version doesn't match the stream's version, then reset the stream to
the beginning of the cache.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
The cache is only used if it's completed, not while it's still being
filled; this constraint could be lifted later, if it turns out to be
useful.
Introduce state in struct fuse_file that indicates the position within the
cache. After a seek, reset the position to the beginning of the cache and
search the cache for the current position. If the current position is not
found in the cache, then fall back to uncached readdir.
It can also happen that page(s) disappear from the cache, in which case we
must also fall back to uncached readdir.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
This patch just adds the cache filling functions, which are invoked if
FOPEN_CACHE_DIR flag is set in the OPENDIR reply.
Cache reading and cache invalidation are added by subsequent patches.
The directory cache uses the page cache. Directory entries are packed into
a page in the same format as in the READDIR reply. A page only contains
whole entries, the space at the end of the page is cleared. The page is
locked while being modified.
Multiple parallel readdirs on the same directory can fill the cache; the
only constraint is that continuity must be maintained (d_off of last entry
points to position of current entry).
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Normally, DIO uses in-pllace-update, but in LFS mode, f2fs doesn't
allow triggering any in-place-update writes, so we fallback direct
write to buffered write, result in bad performance of large size
write.
This patch adds to support triggering out-place-update for direct IO
to enhance its performance.
Note that it needs to exclude direct read IO during direct write,
since new data writing to new block address will no be valid until
write finished.
storage: zram
time xfs_io -f -d /mnt/f2fs/file -c "pwrite 0 1073741824" -c "fsync"
Before:
real 0m13.061s
user 0m0.327s
sys 0m12.486s
After:
real 0m6.448s
user 0m0.228s
sys 0m6.212s
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Thread A Thread B
- f2fs_vm_page_mkwrite
- f2fs_setattr
- down_write(i_mmap_sem)
- truncate_setsize
- f2fs_truncate
- up_write(i_mmap_sem)
- f2fs_reserve_block
reserve NEW_ADDR
- skip dirty page due to truncation
1. we don't need to rserve new block address for a truncated page.
2. dn.data_blkaddr is used out of node page lock coverage.
Refactor ->page_mkwrite() flow to fix above issues:
- use __do_map_lock() to avoid racing checkpoint()
- lock data page in prior to dnode page
- cover f2fs_reserve_block with i_mmap_sem lock
- wait page writeback before zeroing page
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
There is one case that we can leave bio in f2fs, result in hanging
page writeback waiter.
Thread A Thread B
- f2fs_write_cache_pages
- f2fs_submit_page_write
page #0 cached in bio #0 of cold log
- f2fs_submit_page_write
page #1 cached in bio #1 of warm log
- f2fs_write_cache_pages
- f2fs_submit_page_write
bio is full, submit bio #1 contain page #1
- f2fs_submit_merged_write_cond(, page #1)
fail to submit bio #0 due to page #1 is not in any cached bios.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Now we support crc32 checksum for superblock.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Junling Zheng <zhengjunling@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch adds to account skip count of background GC, and show stat
info via 'status' debugfs entry.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch supports to account meta IO, it enables to show write IO
from f2fs more comprehensively via 'status' debugfs entry.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Commit 01239d77b9 ("xfs: fix a null pointer dereference in
xfs_bmap_extents_to_btree") attempted to fix a null pointer
dreference when a fuzzing corruption of some kind was found.
This fix was flawed, resulting in assert failures like:
XFS: Assertion failed: ifp->if_broot == NULL, file: fs/xfs/libxfs/xfs_bmap.c, line: 715
.....
Call Trace:
xfs_bmap_extents_to_btree+0x6b9/0x7b0
__xfs_bunmapi+0xae7/0xf00
? xfs_log_reserve+0x1c8/0x290
xfs_reflink_remap_extent+0x20b/0x620
xfs_reflink_remap_blocks+0x7e/0x290
xfs_reflink_remap_range+0x311/0x530
vfs_dedupe_file_range_one+0xd7/0xe0
vfs_dedupe_file_range+0x15b/0x1a0
do_vfs_ioctl+0x267/0x6c0
The problem is that the error handling code now asserts that the
inode fork is not in btree format before the error handling code
undoes the modifications that put the fork back in extent format.
Fix this by moving the assert back to after the xfs_iroot_realloc()
call that returns the fork to extent format, and clean up the jump
labels to be meaningful.
Also, returning ENOSPC when xfs_btree_get_bufl() fails to
instantiate the buffer that was allocated (the actual fix in the
commit mentioned above) is incorrect. This is a fatal error - only
an invalid block address or a filesystem shutdown can result in
failing to get a buffer here.
Hence change this to EFSCORRUPTED so that the higher layer knows
this was a corruption related failure and should not treat it as an
ENOSPC error. This should result in a shutdown (via cancelling a
dirty transaction) which is necessary as we do not attempt to clean
up the (invalid) block that we have already allocated.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
If we're revalidating an existing dentry in order to open a file, we need
to ensure that we check the directory has not changed before we optimise
away the lookup.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Refactor the code in nfs_lookup_revalidate() as a stepping stone towards
optimising and fixing nfs4_lookup_revalidate().
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
We need to ensure that inode and dentry revalidation occurs correctly
on reopen of a file that is already open. Currently, we can end up
not revalidating either in the case of NFSv4.0, due to the 'cached open'
path.
Let's fix that by ensuring that we only do cached open for the special
cases of open recovery and delegation return.
Reported-by: Stan Hu <stanhu@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Now that each struct nfs_pgio_header corresponds to one RPC call, we
only have one writer to the struct nfs_pgio_header.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
When the server fails to return post-op attributes, the client's
attempt to place read data directly in the page cache fails, and
so we have to do an extra copy in order to realign the data with
page borders.
This patch attempts to detect servers that don't return post-op
attributes on read (e.g. for pNFS) and adjusts the placement
calculation accordingly.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
The convention in the rest of the code is to have a separate function
for anything that might be ifdef-ed out.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
This is to match kernel coding style for switch statements.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Most places in the kernel tend to line up cases with the switch to
reduce indentation, so move this over to match that style.
Additionally, I handle the (status >= 0) case in the switch so that we
only "goto restart" from a single place after error handling.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Moving all of this into a new function removes the need for cramped
indentation, making the code overall easier to look at. I also take
this chance to switch copy recovery over to using
nfs4_stateid_match_other()
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
for tightly coupled DSes client must ignore provided synthetic uid and
gid as stated in draft-ietf-nfsv4-flex-files-19#section-5.1.
Signed-off-by: Tigran Mkrtchyan <tigran.mkrtchyan@desy.de>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
The intention of nfs4_session_set_rwsize() was to cap the r/wsize to the
buffer sizes negotiated by the CREATE_SESSION. The initial code had a
bug whereby we would not check the values negotiated by nfs_probe_fsinfo()
(the assumption being that CREATE_SESSION will always negotiate buffer values
that are sane w.r.t. the server's preferred r/wsizes) but would only check
values set by the user in the 'mount' command.
The code was changed in 4.11 to _always_ set the r/wsize, meaning that we
now never use the server preferred r/wsizes. This is the regression that
this patch fixes.
Also rename the function to nfs4_session_limit_rwsize() in order to avoid
future confusion.
Fixes: 033853325f (NFSv4.1 respect server's max size in CREATE_SESSION")
Cc: stable@vger.kernel.org # v4.11+
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Reduce contention on the inode->i_lock by ensuring that we use RCU
when looking up the NFS open context.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Speed up lookups of an existing lock context by avoiding the inode->i_lock,
and using RCU instead.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
For the 'files' and 'flexfiles' layout types, we do not expect the reply
to be any larger than 4k. The block and scsi layout types are a little more
greedy, so we keep allocating the maximum response size for now.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
If the RPC call relies on the receive call allocating pages as buffers,
then let's label it so that we
a) Don't leak memory by allocating pages for requests that do not expect
this behaviour
b) Can optimise for the common case where calls do not require allocation.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Commit 072ebb3bff ("ext4: add nonstring annotations to ext4.h")
introduced a local definition of __nonstring to suppress some false
positives in gcc 8's -Wstringop-truncation.
Since now we support __nonstring for everyone, remove it.
Tested-by: Sedat Dilek <sedat.dilek@gmail.com> # on top of v4.19-rc5, clang 7
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Reviewed-by: Luc Van Oostenryck <luc.vanoostenryck@gmail.com>
Signed-off-by: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
As reported by nixiaoming, with some minor clarifications:
1) memory leak in ramoops_register_dummy():
dummy_data = kzalloc(sizeof(*dummy_data), GFP_KERNEL);
but no kfree() if platform_device_register_data() fails.
2) memory leak in ramoops_init():
Missing platform_device_unregister(dummy) and kfree(dummy_data)
if platform_driver_register(&ramoops_driver) fails.
I've clarified the purpose of ramoops_register_dummy(), and added a
common cleanup routine for all three failure paths to call.
Reported-by: nixiaoming <nixiaoming@huawei.com>
Cc: stable@vger.kernel.org
Cc: Anton Vorontsov <anton@enomsg.org>
Cc: Colin Cross <ccross@android.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Fix a deadlock in the new for 4.19 dax_lock_mapping_entry() routine.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJbsDBrAAoJEB7SkWpmfYgCi54QAIz8yFZI5+5+amG/L/F9mGe4
sagcSPsk67EzzTDhnKASTlRmpm0+LWzQckY7o/fDRoM0VQVKjXVDke4VTDnHFg7W
JfZMN24dg6Pcbq3CxuSXMOiWd8vXSnLL2Myin+fQ/kY1rxnIz2ZYNWxCQsLvdPiC
VKJAbpYlcG41HZPPnRkMaRBxf2INUraSgyHFoehbgvlwLD7YUOzPh9strauutK5M
xljv2d/yjfaW4U6DhQhUSo+sDYRLGDkbqQw6ZoVqbODA0IXdY6ytiCujLLD9xODg
lDKF68jCX/+lFIURm8BRpX9iqHvfILC5el61a4bTxjJ6XUf+Ok5vgkeZFDfQKziC
rLqm09NTQ5Xu0MJ8Ql+5cqAFqBMA7Uy1zF6l8DnGFCtMV/S0H/TgdXWLzHjRXQvE
18ekLqTcRk5UmPXJYJ829ln0TKTd3zyuVgwuLuGAeO97m431y3K2Q74ncPahgE9+
W0nduPFTmMikohcKah2P3mQWGtUAYWodQsEs+Y9gJPyoDic+fmjo+mI0xg7CeFL4
kpfug45i8hdbnlHrHOJ6bz7fRq7CvaaRaI3gOvFfuN2TJVY8Qfs/8JD4HN8F7u+r
zDPVnvkutaYV1uOOBU4nDzPJ+naVGlpOj1/tsMU4ikj3LbfkfW+gxsr6XGZYPU81
qYjEfXm60ritFoAA5dVV
=3l8a
-----END PGP SIGNATURE-----
Merge tag 'libnvdimm-fixes2-4.19-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Dan writes:
"filesystem-dax for 4.19-rc6
Fix a deadlock in the new for 4.19 dax_lock_mapping_entry() routine."
* tag 'libnvdimm-fixes2-4.19-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
dax: Fix deadlock in dax_lock_mapping_entry()
Introduce xarray value entries and tagged pointers to replace radix
tree exceptional entries. This is a slight change in encoding to allow
the use of an extra bit (we can now store BITS_PER_LONG - 1 bits in a
value entry). It is also a change in emphasis; exceptional entries are
intimidating and different. As the comment explains, you can choose
to store values or pointers in the xarray and they are both first-class
citizens.
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Josef Bacik <jbacik@fb.com>
The iomap page fault mechanism currently dirties the associated page
after the full block range of the page has been allocated. This
leaves the page susceptible to delayed allocations without ever
being set dirty on sub-page block sized filesystems.
For example, consider a page fault on a page with one preexisting
real (non-delalloc) block allocated in the middle of the page. The
first iomap_apply() iteration performs delayed allocation on the
range up to the preexisting block, the next iteration finds the
preexisting block, and the last iteration attempts to perform
delayed allocation on the range after the prexisting block to the
end of the page. If the first allocation succeeds and the final
allocation fails with -ENOSPC, iomap_apply() returns the error and
iomap_page_mkwrite() fails to dirty the page having already
performed partial delayed allocation. This eventually results in the
page being invalidated without ever converting the delayed
allocation to real blocks.
This problem is reliably reproduced by generic/083 on XFS on ppc64
systems (64k page size, 4k block size). It results in leaked
delalloc blocks on inode reclaim, which triggers an assert failure
in xfs_fs_destroy_inode() and filesystem accounting inconsistency.
Move the set_page_dirty() call from iomap_page_mkwrite() to the
actor callback, similar to how the buffer head implementation works.
The actor callback is called iff ->iomap_begin() returns success, so
ensures the page is dirtied as soon as possible after an allocation.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
One of the first steps of log recovery is to check for the special
case of a zeroed log. If the first cycle in the log is zero or the
tail portion of the log is zeroed, the head is set to the first
instance of cycle 0. xlog_find_zeroed() includes a sanity check that
enforces that the first cycle in the log must be 1 if the last cycle
is 0. While this is true in most cases, the check is not totally
valid because it doesn't consider the case where the filesystem
crashed after a partial/out of order log buffer completion that
wraps around the end of the physical log.
For example, consider a filesystem that has completed most of the
first cycle of the log, reaches the end of the physical log and
splits the next single log buffer write into two in order to wrap
around the end of the log. If these I/Os are reordered, the second
(wrapped) I/O completes and the first happens to fail, the log is
left in a state where the last cycle of the log is 0 and the first
cycle is 2. This causes the xlog_find_zeroed() sanity check to fail
and prevents the filesystem from mounting. This situation has been
reproduced on particular systems via repeated runs of generic/475.
This is an expected state that log recovery already knows how to
deal with, however. Since the log is still partially zeroed, the
head is detected correctly and points to a valid tail. The
subsequent stale block detection clears blocks beyond the head up to
the tail (within a maximum range), with the express purpose of
clearing such out of order writes. As expected, this removes the out
of order cycle 2 blocks at the physical start of the log.
In other words, the only thing that prevents a clean mount and
recovery of the filesystem in this scenario is the specific (last ==
0 && first != 1) sanity check in xlog_find_zeroed(). Since the log
head/tail are now independently validated via cycle, log record and
CRC checks, this highly specific first cycle check is of dubious
value. Remove it and rely on the higher level validation to
determine whether log content is sane and recoverable.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Verify the inode di_forkoff, lifted from xfs_repair's
process_check_inode_forkoff().
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
The iomap direct I/O code issues a single ->end_io call for the whole
I/O request, and if some of the extents cowered needed a COW operation
it will call xfs_reflink_end_cow over the whole range.
When we do AIO writes we drop the iolock after doing the initial setup,
but before the I/O completion. Between dropping the lock and completing
the I/O we can have a racing buffered write create new delalloc COW fork
extents in the region covered by the outstanding direct I/O write, and
thus see delalloc COW fork extents in xfs_reflink_end_cow. As
concurrent writes are fundamentally racy and no guarantees are given we
can simply skip those.
This can be easily reproduced with xfstests generic/208 in always_cow
mode.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
xchk_inode_flags2() currently treats any di_flags2 values that the
running kernel doesn't recognize as corruption, and calls
xchk_ino_set_corrupt() if they are set. However, it's entirely possible
that these flags were set in some newer kernel and are quite valid,
but ignored in this kernel.
(Validators don't care one bit about unknown di_flags2.)
Call xchk_ino_set_warning instead, because this may or may not actually
indicate a problem.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Remove duplicated include xfs_alloc.h
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
This function is only used to punch out delayed allocations on I/O
failure, which means we need to have read the extents earlier.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
When xfs_reflink_allocate_cow() allocates a transaction, it drops
the ILOCK to perform the operation. This Introduces a race condition
where another thread modifying the file can perform the COW
allocation operation underneath us. This result in the retry loop
finding an allocated block and jumping straight to the conversion
code. It does not, however, cancel the transaction it holds and so
this gets leaked. This results in a lockdep warning:
================================================
WARNING: lock held when returning to user space!
4.18.5 #1 Not tainted
------------------------------------------------
worker/6123 is leaving the kernel with locks still held!
1 lock held by worker/6123:
#0: 000000009eab4f1b (sb_internal#2){.+.+}, at: xfs_trans_alloc+0x17c/0x220
And eventually the filesystem deadlocks because it runs out of log
space that is reserved by the leaked transaction and never gets
released.
The logic flow in xfs_reflink_allocate_cow() is a convoluted mess of
gotos - it's no surprise that it has bug where the flow through
several goto jumps then fails to clean up context from a non-obvious
logic path. CLean up the logic flow and make sure every path does
the right thing.
Reported-by: Alexander Y. Fomichev <git.user@gmail.com>
Tested-by: Alexander Y. Fomichev <git.user@gmail.com>
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=200981
Signed-off-by: Dave Chinner <dchinner@redhat.com>
[hch: slight refactor]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
We've had a few reports of lockdep tripping over memory reclaim
context vs filesystem freeze "deadlocks". They all have looked
to be false positives on analysis, but it seems that they are
being tripped because we take freeze references before we run
a GFP_KERNEL allocation for the struct xfs_trans.
We can avoid this false positive vector just by re-ordering the
operations in xfs_trans_alloc(). That is. we need allocate the
structure before we take the freeze reference and enter the GFP_NOFS
allocation context that follows the xfs_trans around. This prevents
lockdep from seeing the GFP_KERNEL allocation inside the transaction
context, and that prevents it from triggering the freeze level vs
alloc context vs reclaim warnings.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
The xfs_buf_log_item structure has a reference counter with slightly
tricky semantics. In the common case, a buffer is logged and
committed in a transaction, committed to the on-disk log (added to
the AIL) and then finally written back and removed from the AIL. The
bli refcount covers two potentially overlapping timeframes:
1. the bli is held in an active transaction
2. the bli is pinned by the log
The caveat to this approach is that the reference counter does not
purely dictate the lifetime of the bli. IOW, when a dirty buffer is
physically logged and unpinned, the bli refcount may go to zero as
the log item is inserted into the AIL. Only once the buffer is
written back can the bli finally be freed.
The above semantics means that it is not enough for the various
refcount decrementing contexts to release the bli on decrement to
zero. xfs_trans_brelse(), transaction commit (->iop_unlock()) and
unpin (->iop_unpin()) must all drop the associated reference and
make additional checks to determine if the current context is
responsible for freeing the item.
For example, if a transaction holds but does not dirty a particular
bli, the commit may drop the refcount to zero. If the bli itself is
clean, it is also not AIL resident and must be freed at this time.
The same is true for xfs_trans_brelse(). If the transaction dirties
a bli and then aborts or an unpin results in an abort due to a log
I/O error, the last reference count holder is expected to explicitly
remove the item from the AIL and release it (since an abort means
filesystem shutdown and metadata writeback will never occur).
This leads to fairly complex checks being replicated in a few
different places. Since ->iop_unlock() and xfs_trans_brelse() are
nearly identical, refactor the logic into a common helper that
implements and documents the semantics in one place. This patch does
not change behavior.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
xfs_trans_brelse() is a bit of a historical mess, similar to
xfs_buf_item_unlock(). It is unnecessarily verbose, has snippets of
commented out code, inconsistency with regard to stale items, etc.
Clean up xfs_trans_brelse() to use similar logic and flow as
xfs_buf_item_unlock() with regard to bli reference count handling.
This patch makes no functional changes, but facilitates further
refactoring of the common bli reference count handling code.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
xfstests generic/388,475 occasionally reproduce assertion failures
in xfs_buf_item_unpin() when the final bli reference is dropped on
an invalidated buffer and the buffer is not locked as it is expected
to be. Invalidated buffers should remain locked on transaction
commit until the final unpin, at which point the buffer is removed
from the AIL and the bli is freed since stale buffers are not
written back.
The assert failures are associated with filesystem shutdown,
typically due to log I/O errors injected by the test. The
problematic situation can occur if the shutdown happens to cause a
race between an active transaction that has invalidated a particular
buffer and an I/O error on a log buffer that contains the bli
associated with the same (now stale) buffer.
Both transaction and log contexts acquire a bli reference. If the
transaction has already invalidated the buffer by the time the I/O
error occurs and ends up aborting due to shutdown, the transaction
and log hold the last two references to a stale bli. If the
transaction cancel occurs first, it treats the buffer as non-stale
due to the aborted state: the bli reference is dropped and the
buffer is released/unlocked. The log buffer I/O error handling
eventually calls into xfs_buf_item_unpin(), drops the final
reference to the bli and treats it as stale. The buffer wasn't left
locked by xfs_buf_item_unlock(), however, so the assert fails and
the buffer is double unlocked. The latter problem is mitigated by
the fact that the fs is shutdown and no further damage is possible.
->iop_unlock() of an invalidated buffer should behave consistently
with respect to the bli refcount, regardless of aborted state. If
the refcount remains elevated on commit, we know the bli is awaiting
an unpin (since it can't be in another transaction) and will be
handled appropriately on log buffer completion. If the final bli
reference of an invalidated buffer is dropped in ->iop_unlock(), we
can assume the transaction has aborted because invalidation implies
a dirty transaction. In the non-abort case, the log would have
acquired a bli reference in ->iop_pin() and prevented bli release at
->iop_unlock() time. In the abort case the item must be freed and
buffer unlocked because it wasn't pinned by the log.
Rework xfs_buf_item_unlock() to simplify the currently circuitous
and duplicate logic and leave invalidated buffers locked based on
bli refcount, regardless of aborted state. This ensures that a
pinned, stale buffer is always found locked when eventually
unpinned.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Now that deferred operations are completely managed via
transactions, it's no longer necessary to cancel the dfops in error
paths that already cancel the associated transaction. There are a
few such calls lingering throughout the codebase.
Remove all remaining unnecessary calls to xfs_defer_cancel(). This
leaves xfs_defer_cancel() calls in two places. The first is the call
in the transaction cancel path itself, which facilitates this patch.
The second is made via the xfs_defer_finish() error path to provide
consistent error semantics with transaction commit. For example,
xfs_trans_commit() expects an xfs_defer_finish() failure to clean up
the dfops structure before it returns.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
The VFS routine that calls ->get_link blindly copies whatever's returned
into the user's buffer. If we return a NULL pointer, the vfs will
crash on the null pointer. Therefore, return -EFSCORRUPTED instead of
blowing up the kernel.
[dgc: clean up with hch's suggestions]
Reported-by: wen.xu@gatech.edu
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
This patch fixes missing up_read call.
Fixes: c9b60788fc ("f2fs: fix to do sanity check with block address in main area")
Cc: <stable@vger.kernel.org> # 4.19+
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch avoids BUG_ON when f2fs_get_meta_page_nofail got EIO during
xfstests/generic/475.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
We noticed the performance bottleneck in FUSE running our Virtuozzo storage
over rdma. On some types of workload we observe 20% of times spent in
request_find() in profiler. This function is iterating over long requests
list, and it scales bad.
The patch introduces hash table to reduce the number of iterations, we do
in this function. Hash generating algorithm is taken from hash_add()
function, while 256 lines table is used to store pending requests. This
fixes problem and improves the performance.
Reported-by: Alexey Kuznetsov <kuznet@virtuozzo.com>
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
This field is not needed after the previous patch, since we can easily
convert request ID to interrupt request ID and vice versa.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Using of two unconnected IDs req->in.h.unique and req->intr_unique does not
allow to link requests to a hash table. We need can't use none of them as a
key to calculate hash.
This patch changes the algorithm of allocation of IDs for a request. Plain
requests obtain even ID, while interrupt requests are encoded in the low
bit. So, in next patches we will be able to use the rest of ID bits to
calculate hash, and the hash will be the same for plain and interrupt
requests.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Currently, we take fc->lock there only to check for fc->connected.
But this flag is changed only on connection abort, which is very
rare operation.
So allow checking fc->connected under just fc->bg_lock and use this lock
(as well as fc->lock) when resetting fc->connected.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
To reduce contention of fc->lock, this patch introduces bg_lock for
protection of fields related to background queue. These are:
max_background, congestion_threshold, num_background, active_background,
bg_queue and blocked.
This allows next patch to make async reads not requiring fc->lock, so async
reads and writes will have better performance executed in parallel.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Functions sequences like request_end()->flush_bg_queue() require that
max_background and congestion_threshold are constant during their
execution. Otherwise, checks like
if (fc->num_background == fc->max_background)
made in different time may behave not like expected.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Since they are of unsigned int type, it's allowed to read them
unlocked during reporting to userspace. Let's underline this fact
with READ_ONCE() macroses.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
This cleanup patch makes the function to use the primitive
instead of direct dereferencing.
Also, move fiq dereferencing out of cycle, since it's
always constant.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
There are several FUSE filesystems that can implement server-side copy
or other efficient copy/duplication/clone methods. The copy_file_range()
syscall is the standard interface that users have access to while not
depending on external libraries that bypass FUSE.
Signed-off-by: Niels de Vos <ndevos@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Using waitqueue_active() is racy. Make sure we issue a wake_up()
unconditionally after storing into fc->blocked. After that it's okay to
optimize with waitqueue_active() since the first wake up provides the
necessary barrier for all waiters, not the just the woken one.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: 3c18ef8117 ("fuse: optimize wake_up")
Cc: <stable@vger.kernel.org> # v3.10
Otherwise fuse_dev_do_write() could come in and finish off the request, and
the set_bit(FR_SENT, ...) could trigger the WARN_ON(test_bit(FR_SENT, ...))
in request_end().
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Reported-by: syzbot+ef054c4d3f64cd7f7cec@syzkaller.appspotmai
Fixes: 46c34a348b ("fuse: no fc->lock for pqueue parts")
Cc: <stable@vger.kernel.org> # v4.2
When dax_lock_mapping_entry() has to sleep to obtain entry lock, it will
fail to unlock mapping->i_pages spinlock and thus immediately deadlock
against itself when retrying to grab the entry lock again. Fix the
problem by unlocking mapping->i_pages before retrying.
Fixes: c2a7d2a115 ("filesystem-dax: Introduce dax_lock_mapping_entry()")
Reported-by: Barret Rhoden <brho@google.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
This averts the need to re-generate flags in fanotify_show_fdinfo()
and sets the scene for addition of more upcoming flags without growing
new members to the fanotify_data struct.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Mark inode dirty explicitly in the end of recover_inode() to make sure
that all recoverable fields can be persisted later.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
inode.i_gc_failures is used to indicate that skip count of migrating
on blocks of inode, we should guarantee it can be recovered in sudden
power-off case.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Testcase to reproduce this bug:
1. mkfs.f2fs /dev/sdd
2. mount -t f2fs /dev/sdd /mnt/f2fs
3. touch /mnt/f2fs/file
4. sync
5. chattr +A /mnt/f2fs/file
6. xfs_io -f /mnt/f2fs/file -c "fsync"
7. godown /mnt/f2fs
8. umount /mnt/f2fs
9. mount -t f2fs /dev/sdd /mnt/f2fs
10. lsattr /mnt/f2fs/file
-----------------N- /mnt/f2fs/file
But actually, we expect the corrct result is:
-------A---------N- /mnt/f2fs/file
The reason is we didn't recover inode.i_flags field during mount,
fix it.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This is related to
ee70daaba8 ("xfs: update i_size after unwritten conversion in dio completion")
If we update i_size during dio_write, dio_read can read out stale data, which
breaks xfstests/465.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Currently we show mount option "io_bits=%u" as "io_size=%uKB",
it will cause option parsing problem(unrecognized mount option)
in remount.
Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Fixes gcc '-Wunused-but-set-variable' warning:
fs/nfsd/vfs.c: In function 'nfsd_create':
fs/nfsd/vfs.c:1279:16: warning:
variable 'dirp' set but not used [-Wunused-but-set-variable]
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Upon receiving a request for async copy, create a new kthread. If we
get asynchronous request, make sure to copy the needed arguments/state
from the stack before starting the copy. Then start the thread and reply
back to the client indicating copy is asynchronous.
nfsd_copy_file_range() will copy in a loop over the total number of
bytes is needed to copy. In case a failure happens in the middle, we
ignore the error and return how much we copied so far. Once done
creating a workitem for the callback workqueue and send CB_OFFLOAD with
the results.
The lifetime of the copy stateid is bound to the vfs copy. This way we
don't need to keep the nfsd_net structure for the callback. We could
keep it around longer so that an OFFLOAD_STATUS that came late would
still get results, but clients should be able to deal without that.
We handle OFFLOAD_CANCEL by sending a signal to the copy thread and
calling kthread_stop.
A client should cancel any ongoing copies before calling DESTROY_CLIENT;
if not, we return a CLIENT_BUSY error.
If the client is destroyed for some other reason (lease expiration, or
server shutdown), we must clean up any ongoing copies ourselves.
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
[colin.king@canonical.com: fix leak in error case]
[bfields@fieldses.org: remove signalling, merge patches]
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
* (2) fixes for the dax error handling updates that were merged for
v4.19-rc1. My mails to Al have been bouncing recently, so I do not have
his ack but the uaccess change is of the trivial / obviously correct
variety. The address_space_operations fixes a regression.
* A filesystem-dax fix to correct the zero page lookup to be compatible
with non-x86 (mips and s390) architectures.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJbqoecAAoJEB7SkWpmfYgCaPYP/1Pf2ADt0pOskSk0ixM06UI9
1lR2g2/ICMc505+oB+wUv9TkZOh9jcIS9o8GfLhgNvP7AU4woRvudeyr4NUc0mdw
rtHRA6TIimbXa3+O2qMg4gqUjXRxj6urQp5oeQi8mQ/vefZv1aisEw/Klae8klVC
HGoMFii19WGXXyPM2vNUb2L+JGZt1p/nl/Z8ydPavn1XkIIGb7c+MivDiaemjjgT
487TmFULgLVhTCtQXlhkH7UCcCQ3+l3yxKaS1/g2hFpWE4LncBIvq8XBPwf5RQSL
H/5rH/sd30XR2L0NMERxr0ENvCJf2iIf4bIqckODN4L9ojRE8zmBZMsSeRKmHufm
3ZfTBLHjPUwwKWy7PlKSsFk2J8KjErsqRlfZQSMSJShpEgL1jCjYtuTEtupaegbU
v8TwzsNDgJ1B6JuKJ7lh7hOF7vUQ/i65xG8SFACvNoiih8RGSW3llra442k2hmFu
IEMXa9S4tvqHfXUb0J/6hLLi+xoV+KsYPWRiCuovy7t6EfAWUnNuGCldjfsQtZZv
npHS7F7lkWlSCneDbE4cMdkkwjBKjAw0sjIWDrPVVCoITVe1j9+bEwE9reX1VOS+
L+PB/WgcVH72MeiQnmPPTcUyEmNgCbku7NEhwrMHJngxuo9HmXE+BN8jdFDYZFfL
uWDV25XOxviOC9xBosw4
=mAaH
-----END PGP SIGNATURE-----
erge tag 'libnvdimm-fixes-4.19-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Dan writes:
"libnvdimm/dax for 4.19-rc6
* (2) fixes for the dax error handling updates that were merged for
v4.19-rc1. My mails to Al have been bouncing recently, so I do not have
his ack but the uaccess change is of the trivial / obviously correct
variety. The address_space_operations fixes a regression.
* A filesystem-dax fix to correct the zero page lookup to be compatible
with non-x86 (mips and s390) architectures."
* tag 'libnvdimm-fixes-4.19-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
device-dax: Add missing address_space_operations
uaccess: Fix is_source param for check_copy_size() in copy_to_iter_mcsafe()
filesystem-dax: Fix use of zero page
Fixes the following sparse warning:
fs/overlayfs/inode.c:507:39: warning:
symbol 'ovl_aops' was not declared. Should it be static?
Fixes: 5b910bd615 ("ovl: fix GPF in swapfile_activate of file from overlayfs over xfs")
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
If macro CONFIG_QUOTA is not enabled then mount option flag
of usrquota/grpquota will not be set, so we can remove some
building macro check safely in ext2_shwo_options().
Additionally, I think it's better to define EXT2_MOUNT_DAX
regardless macro CONFIG_FS_DAX is enabled just like acl/xattr.
Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Commit 031a072a0b ("vfs: call vfs_clone_file_range() under freeze
protection") created a wrapper do_clone_file_range() around
vfs_clone_file_range() moving the freeze protection to former, so
overlayfs could call the latter.
The more common vfs practice is to call do_xxx helpers from vfs_xxx
helpers, where freeze protecction is taken in the vfs_xxx helper, so
this anomality could be a source of confusion.
It seems that commit 8ede205541 ("ovl: add reflink/copyfile/dedup
support") may have fallen a victim to this confusion -
ovl_clone_file_range() calls the vfs_clone_file_range() helper in the
hope of getting freeze protection on upper fs, but in fact results in
overlayfs allowing to bypass upper fs freeze protection.
Swap the names of the two helpers to conform to common vfs practice
and call the correct helpers from overlayfs and nfsd.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Tested by doing clone on overlayfs while upper xfs+reflink is frozen:
xfs_io -f /ovl/y
fsfreeze -f /xfs
xfs_io> reflink /ovl/x
Before the fix xfs_io enters xfs_reflink_remap_range() and blocks
in xfs_trans_alloc(). After the fix, xfs_io blocks outside xfs code
in ovl_clone_file_range().
Fixes: 8ede205541 ("ovl: add reflink/copyfile/dedup support")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Tested by re-writing to an open overlayfs file while upper ext4 is frozen:
xfs_io -f /ovl/x
xfs_io> pwrite 0 4096
fsfreeze -f /ext4
xfs_io> pwrite 0 4096
WARNING: CPU: 0 PID: 1492 at fs/ext4/ext4_jbd2.c:53 \
ext4_journal_check_start+0x48/0x82
After the fix, the second write blocks in ovl_write_iter() and avoids
hitting WARN_ON(sb->s_writers.frozen == SB_FREEZE_COMPLETE) in
ext4_journal_check_start().
Fixes: 2a92e07edc ("ovl: add ovl_write_iter()")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
The memory leak was detected by kmemleak when running xfstests
overlay/051,053
Fixes: caf70cb2ba ("ovl: cleanup orphan index entries")
Cc: <stable@vger.kernel.org> # v4.13
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
One of the goals of this series is to remove a separate reference to
the css of the bio. This can and should be accessed via bio_blkcg. In
this patch, the wbc_init_bio call is changed such that it must be called
after a queue has been associated with the bio.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
- A wrong UBIFS assertion in mount code
- Fix for a NULL pointer deref in mount code
- Revert of a bad fix for xattrs
-----BEGIN PGP SIGNATURE-----
iQJKBAABCAA0FiEEdgfidid8lnn52cLTZvlZhesYu8EFAlukuf8WHHJpY2hhcmRA
c2lnbWEtc3Rhci5hdAAKCRBm+VmF6xi7wRnyD/40jc4PA1pP3dJcXqdqwSeN+vJZ
YWu307rE+giWWg6vRKa8zu3z614579UC7EE/I9miSrPNPYCi9VTLNzJGNMvn8aSf
8seMrbhW0kz29T4YZ5fESr6x3hMsSdmIWZkPobruul8oBGmNcUm3Ag3MBM4XbQTm
5AcWUdXVQUXPROK/Xm04hNa5pJ6wkSu5CxA+mf9YM2ZYYPPdCblxxq3NsDX99vDw
gZup7JndSzpk+IpZMQIEKUP83vh5YgCU06YPjXi5YIfXl+sQnpH0fyrWm3+1C0Yz
4T80rd3fm+7+obXdAI29mhcoZ09fGayozfqjU/fV3edW43+pUvRWced13vuLqz9Q
JannCFeN+v/nB6OPlj4JGkgNunZ5M3r6ZZIQBbY9RJuZgG0/FMbpYb62RY1z/jg9
YA7e3J/R02i7tHnPbcIz6ngKE0c42VKxTdATaybXVunx5ZPip7txj9OeJbfIYUrr
CbLMdiqIJUAAheHMUrTiF3jDNwEiMhBZA4dAGDvLmpLuyfMlN8Lwje9BdRd5VNKp
zwEGUQkDWLxniFYLtmbceFSa4IQT6z70XJn1pqdDvgjxfp8trEqhtp9GMxA2u6TA
adcug5gUzRWMQ2QuyoKIv6tkjmnII5N2KzhZH9hwBReRJUlKY9WQ9jowcGrPXhbw
F2prIwBxp8drneOeKg==
=CGLq
-----END PGP SIGNATURE-----
Merge tag 'upstream-4.19-rc4' of git://git.infradead.org/linux-ubifs
Richard writes:
"This pull request contains fixes for UBIFS:
- A wrong UBIFS assertion in mount code
- Fix for a NULL pointer deref in mount code
- Revert of a bad fix for xattrs"
* tag 'upstream-4.19-rc4' of git://git.infradead.org/linux-ubifs:
Revert "ubifs: xattr: Don't operate on deleted inodes"
ubifs: drop false positive assertion
ubifs: Check for name being NULL while mounting
Step to reproduce this bug:
1. logon as root
2. mount -t f2fs /dev/sdd /mnt;
3. touch /mnt/file;
4. chown system /mnt/file; chgrp system /mnt/file;
5. xfs_io -f /mnt/file -c "fsync";
6. godown /mnt;
7. umount /mnt;
8. mount -t f2fs /dev/sdd /mnt;
After step 8) we will expect file's uid/gid are all system, but during
recovery, these two fields were not been recovered, fix it.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If we have an error in f2fs_build_free_nids, we're able to fall into a loop
to find free nids.
Suggested-by: Chao Yu <chao@kernel.org>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The 'm' kcore_list item could point to kclist_head, and it is incorrect to
look at m->addr / m->size in this case.
There is no choice but to run through the list of entries for every
address if we did not find any entry in the previous iteration
Reset 'm' to NULL in that case at Omar Sandoval's suggestion.
[akpm@linux-foundation.org: add comment]
Link: http://lkml.kernel.org/r/1536100702-28706-1-git-send-email-asmadeus@codewreck.org
Fixes: bf991c2231 ("proc/kcore: optimize multiple page reads")
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Omar Sandoval <osandov@osandov.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: James Morse <james.morse@arm.com>
Cc: Bhupesh Sharma <bhsharma@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This reverts commit 11a6fc3dc7.
UBIFS wants to assert that xattr operations are only issued on files
with positive link count. The said patch made this operations return
-ENOENT for unlinked files such that the asserts will no longer trigger.
This was wrong since xattr operations are perfectly fine on unlinked
files.
Instead the assertions need to be fixed/removed.
Cc: <stable@vger.kernel.org>
Fixes: 11a6fc3dc7 ("ubifs: xattr: Don't operate on deleted inodes")
Reported-by: Koen Vandeputte <koen.vandeputte@ncentric.com>
Tested-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: Richard Weinberger <richard@nod.at>
The following sequence triggers
ubifs_assert(c, c->lst.taken_empty_lebs > 0);
at the end of ubifs_remount_fs():
mount -t ubifs /dev/ubi0_0 /mnt
echo 1 > /sys/kernel/debug/ubifs/ubi0_0/ro_error
umount /mnt
mount -t ubifs -o ro /dev/ubix_y /mnt
mount -o remount,ro /mnt
The resulting
UBIFS assert failed in ubifs_remount_fs at 1878 (pid 161)
is a false positive. In the case above c->lst.taken_empty_lebs has
never been changed from its initial zero value. This will only happen
when the deferred recovery is done.
Fix this by doing the assertion only when recovery has been done
already.
Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>
Signed-off-by: Richard Weinberger <richard@nod.at>
The requested device name can be NULL or an empty string.
Check for that and refuse to continue. UBIFS has to do this manually
since we cannot use mount_bdev(), which checks for this condition.
Fixes: 1e51764a3c ("UBIFS: add new flash file system")
Reported-by: syzbot+38bd0f7865e5c6379280@syzkaller.appspotmail.com
Signed-off-by: Richard Weinberger <richard@nod.at>
We don't need to call dquot_initialize() twice in jfs_evict_inode(),
remove one of them for cleanup.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
This helps to control the frequency of submission of discard and
GC requests independently, based on the need.
Suggested-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Sahitya Tummala <stummala@codeaurora.org>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Sync syscall to DAX file needs to flush processor cache, but it
currently does not flush to existing DAX files. This is because
'ext2_da_aops' is set to address_space_operations of existing DAX
files, instead of 'ext2_dax_aops', since S_DAX flag is set after
ext2_set_aops() in the open path.
Similar to ext4, change ext2_iget() to initialize i_flags before
ext2_set_aops().
Fixes: fb094c9074 ("ext2, dax: introduce ext2_dax_aops")
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Suggested-by: Jan Kara <jack@suse.cz>
Cc: Jan Kara <jack@suse.cz>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: <stable@vger.kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Commit 786534b92f introduced a regression that caused listxattr to
return the POSIX ACL attribute names even though sysfs doesn't support
POSIX ACLs. This happens because simple_xattr_list checks for NULL
i_acl / i_default_acl, but inode_init_always initializes those fields
to ACL_NOT_CACHED ((void *)-1). For example:
$ getfattr -m- -d /sys
/sys: system.posix_acl_access: Operation not supported
/sys: system.posix_acl_default: Operation not supported
Fix this in simple_xattr_list by checking if the filesystem supports POSIX ACLs.
Fixes: 786534b92f ("tmpfs: listxattr should include POSIX ACL xattrs")
Reported-by: Marc Aurèle La France <tsi@tuyoix.net>
Tested-by: Marc Aurèle La France <tsi@tuyoix.net>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Cc: stable@vger.kernel.org # v4.5+
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
maliciously crafted file systems, and some DAX fixes.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAlufGncACgkQ8vlZVpUN
gaPwuQf9FKp9yRvjBkjtnH3+s4Ps8do9r067+90y1k2DJMxKoaBUhGSW2MJJ04j+
5F6Ndp/TZHw+LfPnzsqlrAAoP3CG5+kacfJ7xeVKR0umvACm6rLMsCUct7/rFoSl
PgzCALFIJvQ9+9shuO9qrgmjJrfrlTVUgR9Mu3WUNEvMFbMjk3FMI8gi5kjjWemE
G9TDYH2lMH2sL0cWF51I2gOyNXOXrihxe+vP7j6i/rUkV+YLpKZhE1ss3Sfn6pR2
p/KjnXdupLJpgYLJne9kMrq2r8xYmDfA0S+Dec7nkox5FUOFUHssl3+q8C7cDwO9
zl6VyVFwybjFRJ/Y59wox6eqVPlIWw==
=1P1w
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Ted writes:
Various ext4 bug fixes; primarily making ext4 more robust against
maliciously crafted file systems, and some DAX fixes.
* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4, dax: set ext4_dax_aops for dax files
ext4, dax: add ext4_bmap to ext4_dax_aops
ext4: don't mark mmp buffer head dirty
ext4: show test_dummy_encryption mount option in /proc/mounts
ext4: close race between direct IO and ext4_break_layouts()
ext4: fix online resizing for bigalloc file systems with a 1k block size
ext4: fix online resize's handling of a too-small final block group
ext4: recalucate superblock checksum after updating free blocks/inodes
ext4: avoid arithemetic overflow that can trigger a BUG
ext4: avoid divide by zero fault when deleting corrupted inline directories
ext4: check to make sure the rename(2)'s destination is not freed
ext4: add nonstring annotations to ext4.h
The terminating NUL byte is only there because the buffer is
allocated with kzalloc(PAGE_SIZE, GFP_KERNEL), but since the
range-check is off-by-one, and PAGE_SIZE==PATH_MAX, the
returned string may not be zero-terminated if it is exactly
PATH_MAX characters long. Furthermore also the initial loop
may theoretically exceed PATH_MAX and cause a fault.
Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Sync syscall to DAX file needs to flush processor cache, but it
currently does not flush to existing DAX files. This is because
'ext4_da_aops' is set to address_space_operations of existing DAX
files, instead of 'ext4_dax_aops', since S_DAX flag is set after
ext4_set_aops() in the open path.
New file
--------
lookup_open
ext4_create
__ext4_new_inode
ext4_set_inode_flags // Set S_DAX flag
ext4_set_aops // Set aops to ext4_dax_aops
Existing file
-------------
lookup_open
ext4_lookup
ext4_iget
ext4_set_aops // Set aops to ext4_da_aops
ext4_set_inode_flags // Set S_DAX flag
Change ext4_iget() to initialize i_flags before ext4_set_aops().
Fixes: 5f0663bb4a ("ext4, dax: introduce ext4_dax_aops")
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Suggested-by: Jan Kara <jack@suse.cz>
Cc: stable@vger.kernel.org
Ext4 mount path calls .bmap to the journal inode. This currently
works for the DAX mount case because ext4_iget() always set
'ext4_da_aops' to any regular files.
In preparation to fix ext4_iget() to set 'ext4_dax_aops' for ext4
DAX files, add ext4_bmap() to 'ext4_dax_aops', since bmap works for
DAX inodes.
Fixes: 5f0663bb4a ("ext4, dax: introduce ext4_dax_aops")
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Suggested-by: Jan Kara <jack@suse.cz>
Cc: stable@vger.kernel.org
Marking mmp bh dirty before writing it will make writeback
pick up mmp block later and submit a write, we don't want the
duplicate write as kmmpd thread should have full control of
reading and writing the mmp block.
Another reason is we will also have random I/O error on
the writeback request when blk integrity is enabled, because
kmmpd could modify the content of the mmp block(e.g. setting
new seq and time) while the mmp block is under I/O requested
by writeback.
Signed-off-by: Li Dongyang <dongyangli@ddn.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Cc: stable@vger.kernel.org
When in effect, add "test_dummy_encryption" to _ext4_show_options() so
that it is shown in /proc/mounts and other relevant procfs files.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAlucKDwACgkQiiy9cAdy
T1EhEgwAqgVXTujce2UVPtaFY/MaGmaIwAimh+aYOCAADxLYJHkjtRzHd5PQgf+L
n55R2hFcMeelWxOMEb/aRmxIKLk8fmJYVWClM2+S7Z79M3GHexDbMS8+oDZnzCTB
EknvaTbi+vOt4HGABkbJ/jiQCgonmeobon30gLWaYa3XGeYc7ZV5gR+EXL9xSdvh
+I+x3rSDpm8fQ5njkB7RKgfB+ha4NQqZ6dXlYQzcb0vMO3/lhQ56Ypgn95Jlu5UW
pcLxUFE1do+JeGvIU1it2SCRJ5499g180Rxucl7X1xFBQ44Qss9QOeWkFTZ8784V
PIYVZMTUqO0Km4H22qXD8lIY5GjDuAXLYM3AddFkJpbKaw6g++ZsUXSfoA5zRIDn
10edaPK/hDIQeFaV+ySTN5g/Qh3YFnmY4kDL3t3CRZDe4+DTW/+cmrF3sGkhZDQt
+nDo0JxJsjNnJW6loB5Lb76lygvsng01owSsYSAChjhYwBvCDgp9/D85pDQ/oYkl
HKD9tiF3
=YNSx
-----END PGP SIGNATURE-----
Merge tag '4.19-rc3-smb3-cifs' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fixes from Steve French:
"Fixes for four CIFS/SMB3 potential pointer overflow issues, one minor
build fix, and a build warning cleanup"
* tag '4.19-rc3-smb3-cifs' of git://git.samba.org/sfrench/cifs-2.6:
cifs: read overflow in is_valid_oplock_break()
cifs: integer overflow in in SMB2_ioctl()
CIFS: fix wrapping bugs in num_entries()
cifs: prevent integer overflow in nxt_dir_entry()
fs/cifs: require sha512
fs/cifs: suppress a string overflow warning
Stable bugfixes:
- v4.17+: Fix a tracepoint Oops in initiate_file_draining()
- v4.17+: Fix a tracepoint Oops in initiate_file_draining()
- v4.11+: Fix an infinite loop on I/O
Other fixes:
- Return errors if a waiting layoutget is killed
- Don't open code clearing of delegation state
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAlucGhMACgkQ18tUv7Cl
QOsMCA/9ET0bbzus25DWcnbT10bpEdtW4p6dR/2ztGqUGe0FVyCyVT70jBKFbnAL
cO1pqElKLj7TMPhTsxK63dDxXGELXCqDtmsHELaD8jf0h6270KAPariJBQ5+N0ud
5U5CXswW/zbQekE9GMTDQtAAGBzfht33PavFt2+5oVYTAQ5K6Pwvq2qMoifQxMlk
wjtVjypz0QjBy5bHBO6XGxX58JIc23EwA0/KDS4cU3vkwDXmEZcVYIUdqJF4gtCz
JdJdnT4b9ebtbdHENx8rkot3L1VSx6JfW9pPMvxLjxn8IG1rj7zQXtc7kpnoF8RY
WVGWuf4rn7Zo7YXf11SXNebMYgrljx5/0KcmUtSgSmCCqVUmY1e/a69e0fhkKfxn
/W/+fYIdC1wG0JXtrCN8eJbGropYj3B8Ln5TBV4LN91hFMI8Lx/4D1lKLoK7RNLJ
3Q6VmKIhKVMHgK6eAivWyN7X/WzIvAj37a2ix2xSjENUIP+l3ePAekc4f5XGMe8O
wx6wxgvVSomVrsM565XGcjw+LXUyzlXowS6JhR+Zn6fYmbDkPk0ZMC5HBFidakkB
YxO7aWjBvyYinOrBMWODeWatt4q50bI6YtQpxxWyDIdRXWbvQDjBILveTh6/sRGQ
KbA3r6XC/DSMri4Mo/r+92fbBEYV2otnkkYraz04VgBVYYgiJts=
=C83+
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-4.19-2' of git://git.linux-nfs.org/projects/anna/linux-nfs
Pull NFS client bugfixes from Anna Schumaker:
"These are a handful of fixes for problems that Trond found. Patch #1
and #3 have the same name, a second issue was found after applying the
first patch.
Stable bugfixes:
- v4.17+: Fix tracepoint Oops in initiate_file_draining()
- v4.11+: Fix an infinite loop on I/O
Other fixes:
- Return errors if a waiting layoutget is killed
- Don't open code clearing of delegation state"
* tag 'nfs-for-4.19-2' of git://git.linux-nfs.org/projects/anna/linux-nfs:
NFS: Don't open code clearing of delegation state
NFSv4.1 fix infinite loop on I/O.
NFSv4: Fix a tracepoint Oops in initiate_file_draining()
pNFS: Ensure we return the error if someone kills a waiting layoutget
NFSv4: Fix a tracepoint Oops in initiate_file_draining()
Add a helper for the case when the nfs4 open state has been set to use
a delegation stateid, and we want to revert to using the open stateid.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
The previous fix broke recovery of delegated stateids because it assumes
that if we did not mark the delegation as suspect, then the delegation has
effectively been revoked, and so it removes that delegation irrespectively
of whether or not it is valid and still in use. While this is "mostly
harmless" for ordinary I/O, we've seen pNFS fail with LAYOUTGET spinning
in an infinite loop while complaining that we're using an invalid stateid
(in this case the all-zero stateid).
What we rather want to do here is ensure that the delegation is always
correctly marked as needing testing when that is the case. So we want
to close the loophole offered by nfs4_schedule_stateid_recovery(),
which marks the state as needing to be reclaimed, but not the
delegation that may be backing it.
Fixes: 0e3d3e5df0 ("NFSv4.1 fix infinite loop on IO BAD_STATEID error")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: stable@vger.kernel.org # v4.11+
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Now that the value of 'ino' can be NULL or an ERR_PTR(), we need to
change the test in the tracepoint.
Fixes: ce5624f7e6 ("NFSv4: Return NFS4ERR_DELAY when a layout fails...")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: stable@vger.kernel.org # v4.17+
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
If someone interrupts a wait on one or more outstanding layoutgets in
pnfs_update_layout() then return the ERESTARTSYS/EINTR error.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Now that the value of 'ino' can be NULL or an ERR_PTR(), we need to
change the test in the tracepoint.
Fixes: ce5624f7e6 ("NFSv4: Return NFS4ERR_DELAY when a layout fails...")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: stable@vger.kernel.org # v4.17+
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
ioctls that are
* callable only via tty_ioctl()
* not driver-specific
* not demand data structure conversions
* either always need passing arg as is or always demand compat_ptr()
get intercepted in tty_compat_ioctl() from the very beginning and
redirecter to tty_ioctl(). As the result, their entries in fs/compat_ioctl.c
(some of those had been missing, BTW) got removed, as well as
n_tty_compat_ioctl_helper() (now it's never called with any cmd it would accept).
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCW5qpOgAKCRDh3BK/laaZ
PDCQAQCIKLg0aLeWOkfUO76mBjlp5srKgJfrqpFoyuozD6l2fQEAl/W2x9NOduV+
PK4sCYMT8SpI0hMrbv9P4zZ683kmaA8=
=RnZU
-----END PGP SIGNATURE-----
Merge tag 'ovl-fixes-4.19-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs
Pull overlayfs fixes from Miklos Szeredi:
"This fixes a regression in the recent file stacking update, reported
and fixed by Amir Goldstein. The fix is fairly trivial, but involves
adding a fadvise() f_op and the associated churn in the vfs. As
discussed on -fsdevel, there are other possible uses for this method,
than allowing proper stacking for overlays.
And there's one other fix for a syzkaller detected oops"
* tag 'ovl-fixes-4.19-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
ovl: fix oopses in ovl_fill_super() failure paths
ovl: add ovl_fadvise()
vfs: implement readahead(2) using POSIX_FADV_WILLNEED
vfs: add the fadvise() file operation
Documentation/filesystems: update documentation of file_operations
ovl: fix GPF in swapfile_activate of file from overlayfs over xfs
ovl: respect FIEMAP_FLAG_SYNC flag
persistent_ram_vmap() returns the page start vaddr.
persistent_ram_iomap() supports non-page-aligned mapping.
persistent_ram_buffer_map() always adds offset-in-page to the vaddr
returned from these two functions, which causes incorrect mapping of
non-page-aligned persistent ram buffer.
By default ftrace_size is 4096 and max_ftrace_cnt is nr_cpu_ids. Without
this patch, the zone_sz in ramoops_init_przs() is 4096/nr_cpu_ids which
might not be page aligned. If the offset-in-page > 2048, the vaddr will be
in next page. If the next page is not mapped, it will cause kernel panic:
[ 0.074231] BUG: unable to handle kernel paging request at ffffa19e0081b000
...
[ 0.075000] RIP: 0010:persistent_ram_new+0x1f8/0x39f
...
[ 0.075000] Call Trace:
[ 0.075000] ramoops_init_przs.part.10.constprop.15+0x105/0x260
[ 0.075000] ramoops_probe+0x232/0x3a0
[ 0.075000] platform_drv_probe+0x3e/0xa0
[ 0.075000] driver_probe_device+0x2cd/0x400
[ 0.075000] __driver_attach+0xe4/0x110
[ 0.075000] ? driver_probe_device+0x400/0x400
[ 0.075000] bus_for_each_dev+0x70/0xa0
[ 0.075000] driver_attach+0x1e/0x20
[ 0.075000] bus_add_driver+0x159/0x230
[ 0.075000] ? do_early_param+0x95/0x95
[ 0.075000] driver_register+0x70/0xc0
[ 0.075000] ? init_pstore_fs+0x4d/0x4d
[ 0.075000] __platform_driver_register+0x36/0x40
[ 0.075000] ramoops_init+0x12f/0x131
[ 0.075000] do_one_initcall+0x4d/0x12c
[ 0.075000] ? do_early_param+0x95/0x95
[ 0.075000] kernel_init_freeable+0x19b/0x222
[ 0.075000] ? rest_init+0xbb/0xbb
[ 0.075000] kernel_init+0xe/0xfc
[ 0.075000] ret_from_fork+0x3a/0x50
Signed-off-by: Bin Yang <bin.yang@intel.com>
[kees: add comments describing the mapping differences, updated commit log]
Fixes: 24c3d2f342 ("staging: android: persistent_ram: Make it possible to use memory outside of bootmem")
Cc: stable@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
We need to verify that the "data_offset" is within bounds.
Reported-by: Dr Silvio Cesare of InfoSect <silvio.cesare@gmail.com>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
This patch adds to support injecting error for write IO, this can simulate
IO error like fail_make_request or dm_flakey does.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Remove the verbose license text from f2fs files and replace them with
SPDX tags. This does not change the license of any of the code.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
It's a little bit strange when fault_injection related
options fail with -EINVAL which were already disabled
from config, so surround all fault_injection related option
parsing code using CONFIG_F2FS_FAULT_INJECTION. Meanwhile,
slightly change warning message to keep consistency with
option POSIX_ACL and FS_XATTR.
Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The VIDEO_STILLPICTURE is only implemented by one driver, while
VIDEO_GET_EVENT has two users in tree. In both cases, it is fairly
easy to handle the compat ioctls in the native handler rather
than relying on translation in fs/compat_ioctls.
In effect, this means that now the drivers implement both structure
layouts in both native and compat mode, but I don't see anything
wrong with that.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Most DVB audio and video ioctl commands are completely compatible,
and are implemented by just two drivers: ttpci and ivtv. In both
cases, we can use the same ioctl handler for both native and
compat ioctl handling, and remove the entries from the global
lookup table.
In case of ttpci, this directly hooks into the file_operations
structure, and for ivtv, we have to set the compat_ioctl32
method in v4l2_file_operations. For all I can tell, setting it
to video_ioctl2 will still do the right thing for all commands.
Note that for the VIDEO_STILLPICTURE and VIDEO_GET_EVENT commands,
a translation handler in fs/compat_ioctl.c is still used. This
works because the command numbers are different on 32-bit
systems.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
All the CEC ioctls are compatible, and they are only implemented
in one driver, so we can simply let this driver handle them
natively.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Hans Verkuil <hans.verkuil@cisco.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
All dmx ioctls are compatible, and they are only implemented
in one file, so we can replace the list of commands in
fs/compat_ioctl.c with a single line in dmxdev.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
The VIDEO_GET_EVENT and VIDEO_STILLPICTURE was added back in 2005 but
it never worked because the command number is wrong.
Using the right command number means we have a better chance of them
actually doing the right thing, though clearly nobody has ever tried
it successfully.
I noticed these while auditing the remaining users of compat_time_t
for y2038 bugs. This one is fine in that regard, it just never did
anything.
Fixes: 6e87abd0b8 ("[DVB]: Add compat ioctl handling.")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
The "le32_to_cpu(rsp->OutputOffset) + *plen" addition can overflow and
wrap around to a smaller value which looks like it would lead to an
information leak.
Fixes: 4a72dafa19 ("SMB2 FSCTL and IOCTL worker function")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
CC: Stable <stable@vger.kernel.org>
The problem is that "entryptr + next_offset" and "entryptr + len + size"
can wrap. I ended up changing the type of "entryptr" because it makes
the math easier when we don't have to do so much casting.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
CC: Stable <stable@vger.kernel.org>
The "old_entry + le32_to_cpu(pDirInfo->NextEntryOffset)" can wrap
around so I have added a check for integer overflow.
Reported-by: Dr Silvio Cesare of InfoSect <silvio.cesare@gmail.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
CC: Stable <stable@vger.kernel.org>
Use my_zero_pfn instead of ZERO_PAGE(), and pass the vaddr to it instead
of zero so it works on MIPS and s390 who reference the vaddr to select a
zero page.
Cc: <stable@vger.kernel.org>
Fixes: 91d25ba8a6 ("dax: use common 4k zero page for dax mmap reads")
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Currently, project quota could be changed by fssetxattr
ioctl, and existed permission check inode_owner_or_capable()
is obviously not enough, just think that common users could
change project id of file, that could make users to
break project quota easily.
This patch try to follow same regular of xfs project
quota:
"Project Quota ID state is only allowed to change from
within the init namespace. Enforce that restriction only
if we are trying to change the quota ID state.
Everything else is allowed in user namespaces."
Besides that, check and set project id'state should
be an atomic operation, protect whole operation with
inode lock.
Signed-off-by: Wang Shilong <wshilong@ddn.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In the call trace below, we might sleep in function dput().
So in order to avoid sleeping under spin_lock, we remove f2fs_mark_inode_dirty_sync
from __try_update_largest_extent && __drop_largest_extent.
BUG: sleeping function called from invalid context at fs/dcache.c:796
Call trace:
dump_backtrace+0x0/0x3f4
show_stack+0x24/0x30
dump_stack+0xe0/0x138
___might_sleep+0x2a8/0x2c8
__might_sleep+0x78/0x10c
dput+0x7c/0x750
block_dump___mark_inode_dirty+0x120/0x17c
__mark_inode_dirty+0x344/0x11f0
f2fs_mark_inode_dirty_sync+0x40/0x50
__insert_extent_tree+0x2e0/0x2f4
f2fs_update_extent_tree_range+0xcf4/0xde8
f2fs_update_extent_cache+0x114/0x12c
f2fs_update_data_blkaddr+0x40/0x50
write_data_page+0x150/0x314
do_write_data_page+0x648/0x2318
__write_data_page+0xdb4/0x1640
f2fs_write_cache_pages+0x768/0xafc
__f2fs_write_data_pages+0x590/0x1218
f2fs_write_data_pages+0x64/0x74
do_writepages+0x74/0xe4
__writeback_single_inode+0xdc/0x15f0
writeback_sb_inodes+0x574/0xc98
__writeback_inodes_wb+0x190/0x204
wb_writeback+0x730/0xf14
wb_check_old_data_flush+0x1bc/0x1c8
wb_workfn+0x554/0xf74
process_one_work+0x440/0x118c
worker_thread+0xac/0x974
kthread+0x1a0/0x1c8
ret_from_fork+0x10/0x1c
Signed-off-by: Zhikang Zhang <zhangzhikang1@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Add a plug to merge readahead IO in readdir(), expecting it can
reduce bio count before submitting to block layer.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In fill_super -> init_percpu_info, we should destroy percpu counter
in error path, otherwise memory allcoated for percpu counter will
leak.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
It needs to release memory allocated for sbi->write_io in error path,
otherwise, it will cause memory leak.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
default_acl and acl of newly created inode will be initiated
as ACL_NOT_CACHED in vfs function inode_init_always() and later
will be updated by calling xxx_init_acl() in specific filesystems.
Howerver, when default_acl and acl are NULL then they keep the value
of ACL_NOT_CACHED, this patch tries to cache NULL for acl/default_acl
in this case.
Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Acked-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
generic/417 reported as blow:
------------[ cut here ]------------
kernel BUG at /home/yuchao/git/devf2fs/inode.c:695!
invalid opcode: 0000 [#1] PREEMPT SMP
CPU: 1 PID: 21697 Comm: umount Tainted: G W O 4.18.0-rc2+ #39
Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
EIP: f2fs_evict_inode+0x556/0x580 [f2fs]
Call Trace:
? _raw_spin_unlock+0x2c/0x50
evict+0xa8/0x170
dispose_list+0x34/0x40
evict_inodes+0x118/0x120
generic_shutdown_super+0x41/0x100
? rcu_read_lock_sched_held+0x97/0xa0
kill_block_super+0x22/0x50
kill_f2fs_super+0x6f/0x80 [f2fs]
deactivate_locked_super+0x3d/0x70
deactivate_super+0x40/0x60
cleanup_mnt+0x39/0x70
__cleanup_mnt+0x10/0x20
task_work_run+0x81/0xa0
exit_to_usermode_loop+0x59/0xa7
do_fast_syscall_32+0x1f5/0x22c
entry_SYSENTER_32+0x53/0x86
EIP: f2fs_evict_inode+0x556/0x580 [f2fs]
It can simply reproduced with scripts:
Enable quota feature during mkfs.
Testcase1:
1. mkfs.f2fs /dev/zram0
2. mount -t f2fs /dev/zram0 /mnt/f2fs
3. xfs_io -f /mnt/f2fs/file -c "pwrite 0 4k" -c "fsync"
4. godown /mnt/f2fs
5. umount /mnt/f2fs
6. mount -t f2fs -o ro /dev/zram0 /mnt/f2fs
7. umount /mnt/f2fs
Testcase2:
1. mkfs.f2fs /dev/zram0
2. mount -t f2fs /dev/zram0 /mnt/f2fs
3. touch /mnt/f2fs/file
4. create process[pid = x] do:
a) open /mnt/f2fs/file;
b) unlink /mnt/f2fs/file
5. godown -f /mnt/f2fs
6. kill process[pid = x]
7. umount /mnt/f2fs
8. mount -t f2fs -o ro /dev/zram0 /mnt/f2fs
9. umount /mnt/f2fs
The reason is: during recovery, i_{c,m}time of inode will be updated, then
the inode can be set dirty w/o being tracked in sbi->inode_list[DIRTY_META]
global list, so later write_checkpoint will not flush such dirty inode into
node page.
Once umount is called, sync_filesystem() in generic_shutdown_super() will
skip syncng dirty inodes due to sb_rdonly check, leaving dirty inodes
there.
To solve this issue, during umount, add remove SB_RDONLY flag in
sb->s_flags, to make sure sync_filesystem() will not be skipped.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
None of the callers use the it so remove it.
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
If the refcount of a page is lowered between the time that it is returned
by dax_busy_page() and when the refcount is again checked in
ext4_break_layouts() => ___wait_var_event(), the waiting function
ext4_wait_dax_page() will never be called. This means that
ext4_break_layouts() will still have 'retry' set to false, so we'll stop
looping and never check the refcount of other pages in this inode.
Instead, always continue looping as long as dax_layout_busy_page() gives us
a page which it found with an elevated refcount.
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
We treat quota option as usrquota, so remove quota option from ignore
list.
Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
1) no point putting it into fs/compat_ioctl.c when you handle it in
your ->compat_ioctl() anyway.
2) HIDPCONNADD is *not* COMPATIBLE_IOCTL() stuff at all - it does
layout massage (pointer-chasing there)
3) use compat_ptr()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
use compat_ptr() properly and don't bother with fs/compat_ioctl.c -
it's all handled in ->compat_ioctl() anyway.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
ovl_free_fs() dereferences ofs->workbasedir and ofs->upper_mnt in cases when
those might not have been initialized yet.
Fix the initialization order for these fields.
Reported-by: syzbot+c75f181dc8429d2eb887@syzkaller.appspotmail.com
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: <stable@vger.kernel.org> # v4.15
Fixes: 95e6d4177c ("ovl: grab reference to workbasedir early")
Fixes: a9075cdb46 ("ovl: factor out ovl_free_fs() helper")
This got lost in commit 0fdfef9aa7,
which removed CONFIG_CIFS_SMB311.
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Fixes: 0fdfef9aa7 ("smb3: simplify code by removing CONFIG_CIFS_SMB311")
CC: Stable <stable@vger.kernel.org>
CC: linux-cifs@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
A powerpc build of cifs with gcc v8.2.0 produces this warning:
fs/cifs/cifssmb.c: In function ‘CIFSSMBNegotiate’:
fs/cifs/cifssmb.c:605:3: warning: ‘strncpy’ writing 16 bytes into a region of size 1 overflows the destination [-Wstringop-overflow=]
strncpy(pSMB->DialectsArray+count, protocols[i].name, 16);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Since we are already doing a strlen() on the source, change the strncpy
to a memcpy().
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Steve French <stfrench@microsoft.com>
Fix the cell specification mechanism to allow cells to be pre-created
without having to specify at least one address (the addresses will be
upcalled for).
This allows the cell information preload service to avoid the need to issue
loads of DNS lookups during boot to get the addresses for each cell (500+
lookups for the 'standard' cell list[*]). The lookups can be done later as
each cell is accessed through the filesystem.
Also remove the print statement that prints a line every time a new cell is
added.
[*] There are 144 cells in the list. Each cell is first looked up for an
SRV record, and if that fails, for an AFSDB record. These get a list
of server names, each of which then has to be looked up to get the
addresses for that server. E.g.:
dig srv _afs3-vlserver._udp.grand.central.org
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Sometimes, some merged IOs could get a chance to be submitted, resulting in
system hang in shutdown test. This issues IOs all the time after shutdown.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
went into -rc1 and a use-after-free fix.
The rbd changes have been sitting in a branch for quite a while but
couldn't be included into the -rc1 pull request because of a pending
wire protocol backwards compatibility fixup that only got committed
early this week.
-----BEGIN PGP SIGNATURE-----
iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAluSrJYTHGlkcnlvbW92
QGdtYWlsLmNvbQAKCRBKf944AhHzi/N8B/4sZzRCJMCejvU/yRq91NlaPDrxbVHh
nfICZ/8Fsy/fmvK8NWNyHcCIWx+nWrbCvCJMj0fxWMhk/1t75yC+TdyCJnyuhsQU
V/CPTs9BTdwrSUiTB83/n/ukGL6mpESk0CQ1er/l1EO6FnNOXvgzHDnCqUQZLdzU
1aRcx5JQWWo/QlCmzt2KWENhfQRMvLAtf04F5cUuR+JTrMjwWia6MAuRGuOhVQkW
XIlFNakBKab89Vod1pmA7BrG/+sHXCpVGX6sjAp9vQUWO3WWKBRnNtVwo9dPSHah
hBR8IzOkihw7HfTlINWVpiR69nTfM80PQHXJkFSp36E6Sfq8EShRpFIZ
=pga5
-----END PGP SIGNATURE-----
Merge tag 'ceph-for-4.19-rc3' of https://github.com/ceph/ceph-client
Pull ceph fixes from Ilya Dryomov:
"Two rbd patches to complete support for images within namespaces that
went into -rc1 and a use-after-free fix.
The rbd changes have been sitting in a branch for quite a while but
couldn't be included into the -rc1 pull request because of a pending
wire protocol backwards compatibility fixup that only got committed
early this week"
* tag 'ceph-for-4.19-rc3' of https://github.com/ceph/ceph-client:
rbd: support cloning across namespaces
rbd: factor out get_parent_info()
ceph: avoid a use-after-free in ceph_destroy_options()
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAluSNfQACgkQnJ2qBz9k
QNloSAf/RpsqUnmQvJKK7hQUVNMCQP/Kf3KND5iN5RfMbhU9r7tzERkNvqhdA6QZ
uoPi8dEecI+ihY5F8ddyw1Chaou4MToWKdNz4ojwJXVrN6bb+pq+xj0hTvT5FjFh
iM1JXHtSEk6W+CnXPE5CycrZppIHxJfJxeaWg7av5Zyc4nkTesxtG8PycMBxROW8
detUcJt15VGBswi19udztf7XY/lwDwUQ9LwC0W5B+o8pKIwuN3ENMVVOeAriAyoy
hXTpPA8twBhM7i8D/1eppDCkYLTr08bquNsDpn8kUEf2RxcxiFJuDLOeXiH3sQRq
BZmf/QIIRA8R+SPeFiuxY/795FDC6Q==
=CWu1
-----END PGP SIGNATURE-----
Merge tag 'for_v4.19-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull fsnotify fix from Jan Kara:
"A small fsnotify fix from Amir"
* tag 'for_v4.19-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
fsnotify: fix ignore mask logic in fsnotify()
the 9p client code overwrites our glock.client_id pointing to a static
buffer by an allocated string holding the network provided value which
we do not care about; free and reset the value as appropriate.
This is almost identical to the leak in v9fs_file_getlock() fixed by
Al Viro in commit ce85dd58ad ("9p: we are leaking glock.client_id
in v9fs_file_getlock()"), which was returned as an error by a coverity
false positive -- while we are here attempt to make the code slightly
more robust to future change of the net/9p/client code and hopefully
more clear to coverity that there is no problem.
Link: http://lkml.kernel.org/r/1536339057-21974-5-git-send-email-asmadeus@codewreck.org
Signed-off-by: Dominique Martinet <dominique.martinet@cea.fr>
iattr is passed to v9fs_vfs_setattr_dotl which does send various
values from iattr over the wire, even if it tells the server to
only look at iattr.ia_valid fields this could leak some stack data.
Link: http://lkml.kernel.org/r/1536339057-21974-2-git-send-email-asmadeus@codewreck.org
Addresses-Coverity-ID: 1195601 ("Uninitalized scalar variable")
Signed-off-by: Dominique Martinet <dominique.martinet@cea.fr>
Replace "fallthough" with a proper "fall through" annotation.
This fix is part of the ongoing efforts to enabling -Wimplicit-fallthrough
Link: http://lkml.kernel.org/r/20180903193806.GA11258@embeddedor.com
Addresses-Coverity-ID: 402012 ("Missing break in switch")
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Dominique Martinet <dominique.martinet@cea.fr>
We don't support Free Space Table and Free Space Bitmap as specified by
UDF standard for writing as we don't support erasing blocks before
overwriting them. Just drop the handling of these structures as
partition descriptor checking code already makes sure such filesystems
can be mounted only read-only.
Signed-off-by: Jan Kara <jack@suse.cz>
Fix a crash during an attempt to mount a filesystem that has both
Unallocated Space Table and Unallocated Space Bitmap. Such filesystem
actually violates the UDF standard so we just have to properly detect
such situation and refuse to mount such filesystem read-write. When we
are at it, verify also other constraints on the allocation information
mandated by the standard.
Reported-by: Anatoly Trosinenko <anatoly.trosinenko@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
There are certain filesystem features which we support for reading but
not for writing. We properly refuse to mount such filesystems read-write
however for some features (such as read-only partitions), we don't check
for these features when remounting the filesystem from read-only to
read-write. Thus such filesystems could be remounted read-write leading
to strange behavior (most likely crashes).
Fix the problem by marking in superblock whether the filesystem has some
features that are supported in read-only mode and check this flag during
remount.
Signed-off-by: Jan Kara <jack@suse.cz>
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAluRb84ACgkQiiy9cAdy
T1GBuAv/ZkXC5cxNpE84dRLii4ey+Y0u5ip3VIyzxwburDDxQh99zWTs8FdyYe1f
5CYD4PKcNajVceNUg1EdfkNC4ss21I2HcxujquCo8gZJqrt5lHvZXELJ31d8ovXG
Y2MRl5+KZKLx1sBgzsGJf3aZOneObp7EEAnL/bjeziX7caD6uO/F52MXcMgWrpoM
krvWSMzS5iY+jRltvPLhTzUmfbaPoS86FRNIHHOiA8AgQLvx3CT4lL+kJOHv1bHX
haZc9zKy1iUU+yK05vnNLOHVlPeZDa8j/Q8lcEfTrDVa0J/je4DkI9HsB1X54vz0
65JluAf4G4vaSYU/hnLaWt4PZ7owjTr7fJlu0c8TE6aPqAZEYYoVhjbrdXg7OI4j
9nUsoEY1hyBH2X5qStvpb9GiZbhR19bhcvnvOAgZ7p4VnyKfexe7o2QHTCPPAxaP
uYf8OeHjmnYF3MQI0s6rA0TPdZxc77acgrZQWUoOMPN/bwkyP1bSfA1aURJAsx/k
OodjAog1
=WRXo
-----END PGP SIGNATURE-----
Merge tag '4.19-rc2-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fixes from Steve French:
"Four small SMB3 fixes, three for stable, and one minor debug
clarification"
* tag '4.19-rc2-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
cifs: connect to servername instead of IP for IPC$ share
smb3: check for and properly advertise directory lease support
smb3: minor debugging clarifications in rfc1001 len processing
SMB3: Backup intent flag missing for directory opens with backupuid mounts
fs/cifs: don't translate SFM_SLASH (U+F026) to backslash
default_acl and acl of newly created inode will be initiated
as ACL_NOT_CACHED in vfs function inode_init_always() and later
will be updated by calling xxx_init_acl() in specific filesystems.
Howerver, when default_acl and acl are NULL then they keep the value
of ACL_NOT_CACHED, this patch tries to cache NULL for acl/default_acl
in this case.
Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAluRLa8ACgkQxWXV+ddt
WDvc+BAAqxTMVngZ60WfktXzsS56OB6fu/R3DORgYcSZ0BCD4zTwoDlCjLhrCK6E
cmC+BMj+AspDQYiYESwGyFcN10sK0X7w7fa3wypTc4GNWxpkRm0Z6zT/kCvLUhdI
NlkMqAfsZ9N6iIXcR0qOxI7G55e3mpXPZGdFTk5rmDTv/9TqU0TMp9s8Zw5scn6R
ctdE+iE0lpRfNjF8ZDH1BtYIV4g2X81sZF/fkGz621HQfMTCjjPHFdlz+jlirBaf
BrYR4w4zjVuMKd3ZC5FHffVchbkvt29h6fAr4sEpJTwFJwd8pjI7GuPYWDQ918NB
TGX6EUP6usQqDK2zD405jCS6MbMshJm3uh5kmEpeNgK/tKJTln8Sbef/Xs93yIn2
+k9BMKOIcUHHBiv6PgCaZomcWCpii2S2u6vncqCnNuI4wK1RN3gHJc5YPhJArlrB
NUFJiTCQE6LWYOP2Hw+rggcrtBxli0bX7Mqp5FYFVdh5KBvolJE1o3B/JS8qpqRF
u0dPwbLHtTpTpXM5EfmM8a45S+DxuxTDBh3vdoAOM9LN/ivpeqqnFbHrIGmrTMjo
pQJ8aTrCwYMEMNu6oCV1cniFrOYRZ439hYjg524MjVXYCRyxhzAdVmVTEBaLjWCW
9GlGqEC7YZY2wLi5lPEGqxsIaVVELpettJB9KbBKmYB47VFWEf0=
=fu93
-----END PGP SIGNATURE-----
Merge tag 'for-4.19-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- fix for improper fsync after hardlink
- fix for a corruption during file deduplication
- use after free fixes
- RCU warning fix
- fix for buffered write to nodatacow file
* tag 'for-4.19-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: Fix suspicious RCU usage warning in btrfs_debug_in_rcu
btrfs: use after free in btrfs_quota_enable
btrfs: btrfs_shrink_device should call commit transaction at the end
btrfs: fix qgroup_free wrong num_bytes in btrfs_subvolume_reserve_metadata
Btrfs: fix data corruption when deduplicating between different files
Btrfs: sync log after logging new name
Btrfs: fix unexpected failure of nocow buffered writes after snapshotting when low on space
syzbot reported a use-after-free in ceph_destroy_options(), called from
ceph_mount(). The problem was that create_fs_client() consumed the opt
pointer on some errors, but not on all of them. Make sure it always
consumes both libceph and ceph options.
Reported-by: syzbot+8ab6f1042021b4eed062@syzkaller.appspotmail.com
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
1. Create a file in an encrypted directory
2. Do GC & drop caches
3. Read stale data before its bio for metapage was not issued yet
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Don't limit printing log, so that we will not miss any key messages.
This reverts commit a36c106dff.
In addition, we use printk_ratelimited to avoid too many log prints.
- error injection
- discard submission failure
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
When dev is busy, discard thread wake up timeout can be aligned with the
exact time that it needs to wait for dev to come out of busy. This helps
to avoid unnecessary periodic wakeups and thus save some power.
Signed-off-by: Sahitya Tummala <stummala@codeaurora.org>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
https://bugzilla.kernel.org/show_bug.cgi?id=200951
These is a NULL pointer dereference issue reported in bugzilla:
Hi,
in the setup there is a SATA SSD connected to a SATA-to-USB bridge.
The disc is "Samsung SSD 850 PRO 256G" which supports TRIM.
There are four partitions:
sda1: FAT /boot
sda2: F2FS /
sda3: F2FS /home
sda4: F2FS
The bridge is ASMT1153e which uses the "uas" driver.
There is no TRIM pass-through, so, when mounting it reports:
mounting with "discard" option, but the device does not support discard
The USB host is USB3.0 and UASP capable. It is the one on RK3399.
Given this everything works fine, except there is no TRIM support.
In order to enable TRIM a new UDEV rule is added [1]:
/etc/udev/rules.d/10-sata-bridge-trim.rules:
ACTION=="add|change", ATTRS{idVendor}=="174c", ATTRS{idProduct}=="55aa", SUBSYSTEM=="scsi_disk", ATTR{provisioning_mode}="unmap"
After reboot any F2FS write hangs forever and dmesg reports:
Unable to handle kernel NULL pointer dereference
Also tested on a x86_64 system: works fine even with TRIM enabled.
same disc
same bridge
different usb host controller
different cpu architecture
not root filesystem
Regards,
Vicenç.
[1] Post #5 in https://bbs.archlinux.org/viewtopic.php?id=236280
Unable to handle kernel NULL pointer dereference at virtual address 000000000000003e
Mem abort info:
ESR = 0x96000004
Exception class = DABT (current EL), IL = 32 bits
SET = 0, FnV = 0
EA = 0, S1PTW = 0
Data abort info:
ISV = 0, ISS = 0x00000004
CM = 0, WnR = 0
user pgtable: 4k pages, 48-bit VAs, pgdp = 00000000626e3122
[000000000000003e] pgd=0000000000000000
Internal error: Oops: 96000004 [#1] SMP
Modules linked in: overlay snd_soc_hdmi_codec rc_cec dw_hdmi_i2s_audio dw_hdmi_cec snd_soc_simple_card snd_soc_simple_card_utils snd_soc_rockchip_i2s rockchip_rga snd_soc_rockchip_pcm rockchipdrm videobuf2_dma_sg v4l2_mem2mem rtc_rk808 videobuf2_memops analogix_dp videobuf2_v4l2 videobuf2_common dw_hdmi dw_wdt cec rc_core videodev drm_kms_helper media drm rockchip_thermal rockchip_saradc realtek drm_panel_orientation_quirks syscopyarea sysfillrect sysimgblt fb_sys_fops dwmac_rk stmmac_platform stmmac pwm_bl squashfs loop crypto_user gpio_keys hid_kensington
CPU: 5 PID: 957 Comm: nvim Not tainted 4.19.0-rc1-1-ARCH #1
Hardware name: Sapphire-RK3399 Board (DT)
pstate: 00000005 (nzcv daif -PAN -UAO)
pc : update_sit_entry+0x304/0x4b0
lr : update_sit_entry+0x108/0x4b0
sp : ffff00000ca13bd0
x29: ffff00000ca13bd0 x28: 000000000000003e
x27: 0000000000000020 x26: 0000000000080000
x25: 0000000000000048 x24: ffff8000ebb85cf8
x23: 0000000000000253 x22: 00000000ffffffff
x21: 00000000000535f2 x20: 00000000ffffffdf
x19: ffff8000eb9e6800 x18: ffff8000eb9e6be8
x17: 0000000007ce6926 x16: 000000001c83ffa8
x15: 0000000000000000 x14: ffff8000f602df90
x13: 0000000000000006 x12: 0000000000000040
x11: 0000000000000228 x10: 0000000000000000
x9 : 0000000000000000 x8 : 0000000000000000
x7 : 00000000000535f2 x6 : ffff8000ebff3440
x5 : ffff8000ebff3440 x4 : ffff8000ebe3a6c8
x3 : 00000000ffffffff x2 : 0000000000000020
x1 : 0000000000000000 x0 : ffff8000eb9e5800
Process nvim (pid: 957, stack limit = 0x0000000063a78320)
Call trace:
update_sit_entry+0x304/0x4b0
f2fs_invalidate_blocks+0x98/0x140
truncate_node+0x90/0x400
f2fs_remove_inode_page+0xe8/0x340
f2fs_evict_inode+0x2b0/0x408
evict+0xe0/0x1e0
iput+0x160/0x260
do_unlinkat+0x214/0x298
__arm64_sys_unlinkat+0x3c/0x68
el0_svc_handler+0x94/0x118
el0_svc+0x8/0xc
Code: f9400800 b9488400 36080140 f9400f01 (387c4820)
---[ end trace a0f21a307118c477 ]---
The reason is it is possible to enable discard flag on block queue via
UDEV, but during mount, f2fs will initialize se->discard_map only if
this flag is set, once the flag is set after mount, f2fs may dereference
NULL pointer on se->discard_map.
So this patch does below changes to fix this issue:
- initialize and update se->discard_map all the time.
- don't clear DISCARD option if device has no QUEUE_FLAG_DISCARD flag
during mount.
- don't issue small discard on zoned block device.
- introduce some functions to enhance the readability.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Tested-by: Vicente Bergas <vicencb@gmail.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Add additinal sanity check for irregular case(e.g. corruption).
If size of extended attribution is smaller than size of acl header,
then return -EINVAL.
Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Remove the verbose license text from NILFS2 files and replace them with
SPDX tags. This does not change the license of any of the code.
Link: http://lkml.kernel.org/r/1535624528-5982-1-git-send-email-konishi.ryusuke@lab.ntt.co.jp
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Introduce CONFIG_STACKLEAK_METRICS providing STACKLEAK information about
tasks via the /proc file system. In particular, /proc/<pid>/stack_depth
shows the maximum kernel stack consumption for the current and previous
syscalls. Although this information is not precise, it can be useful for
estimating the STACKLEAK performance impact for your workloads.
Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Alexander Popov <alex.popov@linux.com>
Tested-by: Laura Abbott <labbott@redhat.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
These are unused, undesired, and have never actually been used by
anybody. The original authors of this code have changed their mind about
its inclusion. While originally proposed for disk encryption on low-end
devices, the idea was discarded [1] in favor of something else before
that could really get going. Therefore, this patch removes Speck.
[1] https://marc.info/?l=linux-crypto-vger&m=153359499015659
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Acked-by: Eric Biggers <ebiggers@google.com>
Cc: stable@vger.kernel.org
Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
An online resize of a file system with the bigalloc feature enabled
and a 1k block size would be refused since ext4_resize_begin() did not
understand s_first_data_block is 0 for all bigalloc file systems, even
when the block size is 1k.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
Avoid growing the file system to an extent so that the last block
group is too small to hold all of the metadata that must be stored in
the block group.
This problem can be triggered with the following reproducer:
umount /mnt
mke2fs -F -m0 -b 4096 -t ext4 -O resize_inode,^has_journal \
-E resize=1073741824 /tmp/foo.img 128M
mount /tmp/foo.img /mnt
truncate --size 1708M /tmp/foo.img
resize2fs /dev/loop0 295400
umount /mnt
e2fsck -fy /tmp/foo.img
Reported-by: Torsten Hilbrich <torsten.hilbrich@secunet.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
Add another mark type flag FAN_MARK_FILESYSTEM for add/remove/flush
of super block mark type.
A super block watch gets all events on the filesystem, regardless of
the mount from which the mark was added, unless an ignore mask exists
on either the inode or the mount where the event was generated.
Only one of FAN_MARK_MOUNT and FAN_MARK_FILESYSTEM mark type flags
may be provided to fanotify_mark() or no mark type flag for inode mark.
Cc: <linux-api@vger.kernel.org>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Send events to group if super block mark mask matches the event
and unless the same group has an ignore mask on the vfsmount or
the inode on which the event occurred.
Soon, fanotify backend is going to support super block marks and
fanotify backend only supports path type events.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Add the infrastructure to attach a mark to a super_block struct
and detach all attached marks when super block is destroyed.
This is going to be used by fanotify backend to setup super block
marks.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
There have been multiple kernel vulnerabilities that permitted userspace to
pass completely unchecked pointers through to userspace accessors:
- the waitid() bug - commit 96ca579a1e ("waitid(): Add missing
access_ok() checks")
- the sg/bsg read/write APIs
- the infiniband read/write APIs
These don't happen all that often, but when they do happen, it is hard to
test for them properly; and it is probably also hard to discover them with
fuzzing. Even when an unmapped kernel address is supplied to such buggy
code, it just returns -EFAULT instead of doing a proper BUG() or at least
WARN().
Try to make such misbehaving code a bit more visible by refusing to do a
fixup in the pagefault handler code when a userspace accessor causes a #PF
on a kernel address and the current context isn't whitelisted.
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Kees Cook <keescook@chromium.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: kernel-hardening@lists.openwall.com
Cc: dvyukov@google.com
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: "Naveen N. Rao" <naveen.n.rao@linux.vnet.ibm.com>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Cc: Borislav Petkov <bp@alien8.de>
Link: https://lkml.kernel.org/r/20180828201421.157735-7-jannh@google.com
Commit 92183a4289 ("fsnotify: fix ignore mask logic in
send_to_group()") acknoledges the use case of ignoring an event on
an inode mark, because of an ignore mask on a mount mark of the same
group (i.e. I want to get all events on this file, except for the events
that came from that mount).
This change depends on correctly merging the inode marks and mount marks
group lists, so that the mount mark ignore mask would be tested in
send_to_group(). Alas, the merging of the lists did not take into
account the case where event in question is not in the mask of any of
the mount marks.
To fix this, completely remove the tests for inode and mount event masks
from the lists merging code.
Fixes: 92183a4289 ("fsnotify: fix ignore mask logic in send_to_group")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
default_acl and acl of newly created inode will be initiated as
ACL_NOT_CACHED in vfs function inode_init_always() and later will be
updated by calling xxx_init_acl() in specific filesystems. However,
when default_acl and acl are NULL, then they keep the value of
ACL_NOT_CACHED. This patch changes the code to cache NULL for acl /
default_acl in this case to save unnecessary ACL lookup in future.
Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Variables group_start and nr_groups are being assigned but are never used
hence they are redundant and can be removed.
Cleans up clang warning:
variable 'group_start' set but not used [-Wunused-but-set-variable]
variable 'nr_groups' set but not used [-Wunused-but-set-variable]
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Jan Kara <jack@suse.cz>
This patch is required allows access to a Microsoft fileserver failover
cluster behind a 1:1 NAT firewall.
The change also provides stronger context for authentication and share
connection (see MS-SMB2 3.3.5.7 and MS-SRVS 3.1.6.8) as noted by
Tom Talpey, and addresses comments about the buffer size for the UNC
made by Aurélien Aptel.
Signed-off-by: Thomas Werschlein <thomas.werschlein@geo.uzh.ch>
Signed-off-by: Steve French <stfrench@microsoft.com>
CC: Tom Talpey <ttalpey@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
CC: Stable <stable@vger.kernel.org>
Although servers will typically ignore unsupported features,
we should advertise the support for directory leases (as
Windows e.g. does) in the negotiate protocol capabilities we
pass to the server, and should check for the server capability
(CAP_DIRECTORY_LEASING) before sending a lease request for an
open of a directory. This will prevent us from accidentally
sending directory leases to SMB2.1 or SMB2 server for example.
Signed-off-by: Steve French <stfrench@microsoft.com>
CC: Stable <stable@vger.kernel.org>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
I ran into some cases where server was returning the wrong length
on frames but I couldn't easily match them to the command in the
network trace (or server logs) since I need the command and/or
multiplex id to find the offending SMB2/SMB3 command. Add these
two fields to the log message. In the case of padding too much
it may not be a problem in all cases but might have correlated
to a network disconnect case in some problems we have been
looking at. In the case of frame too short is even more important.
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
When "backup intent" is requested on the mount (e.g. backupuid or
backupgid mount options), the corresponding flag needs to be set
on opens of directories (and files) but was missing in some
places causing access denied trying to enumerate and backup
servers.
Fixes kernel bugzilla #200953https://bugzilla.kernel.org/show_bug.cgi?id=200953
Reported-and-tested-by: <whh@rubrik.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
CC: Stable <stable@vger.kernel.org>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
When a Mac client saves an item containing a backslash to a file server
the backslash is represented in the CIFS/SMB protocol as as U+F026.
Before this change, listing a directory containing an item with a
backslash in its name will return that item with the backslash
represented with a true backslash character (U+005C) because
convert_sfm_character mapped U+F026 to U+005C when interpretting the
CIFS/SMB protocol response. However, attempting to open or stat the
path using a true backslash will result in an error because
convert_to_sfm_char does not map U+005C back to U+F026 causing the
CIFS/SMB request to be made with the backslash represented as U+005C.
This change simply prevents the U+F026 to U+005C conversion from
happenning. This is analogous to how the code does not do any
translation of UNI_SLASH (U+F000).
Signed-off-by: Jon Kuhn <jkuhn@barracuda.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Pull core fixes from Thomas Gleixner:
"A small set of updates for core code:
- Prevent tracing in functions which are called from trace patching
via stop_machine() to prevent executing half patched function trace
entries.
- Remove old GCC workarounds
- Remove pointless includes of notifier.h"
* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
objtool: Remove workaround for unreachable warnings from old GCC
notifier: Remove notifier header file wherever not used
watchdog: Mark watchdog touch functions as notrace
When mounting the superblock, ext4_fill_super() calculates the free
blocks and free inodes and stores them in the superblock. It's not
strictly necessary, since we don't use them any more, but it's nice to
keep them roughly aligned to reality.
Since it's not critical for file system correctness, the code doesn't
call ext4_commit_super(). The problem is that it's in
ext4_commit_super() that we recalculate the superblock checksum. So
if we're not going to call ext4_commit_super(), we need to call
ext4_superblock_csum_set() to make sure the superblock checksum is
consistent.
Most of the time, this doesn't matter, since we end up calling
ext4_commit_super() very soon thereafter, and definitely by the time
the file system is unmounted. However, it doesn't work in this
sequence:
mke2fs -Fq -t ext4 /dev/vdc 128M
mount /dev/vdc /vdc
cp xfstests/git-versions /vdc
godown /vdc
umount /vdc
mount /dev/vdc
tune2fs -l /dev/vdc
With this commit, the "tune2fs -l" no longer fails.
Reported-by: Chengguang Xu <cgxu519@gmx.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
A maliciously crafted file system can cause an overflow when the
results of a 64-bit calculation is stored into a 32-bit length
parameter.
https://bugzilla.kernel.org/show_bug.cgi?id=200623
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reported-by: Wen Xu <wen.xu@gatech.edu>
Cc: stable@vger.kernel.org
Since overlayfs implements stacked file operations, the underlying
filesystems are not supposed to be exposed to the overlayfs file,
whose f_inode is an overlayfs inode.
Assigning an overlayfs file to swap_file results in an attempt of xfs
code to dereference an xfs_inode struct from an ovl_inode pointer:
CPU: 0 PID: 2462 Comm: swapon Not tainted
4.18.0-xfstests-12721-g33e17876ea4e #3402
RIP: 0010:xfs_find_bdev_for_inode+0x23/0x2f
Call Trace:
xfs_iomap_swapfile_activate+0x1f/0x43
__se_sys_swapon+0xb1a/0xee9
Fix this by not assigning the real inode mapping to f_mapping, which
will cause swapon() to return an error (-EINVAL). Although it makes
sense not to allow setting swpafile on an overlayfs file, some users
may depend on it, so we may need to fix this up in the future.
Keeping f_mapping pointing to overlay inode mapping will cause O_DIRECT
open to fail. Fix this by installing ovl_aops with noop_direct_IO in
overlay inode mapping.
Keeping f_mapping pointing to overlay inode mapping will cause other
a_ops related operations to fail (e.g. readahead()). Those will be
fixed by follow up patches.
Suggested-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: f7c72396d0 ("ovl: add O_DIRECT support")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAluGfbAACgkQnJ2qBz9k
QNmkVQgAjv/tCHStwoQ4Hhr6q5wLU9apAW5WC7Or2MyNfqoJsKgyujpikwFMa0xY
wkVqlITYavvUKl/nRlQ6hpZtAY7hi1Y7GvIYs2ci6QO4YAUuBoH7qPLKqyZYzXx+
mBaC68885nMMDqIHvsCcLurwTiDer6XQXXPDmpMoO9g4kyVuhm2e0/M7CPaA4Ue9
WEfBPBROSNdRH7Wtww/MUvtfd+9mezdlpUQZVVO5cdhAVQ8pzW2g2piYtwIpaY7E
DiJ1zcgaqCnni+b69x4wz8TwB0q8wZJjrPfJgf4X7mIGBZrw80yNRlevfe5SxwdJ
mkMDvjaD0TEItvpjgTZReXaAjFOWTA==
=E61M
-----END PGP SIGNATURE-----
Merge tag 'for_v4.19-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull misc fs fixes from Jan Kara:
- make UDF to properly mount media created by Win7
- make isofs to properly refuse devices with large physical block size
- fix a Spectre gadget in quotactl(2)
- fix a warning in fsnotify code hit by syzkaller
* tag 'for_v4.19-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
udf: Fix mounting of Win7 created UDF filesystems
udf: Remove dead code from udf_find_fileset()
fs/quota: Fix spectre gadget in do_quotactl
fs/quota: Replace XQM_MAXQUOTAS usage with MAXQUOTAS
isofs: reject hardware sector size > 2048 bytes
fsnotify: fix false positive warning on inode delete
After changing over to 64-bit time_t syscalls, many architectures will
want compat_sys_utimensat() but not respective handlers for utime(),
utimes() and futimesat(). This adds a new __ARCH_WANT_SYS_UTIME32 to
complement __ARCH_WANT_SYS_UTIME. For now, all 64-bit architectures that
support CONFIG_COMPAT set it, but future 64-bit architectures will not
(tile would not have needed it either, but got removed).
As older 32-bit architectures get converted to using CONFIG_64BIT_TIME,
they will have to use __ARCH_WANT_SYS_UTIME32 instead of
__ARCH_WANT_SYS_UTIME. Architectures using the generic syscall ABI don't
need either of them as they never had a utime syscall.
Since the compat_utimbuf structure is now required outside of
CONFIG_COMPAT, I'm moving it into compat_time.h.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
---
changed from last version:
- renamed __ARCH_WANT_COMPAT_SYS_UTIME to __ARCH_WANT_SYS_UTIME32
There are four generations of utimes() syscalls: utime(), utimes(),
futimesat() and utimensat(), each one being a superset of the previous
one. For y2038 support, we have to add another one, which is the same
as the existing utimensat() but always passes 64-bit times_t based
timespec values.
There are currently 10 architectures that only use utimensat(), two
that use utimes(), futimesat() and utimensat() but not utime(), and 11
architectures that have all four, and those define __ARCH_WANT_SYS_UTIME
in order to get a sys_utime implementation. Since all the new
architectures only want utimensat(), moving all the legacy entry points
into a common __ARCH_WANT_SYS_UTIME guard simplifies the logic. Only alpha
and ia64 grow a tiny bit as they now also get an unused sys_utime(),
but it didn't seem worth the extra complexity of adding yet another
ifdef for those.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
When 32-bit architectures get changed to support 64-bit time_t,
utimensat() needs to use the new __kernel_timespec structure as its
argument.
The older utime(), utimes() and futimesat() system calls don't need a
corresponding change as they are no longer used on C libraries that have
64-bit time support.
As we do for the other syscalls that have timespec arguments, we reuse
the 'compat' syscall entry points to implement the traditional four
interfaces, and only leave the new utimensat() as a native handler,
so that the same code gets used on both 32-bit and 64-bit kernels
on each syscall.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
The sys_llseek sytem call is needed on all 32-bit architectures and
none of the 64-bit ones, so we can remove the __ARCH_WANT_SYS_LLSEEK guard
and simplify the include/asm-generic/unistd.h header further.
Since 32-bit tasks can run either natively or in compat mode on 64-bit
architectures, we have to check for both !CONFIG_64BIT and CONFIG_COMPAT.
There are a few 64-bit architectures that also reference sys_llseek
in their 64-bit ABI (e.g. sparc), but I verified that those all
select CONFIG_COMPAT, so the #if check is still correct here. It's
a bit odd to include it in the syscall table though, as it's the
same as sys_lseek() on 64-bit, but with strange calling conventions.
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
We have four generations of stat() syscalls:
- the oldstat syscalls that are only used on the older architectures
- the newstat family that is used on all 64-bit architectures but
lacked support for large files on 32-bit architectures.
- the stat64 family that is used mostly on 32-bit architectures to
replace newstat
- statx() to replace all of the above, adding 64-bit timestamps among
other things.
We already compile stat64 only on those architectures that need it,
but newstat is always built, including on those that don't reference
it. This adds a new __ARCH_WANT_NEW_STAT symbol along the lines of
__ARCH_WANT_OLD_STAT and __ARCH_WANT_STAT64 to control compilation of
newstat. All architectures that need it use an explict define, the
others now get a little bit smaller, and future architecture (including
64-bit targets) won't ever see it.
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
The GFS2_RDF_UPTODATE flag in the rgrp is used to determine when
a rgrp buffer is valid. It's cleared when the glock is invalidated,
signifying that the buffer data is now invalid. But before this
patch, function update_rgrp_lvb was setting the flag when it
determined it had a valid lvb. But that's an invalid assumption:
just because you have a valid lvb doesn't mean you have valid
buffers. After all, another node may have made the lvb valid,
and this node just fetched it from the glock via dlm.
Consider this scenario:
1. The file system is mounted with RGRPLVB option.
2. In gfs2_inplace_reserve it locks the rgrp glock EX, but thanks
to GL_SKIP, it skips the gfs2_rgrp_bh_get.
3. Since loops == 0 and the allocation target (ap->target) is
bigger than the largest known chunk of blocks in the rgrp
(rs->rs_rbm.rgd->rd_extfail_pt) it skips that rgrp and bypasses
the call to gfs2_rgrp_bh_get there as well.
4. update_rgrp_lvb sees the lvb MAGIC number is valid, so bypasses
gfs2_rgrp_bh_get, but it still sets sets GFS2_RDF_UPTODATE due
to this invalid assumption.
5. The next time update_rgrp_lvb is called, it sees the bit is set
and just returns 0, assuming both the lvb and rgrp are both
uptodate. But since this is a smaller allocation, or space has
been freed by another node, thus adjusting the lvb values,
it decides to use the rgrp for allocations, with invalid rd_free
due to the fact it was never updated.
This patch changes update_rgrp_lvb so it doesn't set the UPTODATE
flag anymore. That way, it has no choice but to fetch the latest
values.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Before this patch, gfs2_rgrp_bh_get would check for lvb mismatches,
but it wouldn't tell you what was actually wrong. This patch adds
more information to help us debug it. It also makes rgrp consistency
checks dump any bad rgrps, and the rgrp dump code dump any lvbs
as well as the rgrp itself.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
A specially crafted file system can trick empty_inline_dir() into
reading past the last valid entry in a inline directory, and then run
into the end of xattr marker. This will trigger a divide by zero
fault. Fix this by using the size of the inline directory instead of
dir->i_size.
Also clean up error reporting in __ext4_check_dir_entry so that the
message is clearer and more understandable --- and avoids the division
by zero trap if the size passed in is zero. (I'm not sure why we
coded it that way in the first place; printing offset % size is
actually more confusing and less useful.)
https://bugzilla.kernel.org/show_bug.cgi?id=200933
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reported-by: Wen Xu <wen.xu@gatech.edu>
Cc: stable@vger.kernel.org
Christoph Hellwig suggested a slightly different path for handling
backwards compatibility with the 32-bit time_t based system calls:
Rather than simply reusing the compat_sys_* entry points on 32-bit
architectures unchanged, we get rid of those entry points and the
compat_time types by renaming them to something that makes more sense
on 32-bit architectures (which don't have a compat mode otherwise),
and then share the entry points under the new name with the 64-bit
architectures that use them for implementing the compatibility.
The following types and interfaces are renamed here, and moved
from linux/compat_time.h to linux/time32.h:
old new
--- ---
compat_time_t old_time32_t
struct compat_timeval struct old_timeval32
struct compat_timespec struct old_timespec32
struct compat_itimerspec struct old_itimerspec32
ns_to_compat_timeval() ns_to_old_timeval32()
get_compat_itimerspec64() get_old_itimerspec32()
put_compat_itimerspec64() put_old_itimerspec32()
compat_get_timespec64() get_old_timespec32()
compat_put_timespec64() put_old_timespec32()
As we already have aliases in place, this patch addresses only the
instances that are relevant to the system call interface in particular,
not those that occur in device drivers and other modules. Those
will get handled separately, while providing the 64-bit version
of the respective interfaces.
I'm not renaming the timex, rusage and itimerval structures, as we are
still debating what the new interface will look like, and whether we
will need a replacement at all.
This also doesn't change the names of the syscall entry points, which can
be done more easily when we actually switch over the 32-bit architectures
to use them, at that point we need to change COMPAT_SYSCALL_DEFINEx to
SYSCALL_DEFINEx with a new name, e.g. with a _time32 suffix.
Suggested-by: Christoph Hellwig <hch@infradead.org>
Link: https://lore.kernel.org/lkml/20180705222110.GA5698@infradead.org/
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
If the destination of the rename(2) system call exists, the inode's
link count (i_nlinks) must be non-zero. If it is, the inode can end
up on the orphan list prematurely, leading to all sorts of hilarity,
including a use-after-free.
https://bugzilla.kernel.org/show_bug.cgi?id=200931
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reported-by: Wen Xu <wen.xu@gatech.edu>
Cc: stable@vger.kernel.org
This suppresses some false positives in gcc 8's -Wstringop-truncation
Suggested by Miguel Ojeda (hopefully the __nonstring definition will
eventually get accepted in the compiler-gcc.h header file).
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
Pull IDA updates from Matthew Wilcox:
"A better IDA API:
id = ida_alloc(ida, GFP_xxx);
ida_free(ida, id);
rather than the cumbersome ida_simple_get(), ida_simple_remove().
The new IDA API is similar to ida_simple_get() but better named. The
internal restructuring of the IDA code removes the bitmap
preallocation nonsense.
I hope the net -200 lines of code is convincing"
* 'ida-4.19' of git://git.infradead.org/users/willy/linux-dax: (29 commits)
ida: Change ida_get_new_above to return the id
ida: Remove old API
test_ida: check_ida_destroy and check_ida_alloc
test_ida: Convert check_ida_conv to new API
test_ida: Move ida_check_max
test_ida: Move ida_check_leaf
idr-test: Convert ida_check_nomem to new API
ida: Start new test_ida module
target/iscsi: Allocate session IDs from an IDA
iscsi target: fix session creation failure handling
drm/vmwgfx: Convert to new IDA API
dmaengine: Convert to new IDA API
ppc: Convert vas ID allocation to new IDA API
media: Convert entity ID allocation to new IDA API
ppc: Convert mmu context allocation to new IDA API
Convert net_namespace to new IDA API
cb710: Convert to new IDA API
rsxx: Convert to new IDA API
osd: Convert to new IDA API
sd: Convert to new IDA API
...
Pull perf updates from Thomas Gleixner:
"Kernel:
- Improve kallsyms coverage
- Add x86 entry trampolines to kcore
- Fix ARM SPE handling
- Correct PPC event post processing
Tools:
- Make the build system more robust
- Small fixes and enhancements all over the place
- Update kernel ABI header copies
- Preparatory work for converting libtraceevnt to a shared library
- License cleanups"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (100 commits)
tools arch: Update arch/x86/lib/memcpy_64.S copy used in 'perf bench mem memcpy'
tools arch x86: Update tools's copy of cpufeatures.h
perf python: Fix pyrf_evlist__read_on_cpu() interface
perf mmap: Store real cpu number in 'struct perf_mmap'
perf tools: Remove ext from struct kmod_path
perf tools: Add gzip_is_compressed function
perf tools: Add lzma_is_compressed function
perf tools: Add is_compressed callback to compressions array
perf tools: Move the temp file processing into decompress_kmodule
perf tools: Use compression id in decompress_kmodule()
perf tools: Store compression id into struct dso
perf tools: Add compression id into 'struct kmod_path'
perf tools: Make is_supported_compression() static
perf tools: Make decompress_to_file() function static
perf tools: Get rid of dso__needs_decompress() call in __open_dso()
perf tools: Get rid of dso__needs_decompress() call in symbol__disassemble()
perf tools: Get rid of dso__needs_decompress() call in read_object_code()
tools lib traceevent: Change to SPDX License format
perf llvm: Allow passing options to llc in addition to clang
perf parser: Improve error message for PMU address filters
...
* memory_failure() gets confused by dev_pagemap backed mappings. The
recovery code has specific enabling for several possible page states
that needs new enabling to handle poison in dax mappings. Teach
memory_failure() about ZONE_DEVICE pages.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE5DAy15EJMCV1R6v9YGjFFmlTOEoFAlt9ui8ACgkQYGjFFmlT
OEpNRw//XGj9s7sezfJFeol4psJlRUd935yii/gmJRgi/yPf2VxxQG9qyM6SMBUc
75jASfOL6FSsfxHz0kplyWzMDNdrTkNNAD+9rv80FmY7GqWgcas9DaJX7jZ994vI
5SRO7pfvNZcXlo7IhqZippDw3yxkIU9Ufi0YQKaEUm7GFieptvCZ0p9x3VYfdvwM
BExrxQe0X1XUF4xErp5P78+WUbKxP47DLcucRDig8Q7dmHELUdyNzo3E1SVoc7m+
3CmvyTj6XuFQgOZw7ZKun1BJYfx/eD5ZlRJLZbx6wJHRtTXv/Uea8mZ8mJ31ykN9
F7QVd0Pmlyxys8lcXfK+nvpL09QBE0/PhwWKjmZBoU8AdgP/ZvBXLDL/D6YuMTg6
T4wwtPNJorfV4lVD06OliFkVI4qbKbmNsfRq43Ns7PCaLueu4U/eMaSwSH99UMaZ
MGbO140XW2RZsHiU9yTRUmZq73AplePEjxtzR8oHmnjo45nPDPy8mucWPlkT9kXA
oUFMhgiviK7dOo19H4eaPJGqLmHM93+x5tpYxGqTr0dUOXUadKWxMsTnkID+8Yi7
/kzQWCFvySz3VhiEHGuWkW08GZT6aCcpkREDomnRh4MEnETlZI8bblcuXYOCLs6c
nNf1SIMtLdlsl7U1fEX89PNeQQ2y237vEDhFQZftaalPeu/JJV0=
=Ftop
-----END PGP SIGNATURE-----
Merge tag 'libnvdimm-for-4.19_dax-memory-failure' of gitolite.kernel.org:pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm memory-failure update from Dave Jiang:
"As it stands, memory_failure() gets thoroughly confused by dev_pagemap
backed mappings. The recovery code has specific enabling for several
possible page states and needs new enabling to handle poison in dax
mappings.
In order to support reliable reverse mapping of user space addresses:
1/ Add new locking in the memory_failure() rmap path to prevent races
that would typically be handled by the page lock.
2/ Since dev_pagemap pages are hidden from the page allocator and the
"compound page" accounting machinery, add a mechanism to determine
the size of the mapping that encompasses a given poisoned pfn.
3/ Given pmem errors can be repaired, change the speculatively
accessed poison protection, mce_unmap_kpfn(), to be reversible and
otherwise allow ongoing access from the kernel.
A side effect of this enabling is that MADV_HWPOISON becomes usable
for dax mappings, however the primary motivation is to allow the
system to survive userspace consumption of hardware-poison via dax.
Specifically the current behavior is:
mce: Uncorrected hardware memory error in user-access at af34214200
{1}[Hardware Error]: It has been corrected by h/w and requires no further action
mce: [Hardware Error]: Machine check events logged
{1}[Hardware Error]: event severity: corrected
Memory failure: 0xaf34214: reserved kernel page still referenced by 1 users
[..]
Memory failure: 0xaf34214: recovery action for reserved kernel page: Failed
mce: Memory error not recovered
<reboot>
...and with these changes:
Injecting memory failure for pfn 0x20cb00 at process virtual address 0x7f763dd00000
Memory failure: 0x20cb00: Killing dax-pmd:5421 due to hardware memory corruption
Memory failure: 0x20cb00: recovery action for dax page: Recovered
Given all the cross dependencies I propose taking this through
nvdimm.git with acks from Naoya, x86/core, x86/RAS, and of course dax
folks"
* tag 'libnvdimm-for-4.19_dax-memory-failure' of gitolite.kernel.org:pub/scm/linux/kernel/git/nvdimm/nvdimm:
libnvdimm, pmem: Restore page attributes when clearing errors
x86/memory_failure: Introduce {set, clear}_mce_nospec()
x86/mm/pat: Prepare {reserve, free}_memtype() for "decoy" addresses
mm, memory_failure: Teach memory_failure() about dev_pagemap pages
filesystem-dax: Introduce dax_lock_mapping_entry()
mm, memory_failure: Collect mapping size in collect_procs()
mm, madvise_inject_error: Let memory_failure() optionally take a page reference
mm, dev_pagemap: Do not clear ->mapping on final put
mm, madvise_inject_error: Disable MADV_SOFT_OFFLINE for ZONE_DEVICE pages
filesystem-dax: Set page->index
device-dax: Set page->index
device-dax: Enable page_mapping()
device-dax: Convert to vmf_insert_mixed and vm_fault_t
Collection of misc libnvdimm patches for 4.19 submission
* Adding support to read locked nvdimm capacity.
* Change test code to make DSM failure code injection an override.
* Add support for calculate maximum contiguous area for namespace.
* Add support for queueing a short ARS when there is on going ARS for
nvdimm.
* Allow NULL to be passed in to ->direct_access() for kaddr and
pfn params.
* Improve smart injection support for nvdimm emulation testing.
* Fix test code that supports for emulating controller temperature.
* Fix hang on error before devm_memremap_pages()
* Fix a bug that causes user memory corruption when data returned
to user for ars_status.
* Maintainer updates for Ross Zwisler emails and adding Jan Kara to fsdax.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE5DAy15EJMCV1R6v9YGjFFmlTOEoFAlt9uUIACgkQYGjFFmlT
OErL+xAAgWHSGs8w98VtYA9kLDeTYEXutq93wJZQoBu/FMAXuuU3hYmQYnOQU87h
KKYmfDkeusaih1R3IX7mzlegnnzSfQ6MraNSV76M43noJHbRTunknCPZH6ebp4fo
b/eljvWlZF/idM+7YcsnoFMnHSRj2pjJGXmKQDlKedHD+KMxpmk6zEl2s5Y0zvPU
4U7UQLtk3D5IIpLNsLEmxge32BfvNf5IzoSO1aZp7Eqk0+U5Tq3Sq/Tjmd+J0RKt
6WH5yA6NqXQgBh+ayHsYU8YX62RqnbKQZXqVxD35OH64zJEUefnP1fpt9pmaZ9eL
43BPMkpM09eLAikO2ET3/3c2k6h3h9ttz1sH8t/hiroCtfmxs3XgskY06hxpKjZV
EbN+BUmut5Mr+zzYitRr3dbK2aHPVU9IbU7jUw/1Tz23rq3kU5iI7SHHv1b/eWup
1Cr77Z1M6HB8VBhjnJ+R607sbRrnKQUOV7fGzAaIskyUOTWsEvIgTh/6MRiaj9MD
5HXIgc/0y9E+G93s7MsUWwzpB7J6E7EGoybST2SKPtqwtDMPsBNeWRjyA9quBCoN
u1s+e+lWHYutqRW0eisDTTlq3nJwPijSx1nnzhJxw9s1EkCXz3f7KRZhyH1C79Co
7wjiuvKQ79e/HI/oXvGmTnv5lbLEpWYyJ3U3KIFfoUqugeyhr0k=
=5p2n
-----END PGP SIGNATURE-----
Merge tag 'libnvdimm-for-4.19_misc' of gitolite.kernel.org:pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm updates from Dave Jiang:
"Collection of misc libnvdimm patches for 4.19 submission:
- Adding support to read locked nvdimm capacity.
- Change test code to make DSM failure code injection an override.
- Add support for calculate maximum contiguous area for namespace.
- Add support for queueing a short ARS when there is on going ARS for
nvdimm.
- Allow NULL to be passed in to ->direct_access() for kaddr and pfn
params.
- Improve smart injection support for nvdimm emulation testing.
- Fix test code that supports for emulating controller temperature.
- Fix hang on error before devm_memremap_pages()
- Fix a bug that causes user memory corruption when data returned to
user for ars_status.
- Maintainer updates for Ross Zwisler emails and adding Jan Kara to
fsdax"
* tag 'libnvdimm-for-4.19_misc' of gitolite.kernel.org:pub/scm/linux/kernel/git/nvdimm/nvdimm:
libnvdimm: fix ars_status output length calculation
device-dax: avoid hang on error before devm_memremap_pages()
tools/testing/nvdimm: improve emulation of smart injection
filesystem-dax: Do not request kaddr and pfn when not required
md/dm-writecache: Don't request pointer dummy_addr when not required
dax/super: Do not request a pointer kaddr when not required
tools/testing/nvdimm: kaddr and pfn can be NULL to ->direct_access()
s390, dcssblk: kaddr and pfn can be NULL to ->direct_access()
libnvdimm, pmem: kaddr and pfn can be NULL to ->direct_access()
acpi/nfit: queue issuing of ars when an uc error notification comes in
libnvdimm: Export max available extent
libnvdimm: Use max contiguous area for namespace size
MAINTAINERS: Add Jan Kara for filesystem DAX
MAINTAINERS: update Ross Zwisler's email address
tools/testing/nvdimm: Fix support for emulating controller temperature
tools/testing/nvdimm: Make DSM failure code injection an override
acpi, nfit: Prefer _DSM over _LSR for namespace label reads
libnvdimm: Introduce locked DIMM capacity support
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAluAdzkACgkQiiy9cAdy
T1GjlAv/SOsNm2sj9Bcq/Z/CpPoFRFoJBFsLeReF78QdbF/+eUFuQJvq0aIK8BmV
0NRmvlQk9oCGQfWN0pLWeRn7a7xUqMQ7HYKSS1fzW1O+kJGlyA1cFjCbiZe4py6m
AFSlpraPTtL7isX/ZMOyZ1D7YKMj4Fq5wndcHPSnMQXI2GlaUAip5k/zamXUbMmo
dFaGDkXc67un6Y/04v18LsKJtOHgbVIAES2OgO0sjqiwp0cnGATsZl/OGzvsTo31
brstBum/0Ig2Mpr+5IXa4QFoP+naNXDyhv+D69huETwsMSImnjGL6L/GAgUUO5/t
sDN6bpQdM9wqpckNuFcV9hKbBHQ1nZhzr0gUjRnmN8CJFHOgI2Ndo4J4N9slnWo9
wEXJV7RN5/VQh3Ozb7m+DpAYo4K4r3Oexxa3MG7+IFY8t89O39PBc0dEQ6O8ztaJ
NCcubQVpSFxC7fwVjY56IHHofyznS7JLKMAcCe3+ssOHwJt0PZr/iqdBNHC9W4ZE
7fcNZtct
=jMLB
-----END PGP SIGNATURE-----
Merge tag '4.19-rc-smb3' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fixes from Steve French:
"Three small SMB3 fixes, one for stable"
* tag '4.19-rc-smb3' of git://git.samba.org/sfrench/cifs-2.6:
cifs: update internal module version number for cifs.ko to 2.12
cifs: check kmalloc before use
cifs: check if SMB2 PDU size has been padded and suppress the warning
cifs: create a define for how many iovs we need for an SMB2_open()
At the point where r is being checked for different values, r is always
going to be equal to 2 as the previous if statements jump to end or end1
if r is not 2. Hence the assignment to err can be simplified to just
err an assignment without any checks on the value or r.
Detected by CoverityScan, CID#1226737 ("Logically dead code")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull namespace fixes from Eric Biederman:
"This is a set of four fairly obvious bug fixes:
- a switch from d_find_alias to d_find_any_alias because the xattr
code perversely takes a dentry
- two mutex vs copy_to_user fixes from Jann Horn
- a fix to use a sanitized size not the size userspace passed in from
Christian Brauner"
* 'userns-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
getxattr: use correct xattr length
sys: don't hold uts_sem while accessing userspace memory
userns: move user access out of the mutex
cap_inode_getsecurity: use d_find_any_alias() instead of d_find_alias()
Commit 672d599041 ("btrfs: Use wrapper macro for rcu string to remove
duplicate code") replaces some open coded RCU string handling with macro.
It turns out that btrfs_debug_in_rcu() is used for the first time and
the macro lacks lock/unlock of RCU string for non-debug case (i.e. when
the message is not printed), leading to suspicious RCU usage warning
when CONFIG_PROVE_RCU is on.
Fix this by adding a wrapper to call lock/unlock for the non-debug case
too.
Fixes: 672d599041 ("btrfs: Use wrapper macro for rcu string to remove duplicate code")
Reported-by: David Howells <dhowells@redhat.com>
Tested-by: David Howells <dhowells@redhat.com>
Signed-off-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This empty file sneaked into the tree by mistake.
Remove it.
Fixes: 6eb61d587f ("ubifs: Pass struct ubifs_info to ubifs_assert()")
Signed-off-by: Richard Weinberger <richard@nod.at>
Win7 is creating UDF filesystems with single partition with number 8192.
Current partition descriptor scanning code does not handle this well as
it incorrectly assumes that partition numbers will form mostly contiguous
space of small numbers. This results in unmountable media due to errors
like:
UDF-fs: error (device dm-1): udf_read_tagged: tag version 0x0000 != 0x0002 || 0x0003, block 0
UDF-fs: warning (device dm-1): udf_fill_super: No fileset found
Fix the problem by handling partition descriptors in a way that sparse
partition numbering does not matter.
Reported-and-tested-by: jean-luc malet <jeanluc.malet@gmail.com>
CC: stable@vger.kernel.org
Fixes: 7b78fd02fb
Signed-off-by: Jan Kara <jack@suse.cz>
Merge yet more updates from Andrew Morton:
- the rest of MM
- various misc fixes and tweaks
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (22 commits)
mm: Change return type int to vm_fault_t for fault handlers
lib/fonts: convert comments to utf-8
s390: ebcdic: convert comments to UTF-8
treewide: convert ISO_8859-1 text comments to utf-8
drivers/gpu/drm/gma500/: change return type to vm_fault_t
docs/core-api: mm-api: add section about GFP flags
docs/mm: make GFP flags descriptions usable as kernel-doc
docs/core-api: split memory management API to a separate file
docs/core-api: move *{str,mem}dup* to "String Manipulation"
docs/core-api: kill trailing whitespace in kernel-api.rst
mm/util: add kernel-doc for kvfree
mm/util: make strndup_user description a kernel-doc comment
fs/proc/vmcore.c: hide vmcoredd_mmap_dumps() for nommu builds
treewide: correct "differenciate" and "instanciate" typos
fs/afs: use new return type vm_fault_t
drivers/hwtracing/intel_th/msu.c: change return type to vm_fault_t
mm: soft-offline: close the race against page allocation
mm: fix race on soft-offlining free huge pages
namei: allow restricted O_CREAT of FIFOs and regular files
hfs: prevent crash on exit from failed search
...
Use new return type vm_fault_t for fault handler. For now, this is just
documenting that the function returns a VM_FAULT value rather than an
errno. Once all instances are converted, vm_fault_t will become a
distinct type.
Ref-> commit 1c8f422059 ("mm: change return type to vm_fault_t")
The aim is to change the return type of finish_fault() and
handle_mm_fault() to vm_fault_t type. As part of that clean up return
type of all other recursively called functions have been changed to
vm_fault_t type.
The places from where handle_mm_fault() is getting invoked will be
change to vm_fault_t type but in a separate patch.
vmf_error() is the newly introduce inline function in 4.17-rc6.
[akpm@linux-foundation.org: don't shadow outer local `ret' in __do_huge_pmd_anonymous_page()]
Link: http://lkml.kernel.org/r/20180604171727.GA20279@jordon-HP-15-Notebook-PC
Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Without CONFIG_MMU, we get a build warning:
fs/proc/vmcore.c:228:12: error: 'vmcoredd_mmap_dumps' defined but not used [-Werror=unused-function]
static int vmcoredd_mmap_dumps(struct vm_area_struct *vma, unsigned long dst,
The function is only referenced from an #ifdef'ed caller, so
this uses the same #ifdef around it.
Link: http://lkml.kernel.org/r/20180525213526.2117790-1-arnd@arndb.de
Fixes: 7efe48df8a ("vmcore: append device dumps to vmcore as elf notes")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Ganesh Goudar <ganeshgr@chelsio.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use new return type vm_fault_t for fault handler in struct
vm_operations_struct. For now, this is just documenting that the
function returns a VM_FAULT value rather than an errno. Once all
instances are converted, vm_fault_t will become a distinct type.
See 1c8f422059 ("mm: change return type to vm_fault_t") for reference.
Link: http://lkml.kernel.org/r/20180702152017.GA3780@jordon-HP-15-Notebook-PC
Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Disallows open of FIFOs or regular files not owned by the user in world
writable sticky directories, unless the owner is the same as that of the
directory or the file is opened without the O_CREAT flag. The purpose
is to make data spoofing attacks harder. This protection can be turned
on and off separately for FIFOs and regular files via sysctl, just like
the symlinks/hardlinks protection. This patch is based on Openwall's
"HARDEN_FIFO" feature by Solar Designer.
This is a brief list of old vulnerabilities that could have been prevented
by this feature, some of them even allow for privilege escalation:
CVE-2000-1134
CVE-2007-3852
CVE-2008-0525
CVE-2009-0416
CVE-2011-4834
CVE-2015-1838
CVE-2015-7442
CVE-2016-7489
This list is not meant to be complete. It's difficult to track down all
vulnerabilities of this kind because they were often reported without any
mention of this particular attack vector. In fact, before
hardlinks/symlinks restrictions, fifos/regular files weren't the favorite
vehicle to exploit them.
[s.mesoraca16@gmail.com: fix bug reported by Dan Carpenter]
Link: https://lkml.kernel.org/r/20180426081456.GA7060@mwanda
Link: http://lkml.kernel.org/r/1524829819-11275-1-git-send-email-s.mesoraca16@gmail.com
[keescook@chromium.org: drop pr_warn_ratelimited() in favor of audit changes in the future]
[keescook@chromium.org: adjust commit subjet]
Link: http://lkml.kernel.org/r/20180416175918.GA13494@beast
Signed-off-by: Salvatore Mesoraca <s.mesoraca16@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Suggested-by: Solar Designer <solar@openwall.com>
Suggested-by: Kees Cook <keescook@chromium.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
hfs_find_exit() expects fd->bnode to be NULL after a search has failed.
hfs_brec_insert() may instead set it to an error-valued pointer. Fix
this to prevent a crash.
Link: http://lkml.kernel.org/r/53d9749a029c41b4016c495fc5838c9dba3afc52.1530294815.git.ernesto.mnd.fernandez@gmail.com
Signed-off-by: Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.com>
Cc: Anatoly Trosinenko <anatoly.trosinenko@gmail.com>
Cc: Viacheslav Dubeyko <slava@dubeyko.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
hfs_find_exit() expects fd->bnode to be NULL after a search has failed.
hfs_brec_insert() may instead set it to an error-valued pointer. Fix
this to prevent a crash.
Link: http://lkml.kernel.org/r/803590a35221fbf411b2c141419aea3233a6e990.1530294813.git.ernesto.mnd.fernandez@gmail.com
Signed-off-by: Ernesto A. Fernandez <ernesto.mnd.fernandez@gmail.com>
Reported-by: Anatoly Trosinenko <anatoly.trosinenko@gmail.com>
Reviewed-by: Vyacheslav Dubeyko <slava@dubeyko.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
An HFS+ filesystem can be mounted read-only without having a metadata
directory, which is needed to support hardlinks. But if the catalog
data is corrupted, a directory lookup may still find dentries claiming
to be hardlinks.
hfsplus_lookup() does check that ->hidden_dir is not NULL in such a
situation, but mistakenly does so after dereferencing it for the first
time. Reorder this check to prevent a crash.
This happens when looking up corrupted catalog data (dentry) on a
filesystem with no metadata directory (this could only ever happen on a
read-only mount). Wen Xu sent the replication steps in detail to the
fsdevel list: https://bugzilla.kernel.org/show_bug.cgi?id=200297
Link: http://lkml.kernel.org/r/20180712215344.q44dyrhymm4ajkao@eaf
Signed-off-by: Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.com>
Reported-by: Wen Xu <wen.xu@gatech.edu>
Cc: Viacheslav Dubeyko <slava@dubeyko.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Stable bufixes:
- v3.17+: Fix an off-by-one in bl_map_stripe()
- v4.9+: NFSv4 client live hangs after live data migration recovery
- v4.18+: xprtrdma: Fix disconnect regression
- v4.14+: Fix locking in pnfs_generic_recover_commit_reqs
- v4.9+: Fix a sleep in atomic context in nfs4_callback_sequence()
Features:
- Add support for asynchronous server-side COPY operations
Other bugfixes and cleanups:
- Optitmizations and fixes involving NFS v4.1 / pNFS layout handling
- Optimize lseek(fd, SEEK_CUR, 0) on directories to avoid locking
- Immediately reschedule writeback when the server replies with an error
- Fix excessive attribute revalidation in nfs_execute_ok()
- Add error checking to nfs_idmap_prepare_message()
- Use new vm_fault_t return type
- Return a delegation when reclaiming one that the server has recalled
- Referrals should inherit proto setting from parents
- Make rpc_auth_create_args a const
- Improvements to rpc_iostats tracking
- Fix a potential reference leak when there is an error processing a callback
- Fix rmdir / mkdir / rename nlink accounting
- Fix updating inode change attribute
- Fix error handling in nfsn4_sp4_select_mode()
- Use an appropriate work queue for direct-write completion
- Don't busy wait if NFSv4 session draining is interrupted
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAlt/CYIACgkQ18tUv7Cl
QOu8gBAA0xQWmgRoG6oIdYUxvgYqhuJmMqC4SU1E6mCJ93xEuUSvEFw51X+84KCt
r6UPkp/bKiVe3EIinKTplIzuxgggXNG0EQmO46FYNTl7nqpN85ffLsQoWsiD23fp
j8afqKPFR2zfhHXLKQC7k1oiOpwGqJ+EJWgIW4llE80pSNaErEoEaDqSPds5thMN
dHEjjLr8ef6cbBux6sSPjwWGNbE82uoSu3MDuV2+e62hpGkgvuEYo1vyE6ujeZW5
MUsmw+AHZkwro0msTtNBOHcPZAS0q/2UMPzl1tsDeCWNl2mugqZ6szQLSS2AThKq
Zr6iK9Q5dWjJfrQHcjRMnYJB+SCX1SfPA7ASuU34opwcWPjecbS9Q92BNTByQYwN
o9ngs2K0mZfqpYESMAmf7Il134cCBrtEp3skGko2KopJcYcE5YUFhdKihi1yQQjU
UbOOubMpQk8vY9DpDCAwGbICKwUZwGvq27uuUWL20kFVDb1+jvfHwcV4KjRAJo/E
J9aFtU+qOh4rMPMnYlEVZcAZBGfenlv/DmBl1upRpjzBkteUpUJsAbCmGyAk4616
3RECasehgsjNCQpFIhv3FpUkWzP5jt0T3gRr1NeY6WKJZwYnHEJr9PtapS+EIsCT
tB5DvvaJqFtuHFOxzn+KlGaxdSodHF7klOq7NM3AC0cX8AkWqaU=
=8+9t
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-4.19-1' of git://git.linux-nfs.org/projects/anna/linux-nfs
Pull NFS client updates from Anna Schumaker:
"These patches include adding async support for the v4.2 COPY
operation. I think Bruce is planning to send the server patches for
the next release, but I figured we could get the client side out of
the way now since it's been in my tree for a while. This shouldn't
cause any problems, since the server will still respond with
synchronous copies even if the client requests async.
Features:
- Add support for asynchronous server-side COPY operations
Stable bufixes:
- Fix an off-by-one in bl_map_stripe() (v3.17+)
- NFSv4 client live hangs after live data migration recovery (v4.9+)
- xprtrdma: Fix disconnect regression (v4.18+)
- Fix locking in pnfs_generic_recover_commit_reqs (v4.14+)
- Fix a sleep in atomic context in nfs4_callback_sequence() (v4.9+)
Other bugfixes and cleanups:
- Optimizations and fixes involving NFS v4.1 / pNFS layout handling
- Optimize lseek(fd, SEEK_CUR, 0) on directories to avoid locking
- Immediately reschedule writeback when the server replies with an
error
- Fix excessive attribute revalidation in nfs_execute_ok()
- Add error checking to nfs_idmap_prepare_message()
- Use new vm_fault_t return type
- Return a delegation when reclaiming one that the server has
recalled
- Referrals should inherit proto setting from parents
- Make rpc_auth_create_args a const
- Improvements to rpc_iostats tracking
- Fix a potential reference leak when there is an error processing a
callback
- Fix rmdir / mkdir / rename nlink accounting
- Fix updating inode change attribute
- Fix error handling in nfsn4_sp4_select_mode()
- Use an appropriate work queue for direct-write completion
- Don't busy wait if NFSv4 session draining is interrupted"
* tag 'nfs-for-4.19-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (54 commits)
pNFS: Remove unwanted optimisation of layoutget
pNFS/flexfiles: ff_layout_pg_init_read should exit on error
pNFS: Treat RECALLCONFLICT like DELAY...
pNFS: When updating the stateid in layoutreturn, also update the recall range
NFSv4: Fix a sleep in atomic context in nfs4_callback_sequence()
NFSv4: Fix locking in pnfs_generic_recover_commit_reqs
NFSv4: Fix a typo in nfs4_init_channel_attrs()
NFSv4: Don't busy wait if NFSv4 session draining is interrupted
NFS recover from destination server reboot for copies
NFS add a simple sync nfs4_proc_commit after async COPY
NFS handle COPY ERR_OFFLOAD_NO_REQS
NFS send OFFLOAD_CANCEL when COPY killed
NFS export nfs4_async_handle_error
NFS handle COPY reply CB_OFFLOAD call race
NFS add support for asynchronous COPY
NFS COPY xdr handle async reply
NFS OFFLOAD_CANCEL xdr
NFS CB_OFFLOAD xdr
NFS: Use an appropriate work queue for direct-write completion
NFSv4: Fix error handling in nfs4_sp4_select_mode()
...
missing Chuck's fixes for the problem with callbacks over GSS from
multi-homed servers, and a smaller fix from Laura Abbott.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJbftA8AAoJECebzXlCjuG+QPMQALieEKkX0YoqRhPz5G+RrWFy
KgOBFAoiRcjFQD6wMt9FzD6qYEZqSJ+I2b+K5N3BkdyDDQu845iD0wK0zBGhMgLm
7ith85nphIMbe18+5jPorqAsI9RlfBQjiSGw1MEx5dicLQQzTObHL5q+l5jcWna4
jWS3yUKv1URpOsR1hIryw74ktSnhuH8n//zmntw8aWrCkq3hnXOZK/agtYxZ7Viv
V3kiQsiNpL2FPRcHN7ejhLUTnRkkuD2iYKrzP/SpTT/JfdNEUXlMhKkAySogNpus
nvR9X7hwta8Lgrt7PSB9ibFTXtCupmuICg5mbDWy6nXea2NvpB01QhnTzrlX17Eh
Yfk/18z95b6Qs1v4m3SI8ESmyc6l5dMZozLudtHzifyCqooWZriEhCR1PlQfQ/FJ
4cYQ8U/qiMiZIJXL7N2wpSoSaWR5bqU1rXen29Np1WEDkiv4Nf5u2fsCXzv0ZH2C
ReWpNkbnNxsNiKpp4geBZtlcSEU1pk+1PqE0MagTdBV3iptiUHRSP4jR7qLnc0zT
J1lCvU7Fodnt9vNSxMpt2Jd6XxQ6xtx7n6aMQAiYFnXDs+hP2hPnJVCScnYW3L6R
2r1sHRKKeoOzCJ2thw+zu4lOwMm7WPkJPWAYfv90reWkiKoy2vG0S9P7wsNGoJuW
fuEjB2b9pow1Ffynat6q
=JnLK
-----END PGP SIGNATURE-----
Merge tag 'nfsd-4.19-1' of git://linux-nfs.org/~bfields/linux
Pull nfsd updates from Bruce Fields:
"Chuck Lever fixed a problem with NFSv4.0 callbacks over GSS from
multi-homed servers.
The only new feature is a minor bit of protocol (change_attr_type)
which the client doesn't even use yet.
Other than that, various bugfixes and cleanup"
* tag 'nfsd-4.19-1' of git://linux-nfs.org/~bfields/linux: (27 commits)
sunrpc: Add comment defining gssd upcall API keywords
nfsd: Remove callback_cred
nfsd: Use correct credential for NFSv4.0 callback with GSS
sunrpc: Extract target name into svc_cred
sunrpc: Enable the kernel to specify the hostname part of service principals
sunrpc: Don't use stack buffer with scatterlist
rpc: remove unneeded variable 'ret' in rdma_listen_handler
nfsd: use true and false for boolean values
nfsd: constify write_op[]
fs/nfsd: Delete invalid assignment statements in nfsd4_decode_exchange_id
NFSD: Handle full-length symlinks
NFSD: Refactor the generic write vector fill helper
svcrdma: Clean up Read chunk path
svcrdma: Avoid releasing a page in svc_xprt_release()
nfsd: Mark expected switch fall-through
sunrpc: remove redundant variables 'checksumlen','blocksize' and 'data'
nfsd: fix leaked file lock with nfs exported overlayfs
nfsd: don't advertise a SCSI layout for an unsupported request_queue
nfsd: fix corrupted reply to badly ordered compound
nfsd: clarify check_op_ordering
...
- Year 2038 preparations
- New UBI feature to skip CRC checks of static volumes
- A new Kconfig option to disable xattrs in UBIFS
- Lots of fixes in UBIFS, found by our new test framework
-----BEGIN PGP SIGNATURE-----
iQJKBAABCAA0FiEEdgfidid8lnn52cLTZvlZhesYu8EFAlt9zFkWHHJpY2hhcmRA
c2lnbWEtc3Rhci5hdAAKCRBm+VmF6xi7waiuD/oDYzerOLe0R7n2sRT9zjtY8kCx
LuizRvDYUlmMynI6EVahfyJy2IixcDmXOklGdxJqUkN5igDC/FORWdQjv2X9y56d
qZ2dlS8aBvI0ZKBG2ew4VP1H67CXtCw8H9fE32SGotPmxKRUQqt2vKqo+vgQfapH
eSVPrOaoqoRh+/ieumYXsvFdEUWpa66G3tVMFe4znu+kYRBbGzSszxpuq1ukIls2
P9wewqbWAZpqn+n9A9+RBIv81g+jH87/acfjK2L7/lT9wsFO7BQGKi373dPbnTa5
9WsjGEd+Gt0kb4Kjh5QegY97bPqWjmaMj1BLqeQVpSbQqpzkiFMf9GW5+h3XqAfO
hM1zzgONZMxHdZSKH0bWzIRQbvU6v0d9C4J/elfFuH9ke2XscrxjOtZZQbtbGeYj
tE7FWoZnB8euXubulGAUBKofzWe+gItBe9+iA29EBETNOemrJyHyKjO0Fe9ze5p2
bfVFvN62kHz4ZCJoinwO/OpXnCuA91xrVocLOOIreb4dkZ/kqP+YZWFf70FcE1o5
sPAbAUu+hfb2LbpktEdZHHbhoupfCnJokzfboJMX0NWKRtFXJDONjogJYTFUjrpW
eXS+55+WFHoLWtx9J2IVmcb3cQrj/W/4J83kSg99cUkVjGpil50zmtzhq9bHzsLc
wazngueP7kW2l9bSSg==
=gCyp
-----END PGP SIGNATURE-----
Merge tag 'upstream-4.19-rc1' of git://git.infradead.org/linux-ubifs
Pull UBI/UBIFS updates from Richard Weinberger:
- Year 2038 preparations
- New UBI feature to skip CRC checks of static volumes
- A new Kconfig option to disable xattrs in UBIFS
- Lots of fixes in UBIFS, found by our new test framework
* tag 'upstream-4.19-rc1' of git://git.infradead.org/linux-ubifs: (21 commits)
ubifs: Set default assert action to read-only
ubifs: Allow setting assert action as mount parameter
ubifs: Rework ubifs_assert()
ubifs: Pass struct ubifs_info to ubifs_assert()
ubifs: Turn two ubifs_assert() into a WARN_ON()
ubi: expose the volume CRC check skip flag
ubi: provide a way to skip CRC checks
ubifs: Use kmalloc_array()
ubifs: Check data node size before truncate
Revert "UBIFS: Fix potential integer overflow in allocation"
ubifs: Add comment on c->commit_sem
ubifs: introduce Kconfig symbol for xattr support
ubifs: use swap macro in swap_dirty_idx
ubifs: tnc: use monotonic znode timestamp
ubifs: use timespec64 for inode timestamps
ubifs: xattr: Don't operate on deleted inodes
ubifs: gc: Fix typo
ubifs: Fix memory leak in lprobs self-check
ubi: Initialize Fastmap checkmapping correctly
ubifs: Fix synced_i_size calculation for xattr inodes
...
The kmalloc was not being checked - if it fails issue a warning
and return -ENOMEM to the caller.
Signed-off-by: Nicholas Mc Guire <hofrat@osadl.org>
Fixes: b8da344b74 ("cifs: dynamic allocation of ntlmssp blob")
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
cc: Stable <stable@vger.kernel.org>`
Some SMB2/3 servers, Win2016 but possibly others too, adds padding
not only between PDUs in a compound but also to the final PDU.
This padding extends the PDU to a multiple of 8 bytes.
Check if the unexpected length looks like this might be the case
and avoid triggering the log messages for :
"SMB2 server sent bad RFC1001 len %d not %d\n"
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
When running in a container with a user namespace, if you call getxattr
with name = "system.posix_acl_access" and size % 8 != 4, then getxattr
silently skips the user namespace fixup that it normally does resulting in
un-fixed-up data being returned.
This is caused by posix_acl_fix_xattr_to_user() being passed the total
buffer size and not the actual size of the xattr as returned by
vfs_getxattr().
This commit passes the actual length of the xattr as returned by
vfs_getxattr() down.
A reproducer for the issue is:
touch acl_posix
setfacl -m user:0:rwx acl_posix
and the compile:
#define _GNU_SOURCE
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>
#include <attr/xattr.h>
/* Run in user namespace with nsuid 0 mapped to uid != 0 on the host. */
int main(int argc, void **argv)
{
ssize_t ret1, ret2;
char buf1[128], buf2[132];
int fret = EXIT_SUCCESS;
char *file;
if (argc < 2) {
fprintf(stderr,
"Please specify a file with "
"\"system.posix_acl_access\" permissions set\n");
_exit(EXIT_FAILURE);
}
file = argv[1];
ret1 = getxattr(file, "system.posix_acl_access",
buf1, sizeof(buf1));
if (ret1 < 0) {
fprintf(stderr, "%s - Failed to retrieve "
"\"system.posix_acl_access\" "
"from \"%s\"\n", strerror(errno), file);
_exit(EXIT_FAILURE);
}
ret2 = getxattr(file, "system.posix_acl_access",
buf2, sizeof(buf2));
if (ret2 < 0) {
fprintf(stderr, "%s - Failed to retrieve "
"\"system.posix_acl_access\" "
"from \"%s\"\n", strerror(errno), file);
_exit(EXIT_FAILURE);
}
if (ret1 != ret2) {
fprintf(stderr, "The value of \"system.posix_acl_"
"access\" for file \"%s\" changed "
"between two successive calls\n", file);
_exit(EXIT_FAILURE);
}
for (ssize_t i = 0; i < ret2; i++) {
if (buf1[i] == buf2[i])
continue;
fprintf(stderr,
"Unexpected different in byte %zd: "
"%02x != %02x\n", i, buf1[i], buf2[i]);
fret = EXIT_FAILURE;
}
if (fret == EXIT_SUCCESS)
fprintf(stderr, "Test passed\n");
else
fprintf(stderr, "Test failed\n");
_exit(fret);
}
and run:
./tester acl_posix
On a non-fixed up kernel this should return something like:
root@c1:/# ./t
Unexpected different in byte 16: ffffffa0 != 00
Unexpected different in byte 17: ffffff86 != 00
Unexpected different in byte 18: 01 != 00
and on a fixed kernel:
root@c1:~# ./t
Test passed
Cc: stable@vger.kernel.org
Fixes: 2f6f0654ab ("userns: Convert vfs posix_acl support to use kuids and kgids")
Link: https://bugzilla.kernel.org/show_bug.cgi?id=199945
Reported-by: Colin Watson <cjwatson@ubuntu.com>
Signed-off-by: Christian Brauner <christian@brauner.io>
Acked-by: Serge Hallyn <serge@hallyn.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
The issue here is that btrfs_commit_transaction() frees "trans" on both
the error and the success path. So the problem would be if
btrfs_commit_transaction() succeeds, and then qgroup_rescan_init()
fails. That means that "ret" is non-zero and "trans" is non-NULL and it
leads to a use after free inside the btrfs_end_transaction() macro.
Fixes: 340f1aa27f ("btrfs: qgroups: Move transaction management inside btrfs_quota_enable/disable")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Test case btrfs/164 reports use-after-free:
[ 6712.084324] general protection fault: 0000 [#1] PREEMPT SMP
..
[ 6712.195423] btrfs_update_commit_device_size+0x75/0xf0 [btrfs]
[ 6712.201424] btrfs_commit_transaction+0x57d/0xa90 [btrfs]
[ 6712.206999] btrfs_rm_device+0x627/0x850 [btrfs]
[ 6712.211800] btrfs_ioctl+0x2b03/0x3120 [btrfs]
Reason for this is that btrfs_shrink_device adds the resized device to
the fs_devices::resized_devices after it has called the last commit
transaction.
So the list fs_devices::resized_devices is not empty when
btrfs_shrink_device returns. Now the parent function
btrfs_rm_device calls:
btrfs_close_bdev(device);
call_rcu(&device->rcu, free_device_rcu);
and then does the transactio ncommit. It goes through the
fs_devices::resized_devices in btrfs_update_commit_device_size and
leads to use-after-free.
Fix this by making sure btrfs_shrink_device calls the last needed
btrfs_commit_transaction before the return. This is consistent with what
the grow counterpart does and this makes sure the on-disk state is
persistent when the function returns.
Reported-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Tested-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
After btrfs_qgroup_reserve_meta_prealloc(), num_bytes will be assigned
again by btrfs_calc_trans_metadata_size(). Once block_rsv fails, we
can't properly free the num_bytes of the previous qgroup_reserve. Use a
separate variable to store the num_bytes of the qgroup_reserve.
Delete the comment for the qgroup_reserved that does not exist and add a
comment about use_global_rsv.
Fixes: c4c129db5d ("btrfs: drop unused parameter qgroup_reserved")
CC: stable@vger.kernel.org # 4.18+
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we deduplicate extents between two different files we can end up
corrupting data if the source range ends at the size of the source file,
the source file's size is not aligned to the filesystem's block size
and the destination range does not go past the size of the destination
file size.
Example:
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ xfs_io -f -c "pwrite -S 0x6b 0 2518890" /mnt/foo
# The first byte with a value of 0xae starts at an offset (2518890)
# which is not a multiple of the sector size.
$ xfs_io -c "pwrite -S 0xae 2518890 102398" /mnt/foo
# Confirm the file content is full of bytes with values 0x6b and 0xae.
$ od -t x1 /mnt/foo
0000000 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
*
11467540 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b ae ae ae ae ae ae
11467560 ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae
*
11777540 ae ae ae ae ae ae ae ae
11777550
# Create a second file with a length not aligned to the sector size,
# whose bytes all have the value 0x6b, so that its extent(s) can be
# deduplicated with the first file.
$ xfs_io -f -c "pwrite -S 0x6b 0 557771" /mnt/bar
# Now deduplicate the entire second file into a range of the first file
# that also has all bytes with the value 0x6b. The destination range's
# end offset must not be aligned to the sector size and must be less
# then the offset of the first byte with the value 0xae (byte at offset
# 2518890).
$ xfs_io -c "dedupe /mnt/bar 0 1957888 557771" /mnt/foo
# The bytes in the range starting at offset 2515659 (end of the
# deduplication range) and ending at offset 2519040 (start offset
# rounded up to the block size) must all have the value 0xae (and not
# replaced with 0x00 values). In other words, we should have exactly
# the same data we had before we asked for deduplication.
$ od -t x1 /mnt/foo
0000000 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
*
11467540 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b ae ae ae ae ae ae
11467560 ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae
*
11777540 ae ae ae ae ae ae ae ae
11777550
# Unmount the filesystem and mount it again. This guarantees any file
# data in the page cache is dropped.
$ umount /dev/sdb
$ mount /dev/sdb /mnt
$ od -t x1 /mnt/foo
0000000 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
*
11461300 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 00 00 00 00 00
11461320 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
11470000 ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae
*
11777540 ae ae ae ae ae ae ae ae
11777550
# The bytes in range 2515659 to 2519040 have a value of 0x00 and not a
# value of 0xae, data corruption happened due to the deduplication
# operation.
So fix this by rounding down, to the sector size, the length used for the
deduplication when the following conditions are met:
1) Source file's range ends at its i_size;
2) Source file's i_size is not aligned to the sector size;
3) Destination range does not cross the i_size of the destination file.
Fixes: e1d227a42e ("btrfs: Handle unaligned length in extent_same")
CC: stable@vger.kernel.org # 4.2+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When we add a new name for an inode which was logged in the current
transaction, we update the inode in the log so that its new name and
ancestors are added to the log. However when we do this we do not persist
the log, so the changes remain in memory only, and as a consequence, any
ancestors that were created in the current transaction are updated such
that future calls to btrfs_inode_in_log() return true. This leads to a
subsequent fsync against such new ancestor directories returning
immediately, without persisting the log, therefore after a power failure
the new ancestor directories do not exist, despite fsync being called
against them explicitly.
Example:
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ mkdir /mnt/A
$ mkdir /mnt/B
$ mkdir /mnt/A/C
$ touch /mnt/B/foo
$ xfs_io -c "fsync" /mnt/B/foo
$ ln /mnt/B/foo /mnt/A/C/foo
$ xfs_io -c "fsync" /mnt/A
<power failure>
After the power failure, directory "A" does not exist, despite the explicit
fsync on it.
Instead of fixing this by changing the behaviour of the explicit fsync on
directory "A" to persist the log instead of doing nothing, make the logging
of the new file name (which happens when creating a hard link or renaming)
persist the log. This approach not only is simpler, not requiring addition
of new fields to the inode in memory structure, but also gives us the same
behaviour as ext4, xfs and f2fs (possibly other filesystems too).
A test case for fstests follows soon.
Fixes: 12fcfd22fe ("Btrfs: tree logging unlink/rename fixes")
Reported-by: Vijay Chidambaram <vvijay03@gmail.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Clean up: The global callback_cred is no longer used, so it can be
removed.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
I've had trouble when operating a multi-homed Linux NFS server with
Kerberos using NFSv4.0. Lately, I've seen my clients reporting
this (and then hanging):
May 9 11:43:26 manet kernel: NFS: NFSv4 callback contains invalid cred
The client-side commit f11b2a1cfb ("nfs4: copy acceptor name from
context to nfs_client") appears to be related, but I suspect this
problem has been going on for some time before that.
RFC 7530 Section 3.3.3 says:
> For Kerberos V5, nfs/hostname would be a server principal in the
> Kerberos Key Distribution Center database. This is the same
> principal the client acquired a GSS-API context for when it issued
> the SETCLIENTID operation ...
In other words, an NFSv4.0 client expects that the server will use
the same GSS principal for callback that the client used to
establish its lease. For example, if the client used the service
principal "nfs@server.domain" to establish its lease, the server
is required to use "nfs@server.domain" when performing NFSv4.0
callback operations.
The Linux NFS server currently does not. It uses a common service
principal for all callback connections. Sometimes this works as
expected, and other times -- for example, when the server is
accessible via multiple hostnames -- it won't work at all.
This patch scrapes the target name from the client credential,
and uses that for the NFSv4.0 callback credential. That should
be correct much more often.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
NFSv4.0 callback needs to know the GSS target name the client used
when it established its lease. That information is available from
the GSS context created by gssproxy. Make it available in each
svc_cred.
Note this will also give us access to the real target service
principal name (which is typically "nfs", but spec does not require
that).
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
In this round, we've tuned f2fs to improve general performance by serializing
block allocation and enhancing discard flows like fstrim which avoids user IO
contention. And we've added fsync_mode=nobarrier which gives an option to user
where it skips issuing cache_flush commands to underlying flash storage. And
there are many bug fixes related to fuzzed images, revoked atomic writes, quota
ops, and minor direct IO.
Enhancement:
- add fsync_mode=nobarrier which bypasses cache_flush command
- enhance the discarding flow which avoids user IOs and issues in LBA order
- readahead some encrypted blocks during GC
- enable in-memory inode checksum to verify the blocks if F2FS_CHECK_FS is set
- enhance nat_bits behavior
- set -o discard by default
- set REQ_RAHEAD to bio in ->readpages
Bug fixes:
- fix a corner case to corrupt atomic_writes revoking flow
- revisit i_gc_rwsem to fix race conditions
- fix some dio behaviors captured by xfstests
- correct handling errors given by quota-related failures
- add many sanity check flows to avoid fuzz test failures
- add more error number propagation to their callers
- fix several corner cases to continue fault injection w/ shutdown loop
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE00UqedjCtOrGVvQiQBSofoJIUNIFAlt82U4ACgkQQBSofoJI
UNJTLQ/+PhewnNa5tDfUgWdFnUFz3h9/NcC677l0OplOOUNxA8iSa1xamlKf/nf9
sB5ey0I7oBF8zQGxfndhHQfi6fpfUcNMr14hm+TS/3+d54xLJmiVShD5fjNSV2vB
Ur0xoozuQDwYF1e3QKdBQjFqaCf78VheTr3aWxyv22/Sg+PYylZJ2K8rHTB7mGPU
UG0aRnKrP3FPRjL7Q3m0Vm6b6eZ5uNdNrFfjgn/8yuQQ9V197K8vwSbPAsR5/pOh
miCQXyM708NgEYJRWkWmC/rDSQdU0/h/mGnJWrBrbceW62QefGOgd2jcVfmthHJa
ZXpj+BEG5bYpCCxGxF6N+u0e28OKonCwO/uvL8YAd5icN7yXtsKzoF1CCuXxOYf1
9K5SMylCTSyrs/+LV8CJoT2ya8w0l0w+R/txUYn8UT+4AgqU+chS2kJeXqw9tcHB
WLFs/rnAyofWCI/8frVBmJY+zA1ZZvTqs/lmVYrtJUkiOcMTq34WICBUAEFKV452
BM5dcu21bSIkapYispEt4Rr7o4P4HHMQ+N1i2yUZMFCz5T0RyzdybeS5THk2yVzd
L0kxfYU+zHigNX51ez8+Z7DyDLDBp6jkD0e66x73bUK9TGPH+ZbnAL6gwLikmD3M
+VxYl5nyW/3bxx1HdfK1Xwd4/wYMNBmdtn5NC50oZ+jpB0h8YCE=
=zgbL
-----END PGP SIGNATURE-----
Merge tag 'f2fs-for-4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
Pull f2fs updates from Jaegeuk Kim:
"In this round, we've tuned f2fs to improve general performance by
serializing block allocation and enhancing discard flows like fstrim
which avoids user IO contention. And we've added fsync_mode=nobarrier
which gives an option to user where it skips issuing cache_flush
commands to underlying flash storage. And there are many bug fixes
related to fuzzed images, revoked atomic writes, quota ops, and minor
direct IO.
Enhancements:
- add fsync_mode=nobarrier which bypasses cache_flush command
- enhance the discarding flow which avoids user IOs and issues in
LBA order
- readahead some encrypted blocks during GC
- enable in-memory inode checksum to verify the blocks if
F2FS_CHECK_FS is set
- enhance nat_bits behavior
- set -o discard by default
- set REQ_RAHEAD to bio in ->readpages
Bug fixes:
- fix a corner case to corrupt atomic_writes revoking flow
- revisit i_gc_rwsem to fix race conditions
- fix some dio behaviors captured by xfstests
- correct handling errors given by quota-related failures
- add many sanity check flows to avoid fuzz test failures
- add more error number propagation to their callers
- fix several corner cases to continue fault injection w/ shutdown
loop"
* tag 'f2fs-for-4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (89 commits)
f2fs: readahead encrypted block during GC
f2fs: avoid fi->i_gc_rwsem[WRITE] lock in f2fs_gc
f2fs: fix performance issue observed with multi-thread sequential read
f2fs: fix to skip verifying block address for non-regular inode
f2fs: rework fault injection handling to avoid a warning
f2fs: support fault_type mount option
f2fs: fix to return success when trimming meta area
f2fs: fix use-after-free of dicard command entry
f2fs: support discard submission error injection
f2fs: split discard command in prior to block layer
f2fs: wake up gc thread immediately when gc_urgent is set
f2fs: fix incorrect range->len in f2fs_trim_fs()
f2fs: refresh recent accessed nat entry in lru list
f2fs: fix avoid race between truncate and background GC
f2fs: avoid race between zero_range and background GC
f2fs: fix to do sanity check with block address in main area v2
f2fs: fix to do sanity check with inline flags
f2fs: fix to reset i_gc_failures correctly
f2fs: fix invalid memory access
f2fs: fix to avoid broken of dnode block list
...
...otherwise there will be list corruption due to inode_sb_list_add() being
called for inode already on the sb list.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: e950564b97 ("vfs: don't evict uninitialized inode")
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge more updates from Andrew Morton:
- the rest of MM
- procfs updates
- various misc things
- more y2038 fixes
- get_maintainer updates
- lib/ updates
- checkpatch updates
- various epoll updates
- autofs updates
- hfsplus
- some reiserfs work
- fatfs updates
- signal.c cleanups
- ipc/ updates
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (166 commits)
ipc/util.c: update return value of ipc_getref from int to bool
ipc/util.c: further variable name cleanups
ipc: simplify ipc initialization
ipc: get rid of ids->tables_initialized hack
lib/rhashtable: guarantee initial hashtable allocation
lib/rhashtable: simplify bucket_table_alloc()
ipc: drop ipc_lock()
ipc/util.c: correct comment in ipc_obtain_object_check
ipc: rename ipcctl_pre_down_nolock()
ipc/util.c: use ipc_rcu_putref() for failues in ipc_addid()
ipc: reorganize initialization of kern_ipc_perm.seq
ipc: compute kern_ipc_perm.id under the ipc lock
init/Kconfig: remove EXPERT from CHECKPOINT_RESTORE
fs/sysv/inode.c: use ktime_get_real_seconds() for superblock stamp
adfs: use timespec64 for time conversion
kernel/sysctl.c: fix typos in comments
drivers/rapidio/devices/rio_mport_cdev.c: remove redundant pointer md
fork: don't copy inconsistent signal handler state to child
signal: make get_signal() return bool
signal: make sigkill_pending() return bool
...
get_seconds() is deprecated in favor of ktime_get_real_seconds(), which
returns a 64-bit timestamp.
In the SYSV file system, the superblock timestamp is only 32 bits wide,
and it is used to check whether a file system is clean, so the best
solution seems to be to force a wraparound and explicitly convert it to an
unsigned 32-bit value.
This is independent of the inode timestamps that are also 32-bit wide on
disk and that come from current_time().
Link: http://lkml.kernel.org/r/20180713145236.3152513-1-arnd@arndb.de
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We just truncate the seconds to 32-bit in one place now, so this can
trivially be converted over to using timespec64 consistently.
Link: http://lkml.kernel.org/r/20180620100133.4035614-1-arnd@arndb.de
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now that we pass down 64-bit timestamps from VFS, we just need to convert
that correctly into on-disk timestamps. To make that work correctly, this
changes the last use of time_to_tm() in the kernel to time64_to_tm(),
which also lets use remove that deprecated interfaces.
Similarly, the time_t use in fat_time_fat2unix() truncates the timestamp
on the way in, which can be avoided by using types that are wide enough to
hold the intermediate values during the conversion.
[hirofumi@mail.parknet.co.jp: remove useless temporary variable, needless long long]
Link: http://lkml.kernel.org/r/20180619153646.3637529-1-arnd@arndb.de
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Cc: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
On corrupted FATfs may have invalid ->i_start. To handle it, this checks
->i_start before using, and return proper error code.
Link: http://lkml.kernel.org/r/87o9f8y1t5.fsf_-_@mail.parknet.co.jp
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Reported-by: Anatoly Trosinenko <anatoly.trosinenko@gmail.com>
Tested-by: Anatoly Trosinenko <anatoly.trosinenko@gmail.com>
Cc: Alan Cox <gnomes@lxorguk.ukuu.org.uk>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This fixes the following issues:
- When a buffer size is supplied to reiserfs_listxattr() such that each
individual name fits, but the concatenation of all names doesn't fit,
reiserfs_listxattr() overflows the supplied buffer. This leads to a
kernel heap overflow (verified using KASAN) followed by an out-of-bounds
usercopy and is therefore a security bug.
- When a buffer size is supplied to reiserfs_listxattr() such that a
name doesn't fit, -ERANGE should be returned. But reiserfs instead just
truncates the list of names; I have verified that if the only xattr on a
file has a longer name than the supplied buffer length, listxattr()
incorrectly returns zero.
With my patch applied, -ERANGE is returned in both cases and the memory
corruption doesn't happen anymore.
Credit for making me clean this code up a bit goes to Al Viro, who pointed
out that the ->actor calling convention is suboptimal and should be
changed.
Link: http://lkml.kernel.org/r/20180802151539.5373-1-jannh@google.com
Fixes: 48b32a3553 ("reiserfs: use generic xattr handlers")
Signed-off-by: Jann Horn <jannh@google.com>
Acked-by: Jeff Mahoney <jeffm@suse.com>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This uses the deprecated time_t type but is write-only, and could be
removed, but as Jeff explains, having a timestamp can be usefule for
post-mortem analysis in crash dumps.
In order to remove one of the last instances of time_t, this changes the
type to time64_t, same as j_trans_start_time.
Link: http://lkml.kernel.org/r/20180622133315.221210-1-arnd@arndb.de
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Before linux-2.4.6, print_time() was used to pretty-print an inode time
when running reiserfs in user space, after that it has become obsolete and
is still a bit incorrect: It behaves differently on 32-bit and 64-bit
machines, and uses a static buffer to hold a string, which could lead to
undefined behavior if we ever called this from multiple places
simultaneously.
Since we always want to treat the timestamps as 'unsigned' anyway, simply
printing them as an integer is both simpler and safer while avoiding the
deprecated time_t type.
Link: http://lkml.kernel.org/r/20180620142522.27639-3-arnd@arndb.de
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Using CLOCK_REALTIME time_t timestamps breaks on 32-bit systems in 2038,
and gives surprising results with a concurrent settimeofday().
This changes the reiserfs journal timestamps to use ktime_get_seconds()
instead, which makes it use a 64-bit CLOCK_MONOTONIC stamp.
In the procfs output, the monotonic timestamp needs to be converted back
to CLOCK_REALTIME to keep the existing ABI.
Link: http://lkml.kernel.org/r/20180620142522.27639-2-arnd@arndb.de
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The HFS+ Access Control Lists have not worked at all for the past five
years, and nobody seems to have noticed. Besides, POSIX draft ACLs are
not compatible with MacOS. Drop the feature entirely.
Link: http://lkml.kernel.org/r/20180714190608.wtnmmtjqeyladkut@eaf
Signed-off-by: Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Cc: Viacheslav Dubeyko <slava@dubeyko.com>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Files created under macOS cannot be opened under linux if their names
contain Korean characters, and vice versa.
The Korean alphabet is special because its normalization is done without a
table. The module deals with it correctly when composing, but forgets
about it for the decomposition.
Fix this using the Hangul decomposition function provided in the Unicode
Standard. The code fits a bit awkwardly because it requires a buffer,
while all the other normalizations are returned as pointers to the
decomposition table. This is actually also a bug because reordering may
still be needed, but for now leave it as it is.
The patch will cause trouble for Hangul filenames already created by the
module in the past. This shouldn't really be concern because its main
purpose was always sharing with macOS. If a user actually needs to access
such a file the nodecompose mount option should be enough.
Link: http://lkml.kernel.org/r/20180717220951.p6qqrgautc4pxvzu@eaf
Signed-off-by: Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.com>
Reported-by: Ting-Chang Hou <tchou@synology.com>
Tested-by: Ting-Chang Hou <tchou@synology.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
After an extent is removed from the extent tree, the corresponding bits
are also cleared from the block allocation file. This is currently done
without releasing the tree lock.
The problem is that the allocation file has extents of its own; if it is
fragmented enough, some of them may be in the extent tree as well, and
hfsplus_get_block() will try to take the lock again.
To avoid deadlock, only hold the extent tree lock during the actual tree
operations.
Link: http://lkml.kernel.org/r/20180709202549.auxwkb6memlegb4a@eaf
Signed-off-by: Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.com>
Reported-by: Anatoly Trosinenko <anatoly.trosinenko@gmail.com>
Cc: Viacheslav Dubeyko <slava@dubeyko.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
syzbot is reporting NULL pointer dereference at mount_fs() [1]. This is
because hfsplus_fill_super() is by error returning 0 when
hfsplus_fill_super() detected invalid filesystem image, and mount_bdev()
is returning NULL because dget(s->s_root) == NULL if s->s_root == NULL,
and mount_fs() is accessing root->d_sb because IS_ERR(root) == false if
root == NULL. Fix this by returning -EINVAL when hfsplus_fill_super()
detected invalid filesystem image.
[1] https://syzkaller.appspot.com/bug?id=21acb6850cecbc960c927229e597158cf35f33d0
Link: http://lkml.kernel.org/r/d83ce31a-874c-dd5b-f790-41405983a5be@I-love.SAKURA.ne.jp
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reported-by: syzbot <syzbot+01ffaf5d9568dd1609f7@syzkaller.appspotmail.com>
Reviewed-by: Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use new return type vm_fault_t for page_mkwrite handler.
Link: http://lkml.kernel.org/r/1529555928-2411-1-git-send-email-konishi.ryusuke@lab.ntt.co.jp
Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The mount time field in the superblock uses a 64-bit timestamp, but
calling get_seconds() may truncate the current time to 32 bits.
This changes it to ktime_get_real_seconds() to avoid the potential
overflow.
Link: http://lkml.kernel.org/r/20180620075041.4154396-1-arnd@arndb.de
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: David Howells <dhowells@redhat.com>
Cc: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The userspace automount(8) daemon is meant to perform a forced expire when
sent a SIGUSR2.
But since the expiration is routed through the kernel and the kernel
doesn't send an expire request if the mount is busy this hasn't worked at
least since autofs version 5.
Add an AUTOFS_EXP_FORCED flag to allow implemention of the feature and
bump the protocol version so user space can check if it's implemented if
needed.
Link: http://lkml.kernel.org/r/152937734715.21213.6594007182776598970.stgit@pluto.themaw.net
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Make the usage of the expire flags consistent by naming the expire flags
the same as it is named in the version 5 miscelaneous ioctl parameters and
only check the bit flags when needed.
Link: http://lkml.kernel.org/r/152937734046.21213.9454131988766280028.stgit@pluto.themaw.net
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The expire flag AUTOFS_EXP_LEAVES is cleared before the second call to
should_expire() in autofs_expire_indirect() but the parameter passed in
the second call is incorrect.
Fortunately AUTOFS_EXP_LEAVES expire flag has not been used for a long
time but might be needed in the future so fix it rather than remove the
expire leaves functionality.
Link: http://lkml.kernel.org/r/152937732410.21213.7447294898147765076.stgit@pluto.themaw.net
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The global variable "now" in fs/autofs/expire.c is used in an inconsistent
way, sometimes using jiffies directly, and sometimes using the "now"
variable, and setting it isn't done consistently either.
But the autofs dentry info last_used field is only updated during path
walks or during expire so jiffies can be used directly and the global
variable "now" removed.
Link: http://lkml.kernel.org/r/152937731702.21213.7371321165189170865.stgit@pluto.themaw.net
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Depending on how it is configured the autofs user space daemon can leave
in use mounts mounted at exit and re-connect to them at start up. But for
this to work best the state of the autofs file system needs to be left
intact over the restart.
Also, at system shutdown, mounts in an autofs file system might be
umounted exposing a mount point trigger for which subsequent access can
lead to a hang. So recent versions of automount(8) now does its best to
set autofs file system mounts catatonic at shutdown.
When autofs file system mounts are catatonic it's currently possible to
create and remove directories and symlinks which can be a problem at
restart, as described above.
So return EACCES in the directory, symlink and unlink methods if the
autofs file system is catatonic.
Link: http://lkml.kernel.org/r/152902119090.4144.9561910674530214291.stgit@pluto.themaw.net
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Instead of having each caller pass the rdllink explicitly, just have
ep_is_linked() pass it while the callers just need the epi pointer. This
helper is all about the rdllink, and this change, furthermore, improves
the function's self documentation.
Link: http://lkml.kernel.org/r/20180727053432.16679-3-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Similar to other calls, ep_poll() is not called with interrupts disabled,
and we can therefore avoid the irq save/restore dance and just disable
local irqs. In fact, the call should never be called in irq context at
all, considering that the only path is
epoll_wait(2) -> do_epoll_wait() -> ep_poll().
When running on a 2 socket 40-core (ht) IvyBridge a common pipe based
epoll_wait(2) microbenchmark, the following performance improvements are
seen:
# threads vanilla dirty
1 1805587 2106412
2 1854064 2090762
4 1805484 2017436
8 1751222 1974475
16 1725299 1962104
32 1378463 1571233
64 787368 900784
Which is a pretty constantly near 15%.
Also add a lockdep check such that we detect any mischief before
deadlocking.
Link: http://lkml.kernel.org/r/20180727053432.16679-2-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Jason Baron <jbaron@akamai.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
... 'tis easier on the eye.
[akpm@linux-foundation.org: use inlines rather than macros]
Link: http://lkml.kernel.org/r/20180725185620.11020-1-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Sprinkle lockdep_assert_irqs_enabled() checks in the functions that do not
save and restore interrupts when dealing with the ep->wq.lock. These are
ep_scan_ready_list() and those called by epoll_ctl(): ep_insert, ep_modify
and ep_remove.
[akpm@linux-foundation.org: remove too-obvious comments]
Link: http://lkml.kernel.org/r/20180721183127.3busfa335zlcjeox@linux-r8p5
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Both functions are similar to the context of ep_modify(), called via
epoll_ctl(2). Just like ep_modify(), saving and restoring interrupts is
an overkill in these calls as it will never be called with irqs disabled.
While ep_remove() can be called directly from EPOLL_CTL_DEL, it can also
be called when releasing the file, but this also complies with the above.
Link: http://lkml.kernel.org/r/20180720172956.2883-3-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "fs/epoll: loosen irq safety when possible".
Both patches replace saving+restoring interrupts when taking the ep->lock
(now the waitqueue lock), with just disabling local irqs. This shows
immediate performance benefits in patch 1 for an epoll workload running on
Xen. The main concern we need to have with this sort of changes in epoll
is the ep_poll_callback() which is passed to the wait queue wakeup and is
done very often under irq context, this patch does not touch this call.
Patches have been tested pretty heavily with the customer workload,
microbenchmarks, ltp testcases and two high level workloads that use epoll
under the hood: nginx and libevent benchmarks.
This patch (of 2):
Saving and restoring interrupts in ep_scan_ready_list() is an
overkill as it is never called with interrupts disabled. Loosen
this to simply disabling local irqs such that archs where managing
irqs is expensive or virtual environments. This patch yields
some throughput improvements on a workload that is epoll intensive
running on a single Xen DomU.
1 Job 7500 --> 8800 enq/s (+17%)
2 Jobs 14000 --> 15200 enq/s (+8%)
3 Jobs 20500 --> 22300 enq/s (+8%)
4 Jobs 25000 --> 28000 enq/s (+8-12)%
On bare metal:
For a 2-socket 40-core (ht) IvyBridge on a few workloads, unfortunately I
don't have a xen environment and the results for Xen I do have (which
numbers are in patch 1) I don't have the actual workload, so cannot
compare them directly.
1) Different configurations were used for a epoll_wait (pipes io)
microbench (http://linux-scalability.org/epoll/epoll-test.c) and shows
around a 7-10% improvement in overall total number of times the
epoll_wait() loops when using both regular and nested epolls, so very
raw numbers, but measurable nonetheless.
# threads vanilla dirty
1 1677717 1805587
2 1660510 1854064
4 1610184 1805484
8 1577696 1751222
16 1568837 1725299
32 1291532 1378463
64 752584 787368
Note that stddev is pretty small.
2) Another pipe test, which shows no real measurable improvement.
(http://www.xmailserver.org/linux-patches/pipetest.c)
Link: http://lkml.kernel.org/r/20180720172956.2883-2-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The userfaultfd code currently uses the unlocked waitqueue helpers for
managing fault_wqh, but instead of holding the waitqueue lock for this
waitqueue around these calls, it the waitqueue lock of
fault_pending_wq, which is a different waitqueue instance. Given that
the waitqueue is not exposed to the rest of the kernel this actually
works ok at the moment, but prevents the userfaultfd locking rules from
being enforced using lockdep.
Switch to the internally locked waitqueue helpers instead. This means
that the lock inside fault_wqh now nests inside the fault_pending_wqh
lock, but that's not a problem since it was entirely unused before.
[hch@lst.de: slight changelog updates]
[rppt@linux.vnet.ibm.com: spotted changelog spellos]
Link: http://lkml.kernel.org/r/20171214152344.6880-3-hch@lst.de
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "waitqueue lockdep annotation", v3.
This series adds a strategic lockdep_assert_held to __wake_up_common to
ensure callers really do hold the wait_queue_head lock when calling the
unlocked wake_up variants. It turns out epoll did not do this for a
fairly common path (hit all the time by systemd during bootup), so the
second patch fixed this instance as well.
This patch (of 3):
The epoll code currently uses the unlocked waitqueue helpers for managing
ep->wq, but instead of holding the waitqueue lock around these calls, it
uses its own ep->lock spinlock. Given that the waitqueue is not exposed
to the rest of the kernel this actually works ok at the moment, but
prevents the epoll locking rules from being enforced using lockdep.
Remove ep->lock and use the waitqueue lock to not only reduce the size of
struct eventpoll but also to make sure we can assert locking invariants in
the waitqueue code.
Link: http://lkml.kernel.org/r/20171214152344.6880-2-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jason Baron <jbaron@akamai.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The vmcoreinfo information is useful for runtime debugging tools, not just
for crash dumps. A lot of this information can be determined by other
means, but this is much more convenient, and it only adds a page at most
to the file.
Link: http://lkml.kernel.org/r/fddbcd08eed76344863303878b12de1c1e2a04b6.1531953780.git.osandov@fb.com
Signed-off-by: Omar Sandoval <osandov@fb.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Bhupesh Sharma <bhsharma@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: James Morse <james.morse@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The current code does a full search of the segment list every time for
every page. This is wasteful, since it's almost certain that the next
page will be in the same segment. Instead, check if the previous segment
covers the current page before doing the list search.
Link: http://lkml.kernel.org/r/fd346c11090cf93d867e01b8d73a6567c5ac6361.1531953780.git.osandov@fb.com
Signed-off-by: Omar Sandoval <osandov@fb.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Bhupesh Sharma <bhsharma@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: James Morse <james.morse@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently, the ELF file header, program headers, and note segment are
allocated all at once, in some icky code dating back to 2.3. Programs
tend to read the file header, then the program headers, then the note
segment, all separately, so this is a waste of effort. It's cleaner and
more efficient to handle the three separately.
Link: http://lkml.kernel.org/r/19c92cbad0e11f6103ff3274b2e7a7e51a1eb74b.1531953780.git.osandov@fb.com
Signed-off-by: Omar Sandoval <osandov@fb.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Bhupesh Sharma <bhsharma@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: James Morse <james.morse@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now that we're using an rwsem, we can hold it during the entirety of
read_kcore() and have a common return path. This is preparation for the
next change.
[akpm@linux-foundation.org: fix locking bug reported by Tetsuo Handa]
Link: http://lkml.kernel.org/r/d7cfbc1e8a76616f3b699eaff9df0a2730380534.1531953780.git.osandov@fb.com
Signed-off-by: Omar Sandoval <osandov@fb.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Bhupesh Sharma <bhsharma@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: James Morse <james.morse@arm.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There's a theoretical race condition that will cause /proc/kcore to miss
a memory hotplug event:
CPU0 CPU1
// hotplug event 1
kcore_need_update = 1
open_kcore() open_kcore()
kcore_update_ram() kcore_update_ram()
// Walk RAM // Walk RAM
__kcore_update_ram() __kcore_update_ram()
kcore_need_update = 0
// hotplug event 2
kcore_need_update = 1
kcore_need_update = 0
Note that CPU1 set up the RAM kcore entries with the state after hotplug
event 1 but cleared the flag for hotplug event 2. The RAM entries will
therefore be stale until there is another hotplug event.
This is an extremely unlikely sequence of events, but the fix makes the
synchronization saner, anyways: we serialize the entire update sequence,
which means that whoever clears the flag will always succeed in replacing
the kcore list.
Link: http://lkml.kernel.org/r/6106c509998779730c12400c1b996425df7d7089.1531953780.git.osandov@fb.com
Signed-off-by: Omar Sandoval <osandov@fb.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Bhupesh Sharma <bhsharma@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: James Morse <james.morse@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now we only need kclist_lock from user context and at fs init time, and
the following changes need to sleep while holding the kclist_lock.
Link: http://lkml.kernel.org/r/521ba449ebe921d905177410fee9222d07882f0d.1531953780.git.osandov@fb.com
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Bhupesh Sharma <bhsharma@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: James Morse <james.morse@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The memory hotplug notifier kcore_callback() only needs kclist_lock to
prevent races with __kcore_update_ram(), but we can easily eliminate that
race by using an atomic xchg() in __kcore_update_ram(). This is
preparation for converting kclist_lock to an rwsem.
Link: http://lkml.kernel.org/r/0a4bc89f4dbde8b5b2ea309f7b4fb6a85fe29df2.1531953780.git.osandov@fb.com
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Bhupesh Sharma <bhsharma@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: James Morse <james.morse@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "/proc/kcore improvements", v4.
This series makes a few improvements to /proc/kcore. It fixes a couple of
small issues in v3 but is otherwise the same. Patches 1, 2, and 3 are
prep patches. Patch 4 is a fix/cleanup. Patch 5 is another prep patch.
Patches 6 and 7 are optimizations to ->read(). Patch 8 makes it possible
to enable CRASH_CORE on any architecture, which is needed for patch 9.
Patch 9 adds vmcoreinfo to /proc/kcore.
This patch (of 9):
kclist_add() is only called at init time, so there's no point in grabbing
any locks. We're also going to replace the rwlock with a rwsem, which we
don't want to try grabbing during early boot.
While we're here, mark kclist_add() with __init so that we'll get a
warning if it's called from non-init code.
Link: http://lkml.kernel.org/r/98208db1faf167aa8b08eebfa968d95c70527739.1531953780.git.osandov@fb.com
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Bhupesh Sharma <bhsharma@redhat.com>
Tested-by: Bhupesh Sharma <bhsharma@redhat.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Bhupesh Sharma <bhsharma@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: James Morse <james.morse@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
elf_kcore_store_hdr() uses __pa() to find the physical address of
KCORE_RAM or KCORE_TEXT entries exported as program headers.
This trips CONFIG_DEBUG_VIRTUAL's checks, as the KCORE_TEXT entries are
not in the linear map.
Handle these two cases separately, using __pa_symbol() for the KCORE_TEXT
entries.
Link: http://lkml.kernel.org/r/20180711131944.15252-1-james.morse@arm.com
Signed-off-by: James Morse <james.morse@arm.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Omar Sandoval <osandov@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use new return type vm_fault_t for fault handler in struct
vm_operations_struct. For now, this is just documenting that the function
returns a VM_FAULT value rather than an errno. Once all instances are
converted, vm_fault_t will become a distinct type.
See 1c8f422059 ("mm: change return type to vm_fault_t") for reference.
Link: http://lkml.kernel.org/r/20180702153325.GA3875@jordon-HP-15-Notebook-PC
Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Ganesh Goudar <ganeshgr@chelsio.com>
Cc: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Number of CPUs is never high enough to force 64-bit arithmetic.
Save couple of bytes on x86_64.
Link: http://lkml.kernel.org/r/20180627200710.GC18434@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
->latency_record is defined as
struct latency_record[LT_SAVECOUNT];
so use the same macro whie iterating.
Link: http://lkml.kernel.org/r/20180627200534.GA18434@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Code checks if write is done by current to its own attributes.
For that get/put pair is unnecessary as it can be done under RCU.
Note: rcu_read_unlock() can be done even earlier since pointer to a task
is not dereferenced. It depends if /proc code should look scary or not:
rcu_read_lock();
task = pid_task(...);
rcu_read_unlock();
if (!task)
return -ESRCH;
if (task != current)
return -EACCESS:
P.S.: rename "length" variable. Code like this
length = -EINVAL;
should not exist.
Link: http://lkml.kernel.org/r/20180627200218.GF18113@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Readdir context is thread local, so ->pos is thread local,
move it out of readlock.
Link: http://lkml.kernel.org/r/20180627195339.GD18113@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
get_monotonic_boottime() is deprecated and uses the old timespec type.
Let's convert /proc/uptime to use ktime_get_boottime_ts64().
Link: http://lkml.kernel.org/r/20180620081746.282742-1-arnd@arndb.de
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
24074a35c5 ("proc: Make inline name size calculation automatic")
started to put PDE allocations into kmalloc-256 which is unnecessary as
~40 character names are very rare.
Put allocation back into kmalloc-192 cache for 64-bit non-debug builds.
Put BUILD_BUG_ON to know when PDE size has gotten out of control.
[adobriyan@gmail.com: fix BUILD_BUG_ON breakage on powerpc64]
Link: http://lkml.kernel.org/r/20180703191602.GA25521@avx2
Link: http://lkml.kernel.org/r/20180617215732.GA24688@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently, percpu memory only exposes allocation and utilization
information via debugfs. This more or less is only really useful for
understanding the fragmentation and allocation information at a per-chunk
level with a few global counters. This is also gated behind a config.
BPF and cgroup, for example, have seen an increase in use causing
increased use of percpu memory. Let's make it easier for someone to
identify how much memory is being used.
This patch adds the "Percpu" stat to meminfo to more easily look up how
much percpu memory is in use. This number includes the cost for all
allocated backing pages and not just insight at the per a unit, per chunk
level. Metadata is excluded. I think excluding metadata is fair because
the backing memory scales with the numbere of cpus and can quickly
outweigh the metadata. It also makes this calculation light.
Link: http://lkml.kernel.org/r/20180807184723.74919-1-dennisszhou@gmail.com
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Rather than in vm_area_alloc(). To ensure that the various oddball
stack-based vmas are in a good state. Some of the callers were zeroing
them out, others were not.
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Russell King <rmk+kernel@arm.linux.org.uk>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The /proc/pid/smaps_rollup file is currently implemented via the
m_start/m_next/m_stop seq_file iterators shared with the other maps files,
that iterate over vma's. However, the rollup file doesn't print anything
for each vma, only accumulate the stats.
There are some issues with the current code as reported in [1] - the
accumulated stats can get skewed if seq_file start()/stop() op is called
multiple times, if show() is called multiple times, and after seeks to
non-zero position.
Patch [1] fixed those within existing design, but I believe it is
fundamentally wrong to expose the vma iterators to the seq_file mechanism
when smaps_rollup shows logically a single set of values for the whole
address space.
This patch thus refactors the code to provide a single "value" at offset
0, with vma iteration to gather the stats done internally. This fixes the
situations where results are skewed, and simplifies the code, especially
in show_smap(), at the expense of somewhat less code reuse.
[1] https://marc.info/?l=linux-mm&m=151927723128134&w=2
[vbabka@suse.c: use seq_file infrastructure]
Link: http://lkml.kernel.org/r/bf4525b0-fd5b-4c4c-2cb3-adee3dd95a48@suse.cz
Link: http://lkml.kernel.org/r/20180723111933.15443-5-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reported-by: Daniel Colascione <dancol@google.com>
Reviewed-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
To prepare for handling /proc/pid/smaps_rollup differently from
/proc/pid/smaps factor out from show_smap() printing the parts of output
that are common for both variants, which is the bulk of the gathered
memory stats.
[vbabka@suse.cz: add const, per Alexey]
Link: http://lkml.kernel.org/r/b45f319f-cd04-337b-37f8-77f99786aa8a@suse.cz
Link: http://lkml.kernel.org/r/20180723111933.15443-4-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Daniel Colascione <dancol@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
To prepare for handling /proc/pid/smaps_rollup differently from
/proc/pid/smaps factor out vma mem stats gathering from show_smap() - it
will be used by both.
Link: http://lkml.kernel.org/r/20180723111933.15443-3-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Daniel Colascione <dancol@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "cleanups and refactor of /proc/pid/smaps*".
The recent regression in /proc/pid/smaps made me look more into the code.
Especially the issues with smaps_rollup reported in [1] as explained in
Patch 4, which fixes them by refactoring the code. Patches 2 and 3 are
preparations for that. Patch 1 is me realizing that there's a lot of
boilerplate left from times where we tried (unsuccessfuly) to mark thread
stacks in the output.
Originally I had also plans to rework the translation from
/proc/pid/*maps* file offsets to the internal structures. Now the offset
means "vma number", which is not really stable (vma's can come and go
between read() calls) and there's an extra caching of last vma's address.
My idea was that offsets would be interpreted directly as addresses, which
would also allow meaningful seeks (see the ugly seek_to_smaps_entry() in
tools/testing/selftests/vm/mlock2.h). However loff_t is (signed) long
long so that might be insufficient somewhere for the unsigned long
addresses.
So the result is fixed issues with skewed /proc/pid/smaps_rollup results,
simpler smaps code, and a lot of unused code removed.
[1] https://marc.info/?l=linux-mm&m=151927723128134&w=2
This patch (of 4):
Commit b76437579d ("procfs: mark thread stack correctly in
proc/<pid>/maps") introduced differences between /proc/PID/maps and
/proc/PID/task/TID/maps to mark thread stacks properly, and this was
also done for smaps and numa_maps. However it didn't work properly and
was ultimately removed by commit b18cb64ead ("fs/proc: Stop trying to
report thread stacks").
Now the is_pid parameter for the related show_*() functions is unused
and we can remove it together with wrapper functions and ops structures
that differ for PID and TID cases only in this parameter.
Link: http://lkml.kernel.org/r/20180723111933.15443-2-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Daniel Colascione <dancol@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
autofs_sbi() does not check the superblock magic number to verify it has
been given an autofs super block.
Link: http://lkml.kernel.org/r/153475422934.17131.7563724552005298277.stgit@pluto.themaw.net
Reported-by: <syzbot+87c3c541582e56943277@syzkaller.appspotmail.com>
Signed-off-by: Ian Kent <raven@themaw.net>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
XQM_MAXQUOTAS and MAXQUOTAS are, it appears, equivalent. Replace all
usage of XQM_MAXQUOTAS and remove it along with the unused XQM_*QUOTA
definitions.
Signed-off-by: Jeremy Cline <jcline@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
ida_alloc_max() matches what this driver wants to do. Also removes a
call to ida_pre_get(). We no longer need the protection of the mutex,
so convert pty_count to an atomic_t and remove the mutex entirely.
Signed-off-by: Matthew Wilcox <willy@infradead.org>
This contains various bug fixes and cleanups.
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCW3xvGwAKCRDh3BK/laaZ
PKECAP9qUpdtQ5RaIL/y9OGZzJLSZbBZuK3LGNY2u2B3EfrSjgEAvhkhXyOQgvVi
kgYLNszbg/C+w8U4Xc5GWB6cjNm6rwE=
=GJI7
-----END PGP SIGNATURE-----
Merge tag 'fuse-update-4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
Pull fuse update from Miklos Szeredi:
"Various bug fixes and cleanups"
* tag 'fuse-update-4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
fuse: reduce allocation size for splice_write
fuse: use kvmalloc to allocate array of pipe_buffer structs.
fuse: convert last timespec use to timespec64
fs: fuse: Adding new return type vm_fault_t
fuse: simplify fuse_abort_conn()
fuse: Add missed unlock_page() to fuse_readpages_fill()
fuse: Don't access pipe->buffers without pipe_lock()
fuse: fix initial parallel dirops
fuse: Fix oops at process_init_reply()
fuse: umount should wait for all requests
fuse: fix unlocked access to processing queue
fuse: fix double request_end()
This contains two new features:
1) Stack file operations: this allows removal of several hacks from the
VFS, proper interaction of read-only open files with copy-up,
possibility to implement fs modifying ioctls properly, and others.
2) Metadata only copy-up: when file is on lower layer and only metadata is
modified (except size) then only copy up the metadata and continue to
use the data from the lower file.
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCW3srhAAKCRDh3BK/laaZ
PC6tAQCP+KklcN+TvNp502f+O/kATahSpgnun4NY1/p4I8JV+AEAzdlkTN3+MiAO
fn9brN6mBK7h59DO3hqedPLJy2vrgwg=
=QDXH
-----END PGP SIGNATURE-----
Merge tag 'ovl-update-4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs
Pull overlayfs updates from Miklos Szeredi:
"This contains two new features:
- Stack file operations: this allows removal of several hacks from
the VFS, proper interaction of read-only open files with copy-up,
possibility to implement fs modifying ioctls properly, and others.
- Metadata only copy-up: when file is on lower layer and only
metadata is modified (except size) then only copy up the metadata
and continue to use the data from the lower file"
* tag 'ovl-update-4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs: (66 commits)
ovl: Enable metadata only feature
ovl: Do not do metacopy only for ioctl modifying file attr
ovl: Do not do metadata only copy-up for truncate operation
ovl: add helper to force data copy-up
ovl: Check redirect on index as well
ovl: Set redirect on upper inode when it is linked
ovl: Set redirect on metacopy files upon rename
ovl: Do not set dentry type ORIGIN for broken hardlinks
ovl: Add an inode flag OVL_CONST_INO
ovl: Treat metacopy dentries as type OVL_PATH_MERGE
ovl: Check redirects for metacopy files
ovl: Move some dir related ovl_lookup_single() code in else block
ovl: Do not expose metacopy only dentry from d_real()
ovl: Open file with data except for the case of fsync
ovl: Add helper ovl_inode_realdata()
ovl: Store lower data inode in ovl_inode
ovl: Fix ovl_getattr() to get number of blocks from lower
ovl: Add helper ovl_dentry_lowerdata() to get lower data dentry
ovl: Copy up meta inode data from lowest data inode
ovl: Modify ovl_lookup() and friends to lookup metacopy dentry
...
- Fix an uninitialized variable
- Don't use obviously garbage AG header counters to calculate
transaction reservations
- Trigger icount recalculation on bad icount when monting.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAlty8tsACgkQ+H93GTRK
tOtoKw/+OeCaY6jZc2JoztBwLSUJsMYQ0R8Wsj5GRb4bVp9b0zes7RJMFU03nCtj
XuE4Rhdsx+6+QZQKxTq/Z6lrKHEjF0kL1EVGHtL46Inr+Z+Rr4bLBG6NV1o0dg7B
CR1IqW5vYcZ7Vrk9ko/RXVXtuCIxBS5jSW/S/uFT95Y4lVMAf/2asR/OoYt5ZVE3
17CUfWRifiSGoBQpjtfZd63F23XlEEusiErC5iS9rUbE2qC9FxP9EuvoUP5M/n01
nLS34Fjw7X739AiwHbf10fQPOvBr7atTazCXskjy4gbwqIWTmuhbF4ieTU1OfTI8
ozhvYomBYLiZbsEYBhVCs09VEnIfHmf2HoLh//efGE8VEvoQllxdn/g2PQekoPAn
M7VnRUXCTvaLI8IE2d3Ed1VWm0OTea09xqEiNpB0XGjegim9pXuf6t/zbe4R0vJy
YLBgQT8XRPw5ZgCnBbxvZOXXxQtAqKnqZzYSWGxlHJhhduKVeKMqerhP0nn0ui8g
wAOmOe3XEoyLfSY8WY0ACEEEA00pAwErerwVEFLCpaKTh5GOY4i3OBdqcZOtXacn
f5oIeG9HZKAXKkOTGwpq1zGHTOYhz4mxAYhodRFiEE8rXHDa9odUWQ/iG0zgZaO6
19xznXjXkVWVg0QJqQJi6SbEkkrAEFtFRYH+VPTgWM/1tg47a14=
=+0Eq
-----END PGP SIGNATURE-----
Merge tag 'xfs-4.19-merge-7' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs fixes from Darrick Wong:
- Fix an uninitialized variable
- Don't use obviously garbage AG header counters to calculate
transaction reservations
- Trigger icount recalculation on bad icount when mounting
* tag 'xfs-4.19-merge-7' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
iomap: fix WARN_ON_ONCE on uninitialized variable
xfs: sanity check ag header values in xrep_calc_ag_resblks
xfs: recalculate summary counters at mount time if icount is bad
Pull core signal handling updates from Eric Biederman:
"It was observed that a periodic timer in combination with a
sufficiently expensive fork could prevent fork from every completing.
This contains the changes to remove the need for that restart.
This set of changes is split into several parts:
- The first part makes PIDTYPE_TGID a proper pid type instead
something only for very special cases. The part starts using
PIDTYPE_TGID enough so that in __send_signal where signals are
actually delivered we know if the signal is being sent to a a group
of processes or just a single process.
- With that prep work out of the way the logic in fork is modified so
that fork logically makes signals received while it is running
appear to be received after the fork completes"
* 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (22 commits)
signal: Don't send signals to tasks that don't exist
signal: Don't restart fork when signals come in.
fork: Have new threads join on-going signal group stops
fork: Skip setting TIF_SIGPENDING in ptrace_init_task
signal: Add calculate_sigpending()
fork: Unconditionally exit if a fatal signal is pending
fork: Move and describe why the code examines PIDNS_ADDING
signal: Push pid type down into complete_signal.
signal: Push pid type down into __send_signal
signal: Push pid type down into send_signal
signal: Pass pid type into do_send_sig_info
signal: Pass pid type into send_sigio_to_task & send_sigurg_to_task
signal: Pass pid type into group_send_sig_info
signal: Pass pid and pid type into send_sigqueue
posix-timers: Noralize good_sigevent
signal: Use PIDTYPE_TGID to clearly store where file signals will be sent
pid: Implement PIDTYPE_TGID
pids: Move the pgrp and session pid pointers from task_struct to signal_struct
kvm: Don't open code task_pid in kvm_vcpu_ioctl
pids: Compute task_tgid using signal->leader_pid
...
If we knew that the file was empty, we wouldn't be asking for a layout.
Any optimisation here is already done before calling pnfs_update_layout().
As it stands, we sometimes end up doing an unnecessary inband read to
the MDS even when holding a layout.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
If we get an error while retrieving the layout, then we should
report it rather than falling back to I/O through the MDS.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
The largest block size supported by isofs is ISOFS_BLOCK_SIZE (2048), but
isofs_fill_super calls sb_min_blocksize and sets the blocksize to the
device's logical block size if it's larger than what we ended up with after
option parsing.
If for some reason we try to mount a hard 4k device as an isofs filesystem,
we'll set opt.blocksize to 4096, and when we try to read the superblock
we found via:
block = iso_blknum << (ISOFS_BLOCK_BITS - s->s_blocksize_bits)
with s_blocksize_bits greater than ISOFS_BLOCK_BITS, we'll have a negative
shift and the bread will fail somewhat cryptically:
isofs_fill_super: bread failed, dev=sda, iso_blknum=17, block=-2147483648
It seems best to just catch and clearly reject mounts of such a device.
Reported-by: Bryan Gurney <bgurney@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
The f2fs_gc() called by f2fs_balance_fs() requires to be called outside of
fi->i_gc_rwsem[WRITE], since f2fs_gc() can try to grab it in a loop.
If it hits the miximum retrials in GC, let's give a chance to release
gc_mutex for a short time in order not to go into live lock in the worst
case.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This reverts the commit - "b93f771 - f2fs: remove writepages lock"
to fix the drop in sequential read throughput.
Test: ./tiotest -t 32 -d /data/tio_tmp -f 32 -b 524288 -k 1 -k 3 -L
device: UFS
Before -
read throughput: 185 MB/s
total read requests: 85177 (of these ~80000 are 4KB size requests).
total write requests: 2546 (of these ~2208 requests are written in 512KB).
After -
read throughput: 758 MB/s
total read requests: 2417 (of these ~2042 are 512KB reads).
total write requests: 2701 (of these ~2034 requests are written in 512KB).
Signed-off-by: Sahitya Tummala <stummala@codeaurora.org>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
- Restructure of lockdep and latency tracers
This is the biggest change. Joel Fernandes restructured the hooks
from irqs and preemption disabling and enabling. He got rid of
a lot of the preprocessor #ifdef mess that they caused.
He turned both lockdep and the latency tracers to use trace events
inserted in the preempt/irqs disabling paths. But unfortunately,
these started to cause issues in corner cases. Thus, parts of the
code was reverted back to where lockde and the latency tracers
just get called directly (without using the trace events).
But because the original change cleaned up the code very nicely
we kept that, as well as the trace events for preempt and irqs
disabling, but they are limited to not being called in NMIs.
- Have trace events use SRCU for "rcu idle" calls. This was required
for the preempt/irqs off trace events. But it also had to not
allow them to be called in NMI context. Waiting till Paul makes
an NMI safe SRCU API.
- New notrace SRCU API to allow trace events to use SRCU.
- Addition of mcount-nop option support
- SPDX headers replacing GPL templates.
- Various other fixes and clean ups.
- Some fixes are marked for stable, but were not fully tested
before the merge window opened.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCW3ruhRQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qiM7AP47NhYdSnCFCRUJfrt6PovXmQtuCHt3
c3QMoGGdvzh9YAEAqcSXwh7uLhpHUp1LjMAPkXdZVwNddf4zJQ1zyxQ+EAU=
=vgEr
-----END PGP SIGNATURE-----
Merge tag 'trace-v4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
- Restructure of lockdep and latency tracers
This is the biggest change. Joel Fernandes restructured the hooks
from irqs and preemption disabling and enabling. He got rid of a lot
of the preprocessor #ifdef mess that they caused.
He turned both lockdep and the latency tracers to use trace events
inserted in the preempt/irqs disabling paths. But unfortunately,
these started to cause issues in corner cases. Thus, parts of the
code was reverted back to where lockdep and the latency tracers just
get called directly (without using the trace events). But because the
original change cleaned up the code very nicely we kept that, as well
as the trace events for preempt and irqs disabling, but they are
limited to not being called in NMIs.
- Have trace events use SRCU for "rcu idle" calls. This was required
for the preempt/irqs off trace events. But it also had to not allow
them to be called in NMI context. Waiting till Paul makes an NMI safe
SRCU API.
- New notrace SRCU API to allow trace events to use SRCU.
- Addition of mcount-nop option support
- SPDX headers replacing GPL templates.
- Various other fixes and clean ups.
- Some fixes are marked for stable, but were not fully tested before
the merge window opened.
* tag 'trace-v4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (44 commits)
tracing: Fix SPDX format headers to use C++ style comments
tracing: Add SPDX License format tags to tracing files
tracing: Add SPDX License format to bpf_trace.c
blktrace: Add SPDX License format header
s390/ftrace: Add -mfentry and -mnop-mcount support
tracing: Add -mcount-nop option support
tracing: Avoid calling cc-option -mrecord-mcount for every Makefile
tracing: Handle CC_FLAGS_FTRACE more accurately
Uprobe: Additional argument arch_uprobe to uprobe_write_opcode()
Uprobes: Simplify uprobe_register() body
tracepoints: Free early tracepoints after RCU is initialized
uprobes: Use synchronize_rcu() not synchronize_sched()
tracing: Fix synchronizing to event changes with tracepoint_synchronize_unregister()
ftrace: Remove unused pointer ftrace_swapper_pid
tracing: More reverting of "tracing: Centralize preemptirq tracepoints and unify their usage"
tracing/irqsoff: Handle preempt_count for different configs
tracing: Partial revert of "tracing: Centralize preemptirq tracepoints and unify their usage"
tracing: irqsoff: Account for additional preempt_disable
trace: Use rcu_dereference_raw for hooks from trace-event subsystem
tracing/kprobes: Fix within_notrace_func() to check only notrace functions
...
basic support for rbd images within namespaces (myself). Also included
y2038 conversion patches from Arnd, a pile of miscellaneous fixes from
Chengguang and Zheng's feature bit infrastructure for the filesystem.
-----BEGIN PGP SIGNATURE-----
iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAlt62CkTHGlkcnlvbW92
QGdtYWlsLmNvbQAKCRBKf944AhHzizfhB/0c/rz6frunc6EcZMWuBNzlOIOktJ/m
MEbPGjCxMAsmidO1rqHHYF4iEN5hr+3AWTbtIL2m6wkqYVdg3FjmNaAYB27AdQMG
kH9bLfrKIew72/NZqXfm25yjY/86kIt8t91kay4Lchc97tSYhnFSnku7iAX2HTND
TMhq/1O/GvEyw/RmqnenJEQqFJvKnfgPPQm6W8sM2bH0T5j+EXmDT/Rv+90LogFR
J4+pZkHqDfvyMb1WJ5MkumohytbRVzRNKcMpOvjquJSqUgtgZa2JdrIsypDqSNKY
nUT6jGGlxoSbHCqRwDJoFEJOlh5A9RwKqYxNuM2a/vs9u7HpvdCK/Iah
=AtgY
-----END PGP SIGNATURE-----
Merge tag 'ceph-for-4.19-rc1' of git://github.com/ceph/ceph-client
Pull ceph updates from Ilya Dryomov:
"The main things are support for cephx v2 authentication protocol and
basic support for rbd images within namespaces (myself).
Also included are y2038 conversion patches from Arnd, a pile of
miscellaneous fixes from Chengguang and Zheng's feature bit
infrastructure for the filesystem"
* tag 'ceph-for-4.19-rc1' of git://github.com/ceph/ceph-client: (40 commits)
ceph: don't drop message if it contains more data than expected
ceph: support cephfs' own feature bits
crush: fix using plain integer as NULL warning
libceph: remove unnecessary non NULL check for request_key
ceph: refactor error handling code in ceph_reserve_caps()
ceph: refactor ceph_unreserve_caps()
ceph: change to void return type for __do_request()
ceph: compare fsc->max_file_size and inode->i_size for max file size limit
ceph: add additional size check in ceph_setattr()
ceph: add additional offset check in ceph_write_iter()
ceph: add additional range check in ceph_fallocate()
ceph: add new field max_file_size in ceph_fs_client
libceph: weaken sizeof check in ceph_x_verify_authorizer_reply()
libceph: check authorizer reply/challenge length before reading
libceph: implement CEPHX_V2 calculation mode
libceph: add authorizer challenge
libceph: factor out encrypt_authorizer()
libceph: factor out __ceph_x_decrypt()
libceph: factor out __prepare_write_connect()
libceph: store ceph_auth_handshake pointer in ceph_connection
...
When inode is getting deleted and someone else holds reference to a mark
attached to the inode, we just detach the connector from the inode. In
that case fsnotify_put_mark() called from fsnotify_destroy_marks() will
decide to recalculate mask for the inode and __fsnotify_recalc_mask()
will WARN about invalid connector type:
WARNING: CPU: 1 PID: 12015 at fs/notify/mark.c:139
__fsnotify_recalc_mask+0x2d7/0x350 fs/notify/mark.c:139
Actually there's no reason to warn about detached connector in
__fsnotify_recalc_mask() so just silently skip updating the mask in such
case.
Reported-by: syzbot+c34692a51b9a6ca93540@syzkaller.appspotmail.com
Fixes: 3ac70bfcde ("fsnotify: add helper to get mask from connector")
Signed-off-by: Jan Kara <jack@suse.cz>
Here are all of the driver core and related patches for 4.19-rc1.
Nothing huge here, just a number of small cleanups and the ability to
now stop the deferred probing after init happens.
All of these have been in linux-next for a while with only a merge issue
reported. That merge issue is in fs/sysfs/group.c and Stephen has
posted the diff of what it should be to resolve this. I'll follow up
with that diff to this pull request.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCW3g86Q8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+ynyXQCePaZSW8wft4b7nLN8RdZ98ATBru0Ani10lrJa
HQeQJRNbWU1AZ0ym7695
=tOaH
-----END PGP SIGNATURE-----
Merge tag 'driver-core-4.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core updates from Greg KH:
"Here are all of the driver core and related patches for 4.19-rc1.
Nothing huge here, just a number of small cleanups and the ability to
now stop the deferred probing after init happens.
All of these have been in linux-next for a while with only a merge
issue reported"
* tag 'driver-core-4.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (21 commits)
base: core: Remove WARN_ON from link dependencies check
drivers/base: stop new probing during shutdown
drivers: core: Remove glue dirs from sysfs earlier
driver core: remove unnecessary function extern declare
sysfs.h: fix non-kernel-doc comment
PM / Domains: Stop deferring probe at the end of initcall
iommu: Remove IOMMU_OF_DECLARE
iommu: Stop deferring probe at end of initcalls
pinctrl: Support stopping deferred probe after initcalls
dt-bindings: pinctrl: add a 'pinctrl-use-default' property
driver core: allow stopping deferred probe after init
driver core: add a debugfs entry to show deferred devices
sysfs: Fix internal_create_group() for named group updates
base: fix order of OF initialization
linux/device.h: fix kernel-doc notation warning
Documentation: update firmware loader fallback reference
kobject: Replace strncpy with memcpy
drivers: base: cacheinfo: use OF property_read_u32 instead of get_property,read_number
kernfs: Replace strncpy with memcpy
device: Add #define dev_fmt similar to #define pr_fmt
...
This tag is the same as 9p-for-4.19 without the two MAINTAINERS patches
Contains mostly fixes (6 to be backported to stable) and a few changes,
here is the breakdown:
* Rework how fids are attributed by replacing some custom tracking in a
list by an idr (f28cdf0430)
* For packet-based transports (virtio/rdma) validate that the packet
length matches what the header says (f984579a01)
* A few race condition fixes found by syzkaller (9f476d7c54,
430ac66eb4)
* Missing argument check when NULL device is passed in sys_mount
(10aa14527f)
* A few virtio fixes (23cba9cbde, 31934da810, d28c756cae)
* Some spelling and style fixes
----------------------------------------------------------------
Chirantan Ekbote (1):
9p/net: Fix zero-copy path in the 9p virtio transport
Colin Ian King (1):
fs/9p/v9fs.c: fix spelling mistake "Uknown" -> "Unknown"
Jean-Philippe Brucker (1):
net/9p: fix error path of p9_virtio_probe
Matthew Wilcox (4):
9p: Fix comment on smp_wmb
9p: Change p9_fid_create calling convention
9p: Replace the fidlist with an IDR
9p: Embed wait_queue_head into p9_req_t
Souptick Joarder (1):
fs/9p/vfs_file.c: use new return type vm_fault_t
Stephen Hemminger (1):
9p: fix whitespace issues
Tomas Bortoli (5):
net/9p/client.c: version pointer uninitialized
net/9p/trans_fd.c: fix race-condition by flushing workqueue before the kfree()
net/9p/trans_fd.c: fix race by holding the lock
9p: validate PDU length
9p: fix multiple NULL-pointer-dereferences
jiangyiwen (2):
net/9p/virtio: Fix hard lockup in req_done
9p/virtio: fix off-by-one error in sg list bounds check
piaojun (5):
net/9p/client.c: add missing '\n' at the end of p9_debug()
9p/net/protocol.c: return -ENOMEM when kmalloc() failed
net/9p/trans_virtio.c: fix some spell mistakes in comments
fs/9p/xattr.c: catch the error of p9_client_clunk when setting xattr failed
net/9p/trans_virtio.c: add null terminal for mount tag
fs/9p/v9fs.c | 2 +-
fs/9p/vfs_file.c | 2 +-
fs/9p/xattr.c | 6 ++++--
include/net/9p/client.h | 11 ++++-------
net/9p/client.c | 119 +++++++++++++++++++++++++++++++++++++++++++++++++-------------------------------------------------------------------
net/9p/protocol.c | 2 +-
net/9p/trans_fd.c | 22 +++++++++++++++-------
net/9p/trans_rdma.c | 4 ++++
net/9p/trans_virtio.c | 66 +++++++++++++++++++++++++++++++++++++---------------------------
net/9p/trans_xen.c | 3 +++
net/9p/util.c | 1 -
12 files changed, 122 insertions(+), 116 deletions(-)
-----BEGIN PGP SIGNATURE-----
iF0EABECAB0WIQQ8idm2ZSicIMLgzKqoqIItDqvwPAUCW3ElNwAKCRCoqIItDqvw
PMzfAKCkCYFyNC89vcpxcCNsK7rFQ1qKlwCgoaBpZDdegOu0jMB7cyKwAWrB0LM=
=h3T0
-----END PGP SIGNATURE-----
Merge tag '9p-for-4.19-2' of git://github.com/martinetd/linux
Pull 9p updates from Dominique Martinet:
"This contains mostly fixes (6 to be backported to stable) and a few
changes, here is the breakdown:
- rework how fids are attributed by replacing some custom tracking in
a list by an idr
- for packet-based transports (virtio/rdma) validate that the packet
length matches what the header says
- a few race condition fixes found by syzkaller
- missing argument check when NULL device is passed in sys_mount
- a few virtio fixes
- some spelling and style fixes"
* tag '9p-for-4.19-2' of git://github.com/martinetd/linux: (21 commits)
net/9p/trans_virtio.c: add null terminal for mount tag
9p/virtio: fix off-by-one error in sg list bounds check
9p: fix whitespace issues
9p: fix multiple NULL-pointer-dereferences
fs/9p/xattr.c: catch the error of p9_client_clunk when setting xattr failed
9p: validate PDU length
net/9p/trans_fd.c: fix race by holding the lock
net/9p/trans_fd.c: fix race-condition by flushing workqueue before the kfree()
net/9p/virtio: Fix hard lockup in req_done
net/9p/trans_virtio.c: fix some spell mistakes in comments
9p/net: Fix zero-copy path in the 9p virtio transport
9p: Embed wait_queue_head into p9_req_t
9p: Replace the fidlist with an IDR
9p: Change p9_fid_create calling convention
9p: Fix comment on smp_wmb
net/9p/client.c: version pointer uninitialized
fs/9p/v9fs.c: fix spelling mistake "Uknown" -> "Unknown"
net/9p: fix error path of p9_virtio_probe
9p/net/protocol.c: return -ENOMEM when kmalloc() failed
net/9p/client.c: add missing '\n' at the end of p9_debug()
...
Merge updates from Andrew Morton:
- a few misc things
- a few Y2038 fixes
- ntfs fixes
- arch/sh tweaks
- ocfs2 updates
- most of MM
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (111 commits)
mm/hmm.c: remove unused variables align_start and align_end
fs/userfaultfd.c: remove redundant pointer uwq
mm, vmacache: hash addresses based on pmd
mm/list_lru: introduce list_lru_shrink_walk_irq()
mm/list_lru.c: pass struct list_lru_node* as an argument to __list_lru_walk_one()
mm/list_lru.c: move locking from __list_lru_walk_one() to its caller
mm/list_lru.c: use list_lru_walk_one() in list_lru_walk_node()
mm, swap: make CONFIG_THP_SWAP depend on CONFIG_SWAP
mm/sparse: delete old sparse_init and enable new one
mm/sparse: add new sparse_init_nid() and sparse_init()
mm/sparse: move buffer init/fini to the common place
mm/sparse: use the new sparse buffer functions in non-vmemmap
mm/sparse: abstract sparse buffer allocations
mm/hugetlb.c: don't zero 1GiB bootmem pages
mm, page_alloc: double zone's batchsize
mm/oom_kill.c: document oom_lock
mm/hugetlb: remove gigantic page support for HIGHMEM
mm, oom: remove sleep from under oom_lock
kernel/dma: remove unsupported gfp_mask parameter from dma_alloc_from_contiguous()
mm/cma: remove unsupported gfp_mask parameter from cma_alloc()
...
Pointer uwq is being assigned but is never used hence it is redundant
and can be removed.
Cleans up clang warning:
warning: variable 'uwq' set but not used [-Wunused-but-set-variable]
Link: http://lkml.kernel.org/r/20180717090802.18357-1-colin.king@canonical.com
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We need to distinguish the situations when shrinker has very small
amount of objects (see vfs_pressure_ratio() called from
super_cache_count()), and when it has no objects at all. Currently, in
the both of these cases, shrinker::count_objects() returns 0.
The patch introduces new SHRINK_EMPTY return value, which will be used
for "no objects at all" case. It's is a refactoring mostly, as
SHRINK_EMPTY is replaced by 0 by all callers of do_shrink_slab() in this
patch, and all the magic will happen in further.
Link: http://lkml.kernel.org/r/153063069574.1818.11037751256699341813.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Tested-by: Shakeel Butt <shakeelb@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Li RongQing <lirongqing@baidu.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Matthias Kaehlcke <mka@chromium.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Sahitya Tummala <stummala@codeaurora.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add list_lru::shrinker_id field and populate it by registered shrinker
id.
This will be used to set correct bit in memcg shrinkers map by lru code
in next patches, after there appeared the first related to memcg element
in list_lru.
Link: http://lkml.kernel.org/r/153063059758.1818.14866596416857717800.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Tested-by: Shakeel Butt <shakeelb@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Li RongQing <lirongqing@baidu.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Matthias Kaehlcke <mka@chromium.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Sahitya Tummala <stummala@codeaurora.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Do two list_lru_init_memcg() calls after prealloc_super().
destroy_unused_super() in fail path is OK with this. Next patch needs
such the order.
Link: http://lkml.kernel.org/r/153063058712.1818.3382490999719078571.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Tested-by: Shakeel Butt <shakeelb@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Li RongQing <lirongqing@baidu.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Matthias Kaehlcke <mka@chromium.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Sahitya Tummala <stummala@codeaurora.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The buffer_head can consume a significant amount of system memory and is
directly related to the amount of page cache. In our production
environment we have observed that a lot of machines are spending a
significant amount of memory as buffer_head and can not be left as
system memory overhead.
Charging buffer_head is not as simple as adding __GFP_ACCOUNT to the
allocation. The buffer_heads can be allocated in a memcg different from
the memcg of the page for which buffer_heads are being allocated. One
concrete example is memory reclaim. The reclaim can trigger I/O of
pages of any memcg on the system. So, the right way to charge
buffer_head is to extract the memcg from the page for which buffer_heads
are being allocated and then use targeted memcg charging API.
[shakeelb@google.com: use __GFP_ACCOUNT for directed memcg charging]
Link: http://lkml.kernel.org/r/20180702220208.213380-1-shakeelb@google.com
Link: http://lkml.kernel.org/r/20180627191250.209150-3-shakeelb@google.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "Directed kmem charging", v8.
The Linux kernel's memory cgroup allows limiting the memory usage of the
jobs running on the system to provide isolation between the jobs. All
the kernel memory allocated in the context of the job and marked with
__GFP_ACCOUNT will also be included in the memory usage and be limited
by the job's limit.
The kernel memory can only be charged to the memcg of the process in
whose context kernel memory was allocated. However there are cases
where the allocated kernel memory should be charged to the memcg
different from the current processes's memcg. This patch series
contains two such concrete use-cases i.e. fsnotify and buffer_head.
The fsnotify event objects can consume a lot of system memory for large
or unlimited queues if there is either no or slow listener. The events
are allocated in the context of the event producer. However they should
be charged to the event consumer. Similarly the buffer_head objects can
be allocated in a memcg different from the memcg of the page for which
buffer_head objects are being allocated.
To solve this issue, this patch series introduces mechanism to charge
kernel memory to a given memcg. In case of fsnotify events, the memcg
of the consumer can be used for charging and for buffer_head, the memcg
of the page can be charged. For directed charging, the caller can use
the scope API memalloc_[un]use_memcg() to specify the memcg to charge
for all the __GFP_ACCOUNT allocations within the scope.
This patch (of 2):
A lot of memory can be consumed by the events generated for the huge or
unlimited queues if there is either no or slow listener. This can cause
system level memory pressure or OOMs. So, it's better to account the
fsnotify kmem caches to the memcg of the listener.
However the listener can be in a different memcg than the memcg of the
producer and these allocations happen in the context of the event
producer. This patch introduces remote memcg charging API which the
producer can use to charge the allocations to the memcg of the listener.
There are seven fsnotify kmem caches and among them allocations from
dnotify_struct_cache, dnotify_mark_cache, fanotify_mark_cache and
inotify_inode_mark_cachep happens in the context of syscall from the
listener. So, SLAB_ACCOUNT is enough for these caches.
The objects from fsnotify_mark_connector_cachep are not accounted as
they are small compared to the notification mark or events and it is
unclear whom to account connector to since it is shared by all events
attached to the inode.
The allocations from the event caches happen in the context of the event
producer. For such caches we will need to remote charge the allocations
to the listener's memcg. Thus we save the memcg reference in the
fsnotify_group structure of the listener.
This patch has also moved the members of fsnotify_group to keep the size
same, at least for 64 bit build, even with additional member by filling
the holes.
[shakeelb@google.com: use GFP_KERNEL_ACCOUNT rather than open-coding it]
Link: http://lkml.kernel.org/r/20180702215439.211597-1-shakeelb@google.com
Link: http://lkml.kernel.org/r/20180627191250.209150-2-shakeelb@google.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
a_ops->readpages() is only ever used for read-ahead. Ensure that we
pass this information down to the block layer.
Link: http://lkml.kernel.org/r/20180621010725.17813-5-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Chris Mason <clm@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
a_ops->readpages() is only ever used for read-ahead. Ensure that we
pass this information down to the block layer.
Link: http://lkml.kernel.org/r/20180621010725.17813-4-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Chris Mason <clm@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
a_ops->readpages() is only ever used for read-ahead, yet we don't flag
the IO being submitted as such. Fix that up. Any file system that uses
mpage_readpages() as its ->readpages() implementation will now get this
right.
Since we're passing in whether the IO is read-ahead or not, we don't
need to pass in the 'gfp' separately, as it is dependent on the IO being
read-ahead. Kill off that member.
Add some documentation notes on ->readpages() being purely for
read-ahead.
Link: http://lkml.kernel.org/r/20180621010725.17813-3-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Chris Mason <clm@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "Submit ->readpages() IO as read-ahead", v4.
The only caller of ->readpages() is from read-ahead, yet we don't submit
IO flagged with REQ_RAHEAD. This means we don't see it in blktrace, for
instance, which is a shame. Additionally, it's preventing further
functional changes in the block layer for deadling with read-ahead more
intelligently. We already make assumptions about ->readpages() just
being for read-ahead in the mpage implementation, using
readahead_gfp_mask(mapping) as out GFP mask of choice.
This small series fixes up mpage_readpages() to submit with REQ_RAHEAD,
which takes care of file systems using mpage_readpages(). The first
patch is a prep patch, that makes do_mpage_readpage() take an argument
structure.
This patch (of 4):
We're currently passing 8 arguments to this function, clean it up a bit
by packing the arguments in an args structure we pass to it.
No intentional functional changes in this patch.
[akpm@linux-foundation.org: coding-style fixes]
Link: http://lkml.kernel.org/r/20180621010725.17813-2-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Chris Mason <clm@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The documentation for seq_file suggests that it is necessary to be able
to move the iterator to a given offset, however that is not the case.
If the iterator is stored in the private data and is stable from one
read() syscall to the next, it is only necessary to support first/next
interactions. Implementing this in a client is a little clumsy.
- if ->start() is given a pos of zero, it should go to start of
sequence.
- if ->start() is given the name pos that was given to the most recent
next() or start(), it should restore the iterator to state just
before that last call
- if ->start is given another number, it should set the iterator one
beyond the start just before the last ->start or ->next call.
Also, the documentation says that the implementation can interpret the
pos however it likes (other than zero meaning start), but seq_file
increments the pos sometimes which does impose on the implementation.
This patch simplifies the interface for first/next iteration and
simplifies the code, while maintaining complete backward compatability.
Now:
- if ->start() is given a pos of zero, it should return an iterator
placed at the start of the sequence
- if ->start() is given a non-zero pos, it should return the iterator
in the same state it was after the last ->start or ->next.
This is particularly useful for interators which walk the multiple
chains in a hash table, e.g. using rhashtable_walk*. See
fs/gfs2/glock.c and drivers/staging/lustre/lustre/llite/vvp_dev.c
A large part of achieving this is to *always* call ->next after ->show
has successfully stored all of an entry in the buffer. Never just
increment the index instead. Also:
- always pass &m->index to ->start() and ->next(), never a temp
variable
- don't clear ->from when ->count is zero, as ->from is dead when
->count is zero.
Some ->next functions do not increment *pos when they return NULL. To
maintain compatability with this, we still need to increment m->index in
one place, if ->next didn't increment it. Note that such ->next
functions are buggy and should be fixed. A simple demonstration is
dd if=/proc/swaps bs=1000 skip=1
Choose any block size larger than the size of /proc/swaps. This will
always show the whole last line of /proc/swaps.
This patch doesn't work around buggy next() functions for this case.
[neilb@suse.com: ensure ->from is valid]
Link: http://lkml.kernel.org/r/87601ryb8a.fsf@notabene.neil.brown.name
Signed-off-by: NeilBrown <neilb@suse.com>
Acked-by: Jonathan Corbet <corbet@lwn.net> [docs]
Tested-by: Jann Horn <jannh@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This flag was introduce in 2.1.37pre1 and the only place it was tested
was removed in 2.1.43pre1. The flag was never set.
Let's discard it properly.
Link: http://lkml.kernel.org/r/877en0hewz.fsf@notabene.neil.brown.name
Signed-off-by: NeilBrown <neilb@suse.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since only dentry->d_name.len + 1 bytes out of DNAME_INLINE_LEN bytes
are initialized at __d_alloc(), we can't copy the whole size
unconditionally.
WARNING: kmemcheck: Caught 32-bit read from uninitialized memory (ffff8fa27465ac50)
636f6e66696766732e746d70000000000010000000000000020000000188ffff
i i i i i i i i i i i i i u u u u u u u u u u i i i i i u u u u
^
RIP: 0010:take_dentry_name_snapshot+0x28/0x50
RSP: 0018:ffffa83000f5bdf8 EFLAGS: 00010246
RAX: 0000000000000020 RBX: ffff8fa274b20550 RCX: 0000000000000002
RDX: ffffa83000f5be40 RSI: ffff8fa27465ac50 RDI: ffffa83000f5be60
RBP: ffffa83000f5bdf8 R08: ffffa83000f5be48 R09: 0000000000000001
R10: ffff8fa27465ac00 R11: ffff8fa27465acc0 R12: ffff8fa27465ac00
R13: ffff8fa27465acc0 R14: 0000000000000000 R15: 0000000000000000
FS: 00007f79737ac8c0(0000) GS:ffffffff8fc30000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffff8fa274c0b000 CR3: 0000000134aa7002 CR4: 00000000000606f0
take_dentry_name_snapshot+0x28/0x50
vfs_rename+0x128/0x870
SyS_rename+0x3b2/0x3d0
entry_SYSCALL_64_fastpath+0x1a/0xa4
0xffffffffffffffff
Link: http://lkml.kernel.org/r/201709131912.GBG39012.QMJLOVFSFFOOtH@I-love.SAKURA.ne.jp
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Vegard Nossum <vegard.nossum@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There are a variety of functions and variables that are local to the
source and do not need to be in global scope, so make them static. Also
make a couple of char arrays static const.
Cleans up sparse warnings:
symbol 'o2hb_heartbeat_mode_desc' was not declared. Should it be static?
symbol 'o2hb_heartbeat_mode' was not declared. Should it be static?
symbol 'o2hb_dependent_users' was not declared. Should it be static?
symbol 'o2hb_region_dec_user' was not declared. Should it be static?
symbol 'o2nm_fence_method_desc' was not declared. Should it be static?
symbol 'lockdep_keys' was not declared. Should it be static?
Link: http://lkml.kernel.org/r/20180628131659.12133-1-colin.king@canonical.com
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <joseph.qi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Several functions have some unnecessary code, clean up these code.
Link: http://lkml.kernel.org/r/5B14DF72.5020800@huawei.com
Signed-off-by: Yan Wang <wangyan122@huawei.com>
Reviewed-by: Jun Piao <piaojun@huawei.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Changwei Ge <ge.changwei@h3c.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We should return -EROFS rather than other errno if filesystem becomes
read-only.
[akpm@linux-foundation.org: coding-style fixes]
Link: http://lkml.kernel.org/r/5B191B26.9010501@huawei.com
Signed-off-by: Jun Piao <piaojun@huawei.com>
Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com>
Acked-by: Joseph Qi <jiangqi903@gmail.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <ge.changwei@h3c.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In the quest to remove all stack VLA usage from the kernel[1], this
allocates the maximum size stack buffer. Existing checks already
require that blocksize >= NTFS_BLOCK_SIZE and mft_record_size <=
PAGE_SIZE, so max_bhs can be at most PAGE_SIZE / NTFS_BLOCK_SIZE.
Sanity checks are added for robustness.
[1] https://lkml.kernel.org/r/CA+55aFzCG-zNmZwX4A2FQpadafLfEzK6CC=qPXydAacU1RqZWA@mail.gmail.com
Link: http://lkml.kernel.org/r/20180626172909.41453-4-keescook@chromium.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Anton Altaparmakov <anton@tuxera.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In the quest to remove all stack VLA usage from the kernel[1], this
moves the stack buffer used during decompression to be allocated
externally.
The existing "dest_max_index" used in the VLA is bounded by cb_max_page.
cb_max_page is bounded by max_page, and max_page is bounded by nr_pages.
Since nr_pages is used for the "pages" allocation, it can similarly be
used for the "completed_pages" allocation and passed into the
decompression function. The error paths are updated to free the new
allocation.
[1] https://lkml.kernel.org/r/CA+55aFzCG-zNmZwX4A2FQpadafLfEzK6CC=qPXydAacU1RqZWA@mail.gmail.com
Link: http://lkml.kernel.org/r/20180626172909.41453-3-keescook@chromium.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Anton Altaparmakov <anton@tuxera.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
ntfs_end_buffer_async_read() disables interrupts around kmap_atomic().
This is a leftover from the old kmap_atomic() implementation which
relied on fixed mapping slots, so the caller had to make sure that the
same slot could not be reused from an interrupting context.
kmap_atomic() was changed to dynamic slots long ago and commit
1ec9c5ddc1 ("include/linux/highmem.h: remove the second argument of
k[un]map_atomic()") removed the slot assignements, but the callers were
not checked for now redundant interrupt disabling.
Remove the conditional interrupt disable.
Link: http://lkml.kernel.org/r/20180611144913.gln5mklhqcrfsoom@linutronix.de
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Anton Altaparmakov <anton@tuxera.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The VFS timestamps are all 64-bit now, the only missing piece for hpfs
is the internal conversion function. One interesting bit about hpfs is
that it can already deal with moving the 136 year window of its
timestamps to support a much wider range than other file systems with
32-bit timestamps. It also treats the timestamps as 'unsigned' on
64-bit architectures (but signed on 32-bit, because time_t always around
to negative numbers in 2038).
Changing the conversion to use time64_t makes 32-bit architectures
behave the same way as 64-bit. For completeness, this also adds a
clamp_t call for each conversion, so we don't wrap the timestamps but
instead stay within the [0..U32_MAX] range of the on-disk timestamps.
Link: http://lkml.kernel.org/r/20180718115017.742609-3-arnd@arndb.de
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now that the VFS has been converted from timespec to timespec64
timestamps, only the conversion to/from ntfs timestamps uses 32-bit
seconds.
This changes that last missing piece to get the ntfs implementation
y2038 safe on 32-bit architectures.
Link: http://lkml.kernel.org/r/20180718115017.742609-2-arnd@arndb.de
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Anton Altaparmakov <anton@tuxera.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
get_seconds() is deprecated because of the 32-bit overflow and will be
removed. All callers in ufs also truncate to a 32-bit number, so
nothing changes during the conversion, but this should be harmless as
the superblock and cylinder group timestamps are not visible to user
space, except for checking the fs-dirty state, wich works fine across
the overflow.
This moves the call to get_seconds() into a new inline function, with a
comment explaining the constraints, while converting it to
ktime_get_real_seconds().
Link: http://lkml.kernel.org/r/20180718115017.742609-1-arnd@arndb.de
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch is reworked from an earlier patch that Dan has posted:
https://patchwork.kernel.org/patch/10131727/
VM_MIXEDMAP is used by dax to direct mm paths like vm_normal_page() that
the memory page it is dealing with is not typical memory from the linear
map. The get_user_pages_fast() path, since it does not resolve the vma,
is already using {pte,pmd}_devmap() as a stand-in for VM_MIXEDMAP, so we
use that as a VM_MIXEDMAP replacement in some locations. In the cases
where there is no pte to consult we fallback to using vma_is_dax() to
detect the VM_MIXEDMAP special case.
Now that we have explicit driver pfn_t-flag opt-in/opt-out for
get_user_pages() support for DAX we can stop setting VM_MIXEDMAP. This
also means we no longer need to worry about safely manipulating vm_flags
in a future where we support dynamically changing the dax mode of a
file.
DAX should also now be supported with madvise_behavior(), vma_merge(),
and copy_page_range().
This patch has been tested against ndctl unit test. It has also been
tested against xfstests commit: 625515d using fake pmem created by
memmap and no additional issues have been observed.
Link: http://lkml.kernel.org/r/152847720311.55924.16999195879201817653.stgit@djiang5-desk3.ch.intel.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
reworking parts of the core code.
- Rework the so-called 'gid cache' and internal APIs to use a kref'd
pointer to a struct instead of copying, push this upwards into the
callers and add more stuff to the struct. The new design avoids some
ugly races the old one suffered with. This is part of the namespace
enablement work as the new struct is learning to be namespace aware.
- Various uapi cleanups, moving more stuff to include/uapi and fixing some
long standing bugs that have recently been discovered.
- Driver updates for mlx5, mlx4 i40iw, rxe, cxgb4, hfi1, usnic, pvrdma,
and hns
- Provide max_send_sge and max_recv_sge attributes to better support HW
where these values are asymmetric.
- mlx5 user API 'devx' allows sending commands directly to the device FW,
instead of trying to cram every wild and niche feature into the common
API. Sort of like what GPU does.
- Major write() and ioctl() API rework to cleanly support PCI device hot
unplug and advance the ioctl conversion work
- Sparse and compile warning cleanups
- Add 'const' to the ib_poll_cq() signature, and permit a NULL 'bad_wr',
which is the common use case
- Various patches to avoid high order allocations across the stack
- SRQ support for cxgb4, hns and qedr
- Changes to IPoIB to better follow the netdev model for working with
struct net_device liftime
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEfB7FMLh+8QxL+6i3OG33FX4gmxoFAlt17oMACgkQOG33FX4g
mxpRsQ//YZY1Gci1IoYLMuq0Rn9+/4lRHaBev+B728z1dvEFBW8m/i2DV5dPnSxO
AUN9dZOKBYYhc08h8vphtnBdMEtYJz6Dl76F8W+mt5vSuM5D4+0ba415RYSnV1Dc
d6Js33OTMVbQVHmYCIAXh9FNDX8lkywT346aXlMOpW3z74xoaLkkQ0cnfB0SEX0y
q9jiu70s6eisLlu9zJsXmCCLQ1b8eUD6IZm7hX8wMheuhDWyfrOv8JBeBCQdICuI
MASc2T7X8E++dvIePAL7Hgx/0SH/2Mit8zaJ0Sbt2OjBDcImLSs8bcple5gPoCPk
3vnCdb2GKg8xlxe3n1S89sGC1b8MY2CtQFElSs9C6npIGCwr2XlrZDDa0tE45+8I
miVhoswakmKW61KTCkVf2d9RXWcIh1qwUIpan1aZMsWdNnA6FYXIF054mMmJO44+
HUi2C93zAhx3XhFuX6O2YAHkG6CSXcZPfO7U9zy++GwAoXtGU0g6OLZbaYdEfuQh
lN8LLqxe3M5sMdDnHYc38AsLW9MmxyJXt+h2yLxtsdZ9jitypBDQxSVfAI68RNwL
BB1qELflF9FtAousQU9qhdNHimsgwctJ9MoZ6I1Aa1+ovwcSQgmKoQlNJIHkFroB
wUz2sz6q25OdLWDpFrGipmG7Kfnosg7xuBSYZUQMBzLmjg0HTVY=
=F50c
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull rdma updates from Jason Gunthorpe:
"This has been a large cycle for RDMA, with several major patch series
reworking parts of the core code.
- Rework the so-called 'gid cache' and internal APIs to use a kref'd
pointer to a struct instead of copying, push this upwards into the
callers and add more stuff to the struct. The new design avoids
some ugly races the old one suffered with. This is part of the
namespace enablement work as the new struct is learning to be
namespace aware.
- Various uapi cleanups, moving more stuff to include/uapi and fixing
some long standing bugs that have recently been discovered.
- Driver updates for mlx5, mlx4 i40iw, rxe, cxgb4, hfi1, usnic,
pvrdma, and hns
- Provide max_send_sge and max_recv_sge attributes to better support
HW where these values are asymmetric.
- mlx5 user API 'devx' allows sending commands directly to the device
FW, instead of trying to cram every wild and niche feature into the
common API. Sort of like what GPU does.
- Major write() and ioctl() API rework to cleanly support PCI device
hot unplug and advance the ioctl conversion work
- Sparse and compile warning cleanups
- Add 'const' to the ib_poll_cq() signature, and permit a NULL
'bad_wr', which is the common use case
- Various patches to avoid high order allocations across the stack
- SRQ support for cxgb4, hns and qedr
- Changes to IPoIB to better follow the netdev model for working with
struct net_device liftime"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (312 commits)
Revert "net/smc: Replace ib_query_gid with rdma_get_gid_attr"
RDMA/hns: Fix usage of bitmap allocation functions return values
IB/core: Change filter function return type from int to bool
IB/core: Update GID entries for netdevice whose mac address changes
IB/core: Add default GIDs of the bond master netdev
IB/core: Consider adding default GIDs of bond device
IB/core: Delete lower netdevice default GID entries in bonding scenario
IB/core: Avoid confusing del_netdev_default_ips
IB/core: Add comment for change upper netevent handling
qedr: Add user space support for SRQ
qedr: Add support for kernel mode SRQ's
qedr: Add wrapping generic structure for qpidr and adjust idr routines.
IB/mlx5: Fix leaking stack memory to userspace
Update the e-mail address of Bart Van Assche
IB/ucm: Fix compiling ucm.c
IB/uverbs: Do not check for device disassociation during ioctl
IB/uverbs: Remove struct uverbs_root_spec and all supporting code
IB/uverbs: Use uverbs_api to unmarshal ioctl commands
IB/uverbs: Use uverbs_alloc for allocations
IB/uverbs: Add a simple allocator to uverbs_attr_bundle
...
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAlt2mBQACgkQnJ2qBz9k
QNntGQgAluTTnuJLjoUDjFfT37Fjf2x1ve8rg6xmYS3YIhYTWWA1oazUIeyBDfwa
soutlfAZ/ix2bP1UEmeULxFhrCIXYBbWAe8s5MRqO/7s01QftNf0M72ASmd7gZRy
rSVt2/BWpr745mWI38tEKlIF4sQJVD7IGrnc1cQslPzleeCqsCXA+uBkBPMlcDpJ
ZWni2qK023y9E2dsg6RsJc1HemkQvrJtoLSVqRsdhty9GEuWseMbssdgz1zMXljQ
eXIALE5BssoxISIpH6qVKZRlr7UWGxOmV4CDPmku7DFLOSiwMk/Ml0V80BwzjNNY
hY8qfxcJOFOGZ8t82pWkVGMjgOAKjA==
=IN6Y
-----END PGP SIGNATURE-----
Merge tag 'fsnotify_for_v4.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull fsnotify updates from Jan Kara:
"fsnotify cleanups from Amir and a small inotify improvement"
* tag 'fsnotify_for_v4.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
inotify: Add flag IN_MASK_CREATE for inotify_add_watch()
fanotify: factor out helpers to add/remove mark
fsnotify: add helper to get mask from connector
fsnotify: let connector point to an abstract object
fsnotify: pass connp and object type to fsnotify_add_mark()
fsnotify: use typedef fsnotify_connp_t for brevity
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAlt2l2MACgkQnJ2qBz9k
QNlZMAgAwVu/bMsRR6PbXJIAYEUNLehrmgUfSdYxIFqnZPq84ZfpOMQZKDYJIO5d
WiLz9Z9pti/ldrQ33yllbJrsalAn8R+LB911eaKUvLscXyrIsoBxsBbOOtVZc9lZ
jaQBUMLStdPvE6LgW93f1EwIg/Z8CSTzaeCO31wlZl7s7wsBhjg3MJ3f9sR6LG0G
OKQZnjDxGbtsbeVl8cnOeeF3sd0kqYTT5EwSh+zkMIbHJQ0dbvEjj24TM9rHdzG2
AN35+rzFZeMHRGnfWsQ/I6il1nTuWIyPRpoc57cwV/dcYwpg1Pi6MZzrFcDsWfwx
rHgRJIkmSqi1S6Ic8o6s9fYsn6266A==
=ljWe
-----END PGP SIGNATURE-----
Merge tag 'for_v4.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull UDF and ext2 update from Jan Kara.
* tag 'for_v4.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
ext2: use ktime_get_real_seconds for timestamps
udf: convert inode stamps to timespec64
Commit e9894fd3e3 ("Btrfs: fix snapshot vs nocow writting") forced
nocow writes to fallback to COW, during writeback, when a snapshot is
created. This resulted in writes made before creating the snapshot to
unexpectedly fail with ENOSPC during writeback when success (0) was
returned to user space through the write system call.
The steps leading to this problem are:
1. When it's not possible to allocate data space for a write, the
buffered write path checks if a NOCOW write is possible. If it is,
it will not reserve space and success (0) is returned to user space.
2. Then when a snapshot is created, the root's will_be_snapshotted
atomic is incremented and writeback is triggered for all inode's that
belong to the root being snapshotted. Incrementing that atomic forces
all previous writes to fallback to COW during writeback (running
delalloc).
3. This results in the writeback for the inodes to fail and therefore
setting the ENOSPC error in their mappings, so that a subsequent
fsync on them will report the error to user space. So it's not a
completely silent data loss (since fsync will report ENOSPC) but it's
a very unexpected and undesirable behaviour, because if a clean
shutdown/unmount of the filesystem happens without previous calls to
fsync, it is expected to have the data present in the files after
mounting the filesystem again.
So fix this by adding a new atomic named snapshot_force_cow to the
root structure which prevents this behaviour and works the following way:
1. It is incremented when we start to create a snapshot after triggering
writeback and before waiting for writeback to finish.
2. This new atomic is now what is used by writeback (running delalloc)
to decide whether we need to fallback to COW or not. Because we
incremented this new atomic after triggering writeback in the
snapshot creation ioctl, we ensure that all buffered writes that
happened before snapshot creation will succeed and not fallback to
COW (which would make them fail with ENOSPC).
3. The existing atomic, will_be_snapshotted, is kept because it is used
to force new buffered writes, that start after we started
snapshotting, to reserve data space even when NOCOW is possible.
This makes these writes fail early with ENOSPC when there's no
available space to allocate, preventing the unexpected behaviour of
writeback later failing with ENOSPC due to a fallback to COW mode.
Fixes: e9894fd3e3 ("Btrfs: fix snapshot vs nocow writting")
Signed-off-by: Robbie Ko <robbieko@synology.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
rdma.git merge resolution for the 4.19 merge window
Conflicts:
drivers/infiniband/core/rdma_core.c
- Use the rdma code and revise with the new spelling for
atomic_fetch_add_unless
drivers/nvme/host/rdma.c
- Replace max_sge with max_send_sge in new blk code
drivers/nvme/target/rdma.c
- Use the blk code and revise to use NULL for ib_post_recv when
appropriate
- Replace max_sge with max_recv_sge in new blk code
net/rds/ib_send.c
- Use the net code and revise to use NULL for ib_post_recv when
appropriate
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
-----BEGIN PGP SIGNATURE-----
iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAltwm2geHHRvcnZhbGRz
QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGITkH/iSzkVhT2OxHoir0
mLVzTi7/Z17L0e/ELl7TvAC0iLFlWZKdlGR0g3b4/QpXLPmNK4HxiDRTQuWn8ke0
qDZyDq89HqLt+mpeFZ43PCd9oqV8CH2xxK3iCWReqv6bNnowGnRpSStlks4rDqWn
zURC/5sUh7TzEG4s997RrrpnyPeQWUlf/Mhtzg2/WvK2btoLWgu5qzjX1uFh3s7u
vaF2NXVJ3X03gPktyxZzwtO1SwLFS1jhwUXWBZ5AnoJ99ywkghQnkqS/2YpekNTm
wFk80/78sU+d91aAqO8kkhHj8VRrd+9SGnZ4mB2aZHwjZjGcics4RRtxukSfOQ+6
L47IdXo=
=sJkt
-----END PGP SIGNATURE-----
Merge tag 'v4.18' into rdma.git for-next
Resolve merge conflicts from the -rc cycle against the rdma.git tree:
Conflicts:
drivers/infiniband/core/uverbs_cmd.c
- New ifs added to ib_uverbs_ex_create_flow in -rc and for-next
- Merge removal of file->ucontext in for-next with new code in -rc
drivers/infiniband/core/uverbs_main.c
- for-next removed code from ib_uverbs_write() that was modified
in for-rc
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Yes, it is possible to get trapped in a loop, but the server should be
administratively revoking the recalled layout if it never gets returned.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
When we update the layout stateid in nfs4_layoutreturn_refresh_stateid, we
should also update the range in order to let the server know we're actually
returning everything.
Fixes: 16c278dbfa63 ("pnfs: Fix handling of NFS4ERR_OLD_STATEID replies...")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEIodevzQLVs53l6BhNqiEXrVAjGQFAlt0SSYACgkQNqiEXrVA
jGTwvA//aTnRN3lPQK0iEeiGw6lzhQX/cO7eXlTTGnHMMxy62NZTSUlZhTaoTm63
bg0MHknX13oikraMMpSEwsf4qoPcIgalTXmFd7RWmOHKw7GBd6LJznsRQQ4i3G8g
0a1KtXLyRXT68UgJ6U0BmukWBjNC1qG9ToWbBG8SXMhrxuFbpg4uPRtMl6eRI9fV
U0CH//x94TYXSB2D4N1eVzvRrIs4l0iJA1RxfqmYZSBQe7b7LW3GLafsIm0axQO3
hy7XEUtjWzhuGILRVTJ+9hmyTG41YARWYrG0Rdd4h0sB5nK/jl8YRZtofGs7zuBK
RqzJHUSNGPXza54O0bBKQk6IwTJTsjhWg2f3AXFComEP4hvTA50i/CAa/XBZYKam
Fq99C70txMA4Ufwrmh4dN+20qtMYFwuvpsbNMiyuQjCDUxXwvey4RNLc1o6J3tWH
1qVNNk/k+kJY704CqA+h7Ay0A1ocaa64glwPIcBDgP5Us72LE/QjDRE4NfQVg4wq
WbVO+Rml8kB+uU2ma2U2y4XXgZIFv7JWxmQ6fxWfMe2kH0+Z6Ech2D66t/oBW6w7
Q0kp2+YYaSpvIbKnQYzBQjW+W3kPIPAYLV4HptM89p0ZLi4DlKLi4EywbeGPsA/R
gynd4Uxi95TA+2bAnKSNuoNse/5mq+R2F4+RuVZMUHt2DWEz2iA=
=VxRQ
-----END PGP SIGNATURE-----
Merge tag 'jfs-4.19' of git://github.com/kleikamp/linux-shaggy
Pull jfs update from David Kleikamp:
"Just one jfs patch for 4.19"
* tag 'jfs-4.19' of git://github.com/kleikamp/linux-shaggy:
jfs: use time64_t for otime
Changes on top of v4.18-rc1 / iomap-4.19-merge-1:
1. Iomap support for buffered writes and for direct I/O.
2. Two patches that reduce the size of struct gfs2_inode.
3. Lots of fixes and cleanups.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJbcakSAAoJENW/n+sDE2U6AR0QAJlai+92ERML2pM+1hiuEHWP
KizBoV/53pc+dll1dlcEOHQFys2vbcFavcCtcsTXNhLSp1wOqxyzcFQTX6ekWfSZ
hTvJvAKTbeXu0zOWSV2DcX40JWb7SKDAxjxNb8XhL0COilgM9r+mdqoY/UNyVSel
SVmWvs8UYt6UBnw4G8h5UlzSYxl/M64udU1pVO5D8JMQ5cxDKj3kfFoJLLKBDwLF
vaNFxiihdTzmMwMNo3Px7GFSsb5Jnyo9LgAoDKsYd9YlzqGpAvvoYXH8itj4TuSb
sM1KTUZK+97XvquZfuv5BniEifP7XZSq4xYIxyr9HMaOefeys0GdzaCSCb3ifFte
7bqjowlAbHWwBNa9ofuJ1NShsAiOv0GUGDzlY+T/0IgSlqRr0JxAikJ3jLIZQ1Hf
CwWY66XakeSi5euDTi41SuGZMcxTXaX15VbXl6/SGsv4X0dyVXleBz6RuC9Q+n2H
7nqlGppRW2NB1WUqkJ15n9JaNLAF5I6umERTBXKGODM56p/GmZYoScCEGrqj9obN
CntPtL6yrazASjV3+zqXA//OvTb3xfykYu17wVLKhXWD0YWQiuDfA481BLvEut2G
aTtNU3b4VDwv5NuBf7G3wvN0+v3WyJ3gRfhTaEdFnX5PpcH3eHz5/fzU7zCJJwDu
g53icn3efqNu7WvpAkB+
=oGs4
-----END PGP SIGNATURE-----
Merge tag 'gfs2-4.19.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2
Pull gfs2 updates from Andreas Gruenbacher:
- iomap support for buffered writes and for direct I/O
- two patches that reduce the size of struct gfs2_inode
- lots of fixes and cleanups
* tag 'gfs2-4.19.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2: (25 commits)
gfs2: eliminate update_rgrp_lvb_unlinked
gfs2: Fix gfs2_testbit to use clone bitmaps
gfs2: Get rid of gfs2_ea_strlen
gfs2: cleanup: call gfs2_rgrp_ondisk2lvb from gfs2_rgrp_out
gfs2: Special-case rindex for gfs2_grow
GFS2: rgrp free blocks used incorrectly
gfs2: remove redundant variable 'moved'
gfs2: use iomap_readpage for blocksize == PAGE_SIZE
gfs2: Use iomap for stuffed direct I/O reads
gfs2: fallocate_chunk: Always initialize struct iomap
GFS2: Fix recovery issues for spectators
fs: gfs2: Adding new return type vm_fault_t
gfs2: using posix_acl_xattr_size instead of posix_acl_to_xattr
gfs2: Don't reject a supposedly full bitmap if we have blocks reserved
gfs2: Eliminate redundant ip->i_rgd
gfs2: Stop messing with ip->i_rgd in the rlist code
gfs2: Remove gfs2_write_{begin,end}
gfs2: iomap direct I/O support
gfs2: gfs2_extent_length cleanup
gfs2: iomap buffered write support
...
This is mostly updates to the usual drivers: mpt3sas, lpfc, qla2xxx,
hisi_sas, smartpqi, megaraid_sas, arcmsr. In addition, with the
continuing absence of Nic we have target updates for tcmu and target
core (all with reviews and acks). The biggest observable change is
going to be that we're (again) trying to switch to mulitqueue as the
default (a user can still override the setting on the kernel command
line). Other major core stuff is the removal of the remaining
Microchannel drivers, an update of the internal timers and some
reworks of completion and result handling.
Signed-off-by: James E.J. Bottomley <jejb@linux.vnet.ibm.com>
-----BEGIN PGP SIGNATURE-----
iJwEABMIAEQWIQTnYEDbdso9F2cI+arnQslM7pishQUCW3R3niYcamFtZXMuYm90
dG9tbGV5QGhhbnNlbnBhcnRuZXJzaGlwLmNvbQAKCRDnQslM7pishauRAP4yfBKK
dbxF81c/Bxi/Stk16FWkOOrjs4CizwmnMcpM5wD/UmM9o6ebDzaYpZgA8wIl7X/N
o/JckEZZpIp+5NySZNc=
=ggLB
-----END PGP SIGNATURE-----
Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI updates from James Bottomley:
"This is mostly updates to the usual drivers: mpt3sas, lpfc, qla2xxx,
hisi_sas, smartpqi, megaraid_sas, arcmsr.
In addition, with the continuing absence of Nic we have target updates
for tcmu and target core (all with reviews and acks).
The biggest observable change is going to be that we're (again) trying
to switch to mulitqueue as the default (a user can still override the
setting on the kernel command line).
Other major core stuff is the removal of the remaining Microchannel
drivers, an update of the internal timers and some reworks of
completion and result handling"
* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (203 commits)
scsi: core: use blk_mq_run_hw_queues in scsi_kick_queue
scsi: ufs: remove unnecessary query(DM) UPIU trace
scsi: qla2xxx: Fix issue reported by static checker for qla2x00_els_dcmd2_sp_done()
scsi: aacraid: Spelling fix in comment
scsi: mpt3sas: Fix calltrace observed while running IO & reset
scsi: aic94xx: fix an error code in aic94xx_init()
scsi: st: remove redundant pointer STbuffer
scsi: qla2xxx: Update driver version to 10.00.00.08-k
scsi: qla2xxx: Migrate NVME N2N handling into state machine
scsi: qla2xxx: Save frame payload size from ICB
scsi: qla2xxx: Fix stalled relogin
scsi: qla2xxx: Fix race between switch cmd completion and timeout
scsi: qla2xxx: Fix Management Server NPort handle reservation logic
scsi: qla2xxx: Flush mailbox commands on chip reset
scsi: qla2xxx: Fix unintended Logout
scsi: qla2xxx: Fix session state stuck in Get Port DB
scsi: qla2xxx: Fix redundant fc_rport registration
scsi: qla2xxx: Silent erroneous message
scsi: qla2xxx: Prevent sysfs access when chip is down
scsi: qla2xxx: Add longer window for chip reset
...
Recently syzbot reported crashes in send_sigio_to_task and
send_sigurg_to_task in linux-next. Despite finding a reproducer
syzbot apparently did not bisected this or otherwise track down the
offending commit in linux-next.
I happened to see this report and examined the code because I had
recently changed these functions as part of making PIDTYPE_TGID a real
pid type so that fork would does not need to restart when receiving a
signal. By examination I see that I spotted a bug in the code
that could explain the reported crashes.
When I took Oleg's suggestion and optimized send_sigurg and send_sigio
to only send to a single task when type is PIDTYPE_PID or PIDTYPE_TGID
I failed to handle pids that no longer point to tasks. The macro
do_each_pid_task simply iterates for zero iterations. With pid_task
an explicit NULL test is needed.
Update the code to include the missing NULL test.
Fixes: 019191342f ("signal: Use PIDTYPE_TGID to clearly store where file signals will be sent")
Reported-by: syzkaller-bugs@googlegroups.com
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJbdGisAAoJEAhfPr2O5OEVmXcQAIe+wWHfAFwFDCqRGvG7yHIV
q3Rf6bqYE/IK++sSX9qrCmPC1jXsw6oXjaYqjtdplLb6GBBkNdKo6neGtARV1phB
Moj0JVmFwvnq6yMBZl4TFIZIZHIlfXIe/Ac6b9h5aiL1bGUdnukLhSnGVnoWteVv
F6BRAhH/PDIf7xUpxchEic7HbwnWDWI51ALeDWZQCv5SkNgBUBrLDumWr7b2YCWF
BcNfC14fNN1Jk4Hi9NpS7VCC4Nxvxn+D3iFsVa10Ur/A8EPApIlj+8nu/CDA8p4c
cJa8ImvaKhK0PAfciKMi5RBs/5r5BFRlrmLADkyVz8pmz5AQqjHot/TGoC22LqaN
h0T9lTKEmFAE8CYFQhcOfb6qcR6bfnM6yyVjeCrnAQ3Fw9TgESiPbPE7BpgnxoTU
i3h7a5WHYs4xjkMByJUYda1GhahPD82eJJp8l4iIQ2IMlTZbrEmLFmg13WNO3950
a+dy3HqpXUd2EYwFn9CVFzjTzWKE+63obQWIxLFWJWRirqr5XKbTmB7nPLeik6Os
kPmn8t0eKig76SYiJ2rhKgTMUB1rXjtBYtqP/f02FYN0WSZCEIs7oKMb/SoVfTlP
2rF6k2ZPTzB2ZJp3RzNRX/+/39f04i26QzeEwwQ6/8daxSna3YlKuLMvHzJ5UuFp
q28aN8hD5eXU1IEWHtvH
=ClJZ
-----END PGP SIGNATURE-----
Merge tag 'media/v4.19-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media
Pull media updates from Mauro Carvalho Chehab:
- new Socionext MN88443x ISDB-S/T demodulator driver: mn88443x
- new sensor drivers: ak7375, ov2680 and rj54n1cb0c
- an old soc-camera sensor driver converted to the V4L2 framework:
mt9v111
- a new Voice-Coil Motor (VCM) driver: dw9807-vcm
- some cleanups at cx25821, removing legacy unused code
- some improvements at ddbridge driver
- new platform driver: vicodec
- some DVB API cleanups, removing ioctls and compat code for old
out-of-tree drivers that were never merged upstream
- improvements at DVB core to support frontents that support both
Satellite and non-satellite delivery systems
- got rid of the unused VIDIOC_RESERVED V4L2 ioctl
- some cleanups/improvements at gl861 ISDB driver
- several improvements on ov772x, ov7670 and ov5640, imx274, ov5645,
and smiapp sensor drivers
- fixes at em28xx to support dual TS devices
- some cleanups at V4L2/VB2 locking logic
- some API improvements at media controller
- some cec core and drivers improvements
- some uvcvideo improvements
- some improvements at platform drivers: stm32-dcmi, rcar-vin, coda,
reneseas-ceu, imx, vsp1, venus, camss
- lots of other cleanups and fixes
* tag 'media/v4.19-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media: (406 commits)
Revert "media: vivid: shut up warnings due to a non-trivial logic"
siano: get rid of an unused return code for debugfs register
media: isp: fix a warning about a wrong struct initializer
media: radio-wl1273: fix return code for the polling routine
media: s3c-camif: fix return code for the polling routine
media: saa7164: fix return codes for the polling routine
media: exynos-gsc: fix return code if mutex was interrupted
media: mt9v111: Fix build error with no VIDEO_V4L2_SUBDEV_API
media: xc4000: get rid of uneeded casts
media: drxj: get rid of uneeded casts
media: tuner-xc2028: don't use casts for printing sizes
media: cleanup fall-through comments
media: vivid: shut up warnings due to a non-trivial logic
media: rtl28xxu: be sure that it won't go past the array size
media: mt9v111: avoid going past the buffer
media: vsp1_dl: add a description for cmdpool field
media: sta2x11: add a missing parameter description
media: v4l2-mem2mem: add descriptions to MC fields
media: i2c: fix warning in Aptina MT9V111
media: imx: shut up a false positive warning
...
Pull networking updates from David Miller:
"Highlights:
- Gustavo A. R. Silva keeps working on the implicit switch fallthru
changes.
- Support 802.11ax High-Efficiency wireless in cfg80211 et al, From
Luca Coelho.
- Re-enable ASPM in r8169, from Kai-Heng Feng.
- Add virtual XFRM interfaces, which avoids all of the limitations of
existing IPSEC tunnels. From Steffen Klassert.
- Convert GRO over to use a hash table, so that when we have many
flows active we don't traverse a long list during accumluation.
- Many new self tests for routing, TC, tunnels, etc. Too many
contributors to mention them all, but I'm really happy to keep
seeing this stuff.
- Hardware timestamping support for dpaa_eth/fsl-fman from Yangbo Lu.
- Lots of cleanups and fixes in L2TP code from Guillaume Nault.
- Add IPSEC offload support to netdevsim, from Shannon Nelson.
- Add support for slotting with non-uniform distribution to netem
packet scheduler, from Yousuk Seung.
- Add UDP GSO support to mlx5e, from Boris Pismenny.
- Support offloading of Team LAG in NFP, from John Hurley.
- Allow to configure TX queue selection based upon RX queue, from
Amritha Nambiar.
- Support ethtool ring size configuration in aquantia, from Anton
Mikaev.
- Support DSCP and flowlabel per-transport in SCTP, from Xin Long.
- Support list based batching and stack traversal of SKBs, this is
very exciting work. From Edward Cree.
- Busyloop optimizations in vhost_net, from Toshiaki Makita.
- Introduce the ETF qdisc, which allows time based transmissions. IGB
can offload this in hardware. From Vinicius Costa Gomes.
- Add parameter support to devlink, from Moshe Shemesh.
- Several multiplication and division optimizations for BPF JIT in
nfp driver, from Jiong Wang.
- Lots of prepatory work to make more of the packet scheduler layer
lockless, when possible, from Vlad Buslov.
- Add ACK filter and NAT awareness to sch_cake packet scheduler, from
Toke Høiland-Jørgensen.
- Support regions and region snapshots in devlink, from Alex Vesker.
- Allow to attach XDP programs to both HW and SW at the same time on
a given device, with initial support in nfp. From Jakub Kicinski.
- Add TLS RX offload and support in mlx5, from Ilya Lesokhin.
- Use PHYLIB in r8169 driver, from Heiner Kallweit.
- All sorts of changes to support Spectrum 2 in mlxsw driver, from
Ido Schimmel.
- PTP support in mv88e6xxx DSA driver, from Andrew Lunn.
- Make TCP_USER_TIMEOUT socket option more accurate, from Jon
Maxwell.
- Support for templates in packet scheduler classifier, from Jiri
Pirko.
- IPV6 support in RDS, from Ka-Cheong Poon.
- Native tproxy support in nf_tables, from Máté Eckl.
- Maintain IP fragment queue in an rbtree, but optimize properly for
in-order frags. From Peter Oskolkov.
- Improvde handling of ACKs on hole repairs, from Yuchung Cheng"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1996 commits)
bpf: test: fix spelling mistake "REUSEEPORT" -> "REUSEPORT"
hv/netvsc: Fix NULL dereference at single queue mode fallback
net: filter: mark expected switch fall-through
xen-netfront: fix warn message as irq device name has '/'
cxgb4: Add new T5 PCI device ids 0x50af and 0x50b0
net: dsa: mv88e6xxx: missing unlock on error path
rds: fix building with IPV6=m
inet/connection_sock: prefer _THIS_IP_ to current_text_addr
net: dsa: mv88e6xxx: bitwise vs logical bug
net: sock_diag: Fix spectre v1 gadget in __sock_diag_cmd()
ieee802154: hwsim: using right kind of iteration
net: hns3: Add vlan filter setting by ethtool command -K
net: hns3: Set tx ring' tc info when netdev is up
net: hns3: Remove tx ring BD len register in hns3_enet
net: hns3: Fix desc num set to default when setting channel
net: hns3: Fix for phy link issue when using marvell phy driver
net: hns3: Fix for information of phydev lost problem when down/up
net: hns3: Fix for command format parsing error in hclge_is_all_function_id_zero
net: hns3: Add support for serdes loopback selftest
bnxt_en: take coredump_record structure off stack
...
Move the source statements of arch-independent Kconfig files instead of
duplicating the includes in every arch/$(SRCARCH)/Kconfig.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJbdFsfAAoJED2LAQed4NsGxHsP/1tmA57OOOj8oGxO2OXhXVbr
Q0MZqCoV4bqMvK/hgCQdl9f+tp0m+j12x4xDLdVf4OqnTXMbqvPDu3uQVKvaj/k1
gHhsFA1tFgSbuJ8InltUsrPEQqbceeJsj50xHVAKijqI6LYeRPPSU7aE9obn+OzH
n2nd5sLKvMI/dqdJvW6i5KPydqTH3r3iA7D+ne/XQj0s0EMXvXUPmDT1+ijTnM4a
yfm6W5p7L/c3Ugf1Pz5PfnPl4BxBwZMfW5ie/UO8j5C6Rl0iPaOGuuHurocaaJb3
MefR/7NEAR3G8MhJyL2+70jbbwhjpqR2b5ooz1vpuulPHxjeU45BY60XIBWq1afR
ewsc12MMCYB695ieYWoHdaWgxD/jhffyRuajfpkXKIZEMgDxS03sMhdULXENVMx1
M0ZQ01g/NLWt9ti9DY3eTKB3ymOhnBa1sa77nGGUHkITq4DQKwPX1J9FP/HT6RNt
uOvzeH5kGzc7tqOlZAO0kHbwhQG1uqGcd78IYd4lgf/XfkSgDERTWjnJmnQbwr9m
3PFuST2u8eyO+8Lh1MK76TXOEkXsHMdFugPmb6SlgtMEPKGVLDPlsj52o/LFtgzl
eygfMiBFr2+ttkZ6IpNcpmQ4IztmDpz6XoMk3PqDAfUTUSYpCnq1gAEuff/eisCM
Odva1ZZaeQ7WpxhsP8rr
=gsQJ
-----END PGP SIGNATURE-----
Merge tag 'kconfig-v4.19-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
Pull Kconfig consolidation from Masahiro Yamada:
"Consolidation of Kconfig files by Christoph Hellwig.
Move the source statements of arch-independent Kconfig files instead
of duplicating the includes in every arch/$(SRCARCH)/Kconfig"
* tag 'kconfig-v4.19-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
kconfig: add a Memory Management options" menu
kconfig: move the "Executable file formats" menu to fs/Kconfig.binfmt
kconfig: use a menu in arch/Kconfig to reduce clutter
kconfig: include kernel/Kconfig.preempt from init/Kconfig
Kconfig: consolidate the "Kernel hacking" menu
kconfig: include common Kconfig files from top-level Kconfig
kconfig: remove duplicate SWAP symbol defintions
um: create a proper drivers Kconfig
um: cleanup Kconfig files
um: stop abusing KBUILD_KCONFIG
-----BEGIN PGP SIGNATURE-----
Comment: Kees Cook <kees@outflux.net>
iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAltx60wWHGtlZXNjb29r
QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJqmtD/43lXkoLGl31JsCVkVASg2K34B3
DI6uc4vwqJmGCNo9JrGGFNOqnV9n/rBg77Ymi6qDVRImbDm6t3FofaPLTVU1G9oK
3KYlv7x99j6DoFmP+U6sEvnRb5znqrmFR/t6Nx4yn7p68XY6hd7IFx8wZ6ZNt26i
eNUN67Hzg+eEuuDSf1XbVLIbeY9qqIyJvjqf6WUWXD2iAxcVPBUyxpNcsEGtmqti
9e76t/iaPq+T4V8AODthl/lfyD/xauD6g48CpIRTeDvEbkC/cBZgYS3y5frVv4+d
xve5j8aM2YYIaWzJGeE+3j+PXY6pdq/50iSPalo0SQg8ypaDpB7IRYmnBoR08dlX
GNQxNLkfqXGniJf5Khaxzta9l25yjr8nRC3mLkI42dM/FIhHWi3/wF6nJSMaKZXJ
FAZPC06VZR4Amb2pseNtRd09nYRDtqydmjsq53Y/1X0aaqgDiX1vEtothroLS4J1
Lg3VWNzKjowTA40xc0z/5H+8jxw2gUfjtJKIQvYvaNXmEEfC+BvHk9OC5ZHSizu5
PvBHNTGD2inyOpjWZishr6bgpeS4K/i/IDJy6UkFF1Q+z3udRbQTqWEUwwVjRnO0
IlluTw7PcEy5ctbcLT+ABjdyjtWGzp0lATx+s3YvvZ53MvM20rV6nj9CPwfk+PE1
MyqBBtSvtGvOsl9p9g==
=3N3c
-----END PGP SIGNATURE-----
Merge tag 'pstore-v4.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull pstore update from Kees Cook:
"This cycle has been very quiet for pstore: the only change is adding
awareness of the zstd compression method"
* tag 'pstore-v4.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
pstore: add zstd compression support
We must drop the lock before we can sleep in referring_call_exists().
Reported-by: Jia-Ju Bai <baijiaju1990@gmail.com>
Fixes: 045d2a6d07 ("NFSv4.1: Delay callback processing...")
Cc: stable@vger.kernel.org # v4.9+
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
The use of the inode->i_lock was converted to a mutex, but we forgot
to remove the old inode unlock/lock() pair that allowed the layout
segment to be put inside the loop.
Reported-by: Jia-Ju Bai <baijiaju1990@gmail.com>
Fixes: e824f99ada ("NFSv4: Use a mutex to protect the per-inode commit...")
Cc: stable@vger.kernel.org # v4.14+
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
A bunch of good stuff in here:
- Wire up support for qspinlock, replacing our trusty ticket lock code
- Add an IPI to flush_icache_range() to ensure that stale instructions
fetched into the pipeline are discarded along with the I-cache lines
- Support for the GCC "stackleak" plugin
- Support for restartable sequences, plus an arm64 port for the selftest
- Kexec/kdump support on systems booting with ACPI
- Rewrite of our syscall entry code in C, which allows us to zero the
GPRs on entry from userspace
- Support for chained PMU counters, allowing 64-bit event counters to be
constructed on current CPUs
- Ensure scheduler topology information is kept up-to-date with CPU
hotplug events
- Re-enable support for huge vmalloc/IO mappings now that the core code
has the correct hooks to use break-before-make sequences
- Miscellaneous, non-critical fixes and cleanups
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABCgAGBQJbbV41AAoJELescNyEwWM0WoEIALhrKtsIn6vqFlSs/w6aDuJL
cMWmFxjTaKLmIq2+cJIdFLOJ3CH80Pu9gB+nEv/k+cZdCTfUVKfRf28HTpmYWsht
bb4AhdHMC7yFW752BHk+mzJspeC8h/2Rm8wMuNVplZ3MkPrwo3vsiuJTofLhVL/y
BihlU3+5sfBvCYIsWnuEZIev+/I/s/qm1ASiqIcKSrFRZP6VTt5f9TC75vFI8seW
7yc3odKb0CArexB8yBjiPNziehctQF42doxQyL45hezLfWw4qdgHOSiwyiOMxEz9
Fwwpp8Tx33SKLNJgqoqYznGW9PhYJ7n2Kslv19uchJrEV+mds82vdDNaWRULld4=
=kQn6
-----END PGP SIGNATURE-----
Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 updates from Will Deacon:
"A bunch of good stuff in here. Worth noting is that we've pulled in
the x86/mm branch from -tip so that we can make use of the core
ioremap changes which allow us to put down huge mappings in the
vmalloc area without screwing up the TLB. Much of the positive
diffstat is because of the rseq selftest for arm64.
Summary:
- Wire up support for qspinlock, replacing our trusty ticket lock
code
- Add an IPI to flush_icache_range() to ensure that stale
instructions fetched into the pipeline are discarded along with the
I-cache lines
- Support for the GCC "stackleak" plugin
- Support for restartable sequences, plus an arm64 port for the
selftest
- Kexec/kdump support on systems booting with ACPI
- Rewrite of our syscall entry code in C, which allows us to zero the
GPRs on entry from userspace
- Support for chained PMU counters, allowing 64-bit event counters to
be constructed on current CPUs
- Ensure scheduler topology information is kept up-to-date with CPU
hotplug events
- Re-enable support for huge vmalloc/IO mappings now that the core
code has the correct hooks to use break-before-make sequences
- Miscellaneous, non-critical fixes and cleanups"
* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (90 commits)
arm64: alternative: Use true and false for boolean values
arm64: kexec: Add comment to explain use of __flush_icache_range()
arm64: sdei: Mark sdei stack helper functions as static
arm64, kaslr: export offset in VMCOREINFO ELF notes
arm64: perf: Add cap_user_time aarch64
efi/libstub: Only disable stackleak plugin for arm64
arm64: drop unused kernel_neon_begin_partial() macro
arm64: kexec: machine_kexec should call __flush_icache_range
arm64: svc: Ensure hardirq tracing is updated before return
arm64: mm: Export __sync_icache_dcache() for xen-privcmd
drivers/perf: arm-ccn: Use devm_ioremap_resource() to map memory
arm64: Add support for STACKLEAK gcc plugin
arm64: Add stack information to on_accessible_stack
drivers/perf: hisi: update the sccl_id/ccl_id when MT is supported
arm64: fix ACPI dependencies
rseq/selftests: Add support for arm64
arm64: acpi: fix alignment fault in accessing ACPI
efi/arm: map UEFI memory map even w/o runtime services enabled
efi/arm: preserve early mapping of UEFI memory map longer for BGRT
drivers: acpi: add dependency of EFI for arm64
...
Traditionally UBIFS just reported a failed assertion and moved on. The
drawback is that users will notice UBIFS bugs when it is too late, most
of the time when it is no longer about to mount. This makes bug hunting
problematic since valuable information from failing asserts is long gone
when UBIFS is dead. The other extreme, panic'ing on a failing assert is
also not worthwhile, we want users and developers give a chance to
collect as much debugging information as possible if UBIFS hits an
assert. Therefore go for the third option, switch to read-only mode when
an assert fails. That way UBIFS will not write possible bad data to the
MTD and gives users the chance to collect debugging information.
Signed-off-by: Richard Weinberger <richard@nod.at>
With having access to struct ubifs_info in ubifs_assert() we can
give more information when an assert is failing.
By using ubifs_err() we can tell which UBIFS instance failed.
Also multiple actions can be taken now.
We support:
- report: This is what UBIFS did so far, just report the failure and go
on.
- read-only: Switch to read-only mode.
- panic: shoot the kernel in the head.
Signed-off-by: Richard Weinberger <richard@nod.at>
This allows us to have more context in ubifs_assert()
and take different actions depending on the configuration.
Signed-off-by: Richard Weinberger <richard@nod.at>
We are going to pass struct ubifs_info to ubifs_assert()
but while unloading the UBIFS module we don't have the info
struct anymore.
Therefore replace the asserts by a regular WARN_ON().
Signed-off-by: Richard Weinberger <richard@nod.at>
Since commit 6da2ec5605 ("treewide: kmalloc() -> kmalloc_array()")
we use kmalloc_array() for kmalloc() that computes the length with
a multiplication.
Cc: Kees Cook <keescook@chromium.org>
Suggested-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Richard Weinberger <richard@nod.at>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Richard Weinberger <richard@nod.at>
Check whether the size is within bounds before using it.
If the size is not correct, abort and dump the bad data node.
Cc: Kees Cook <keescook@chromium.org>
Cc: Silvio Cesare <silvio.cesare@gmail.com>
Cc: stable@vger.kernel.org
Fixes: 1e51764a3c ("UBIFS: add new flash file system")
Reported-by: Silvio Cesare <silvio.cesare@gmail.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Richard Weinberger <richard@nod.at>
This reverts commit 353748a359.
It bypassed the linux-mtd review process and fixes the issue not as it
should.
Cc: Kees Cook <keescook@chromium.org>
Cc: Silvio Cesare <silvio.cesare@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Richard Weinberger <richard@nod.at>
Every single time I come across that code, I get confused
because it looks like a possible dead lock.
Help myself by adding a comment.
Signed-off-by: Richard Weinberger <richard@nod.at>
Allow to disable extended attribute support.
This aids in reliability testing, especially since some xattr
related bugs have surfaced.
Also an embedded system might not need it, so this allows for a
slightly smaller kernel (about 4KiB).
Signed-off-by: Stefan Agner <stefan@agner.ch>
Signed-off-by: Richard Weinberger <richard@nod.at>
Make use of the swap macro and remove unnecessary variable *t*. This
makes the code easier to read and maintain.
This code was detected with the help of Coccinelle.
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
Without program headers for PTI entry trampoline pages, the trampoline
virtual addresses do not map to anything.
Example before:
sudo gdb --quiet vmlinux /proc/kcore
Reading symbols from vmlinux...done.
[New process 1]
Core was generated by `BOOT_IMAGE=/boot/vmlinuz-4.16.0 root=UUID=a6096b83-b763-4101-807e-f33daff63233'.
#0 0x0000000000000000 in irq_stack_union ()
(gdb) x /21ib 0xfffffe0000006000
0xfffffe0000006000: Cannot access memory at address 0xfffffe0000006000
(gdb) quit
After:
sudo gdb --quiet vmlinux /proc/kcore
[sudo] password for ahunter:
Reading symbols from vmlinux...done.
[New process 1]
Core was generated by `BOOT_IMAGE=/boot/vmlinuz-4.16.0-fix-4-00005-gd6e65a8b4072 root=UUID=a6096b83-b7'.
#0 0x0000000000000000 in irq_stack_union ()
(gdb) x /21ib 0xfffffe0000006000
0xfffffe0000006000: swapgs
0xfffffe0000006003: mov %rsp,-0x3e12(%rip) # 0xfffffe00000021f8
0xfffffe000000600a: xchg %ax,%ax
0xfffffe000000600c: mov %cr3,%rsp
0xfffffe000000600f: bts $0x3f,%rsp
0xfffffe0000006014: and $0xffffffffffffe7ff,%rsp
0xfffffe000000601b: mov %rsp,%cr3
0xfffffe000000601e: mov -0x3019(%rip),%rsp # 0xfffffe000000300c
0xfffffe0000006025: pushq $0x2b
0xfffffe0000006027: pushq -0x3e35(%rip) # 0xfffffe00000021f8
0xfffffe000000602d: push %r11
0xfffffe000000602f: pushq $0x33
0xfffffe0000006031: push %rcx
0xfffffe0000006032: push %rdi
0xfffffe0000006033: mov $0xffffffff91a00010,%rdi
0xfffffe000000603a: callq 0xfffffe0000006046
0xfffffe000000603f: pause
0xfffffe0000006041: lfence
0xfffffe0000006044: jmp 0xfffffe000000603f
0xfffffe0000006046: mov %rdi,(%rsp)
0xfffffe000000604a: retq
(gdb) quit
In addition, entry trampolines all map to the same page. Represent that
by giving the corresponding program headers in kcore the same offset.
This has the benefit that, when perf tools uses /proc/kcore as a source
for kernel object code, samples from different CPU trampolines are
aggregated together. Note, such aggregation is normal for profiling
i.e. people want to profile the object code, not every different virtual
address the object code might be mapped to (across different processes
for example).
Notes by PeterZ:
This also adds the KCORE_REMAP functionality.
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Acked-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: x86@kernel.org
Link: http://lkml.kernel.org/r/1528289651-4113-4-git-send-email-adrian.hunter@intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
The tnc uses get_seconds() based timestamps to check the age of a znode,
which has two problems: on 32-bit architectures this may overflow in
2038 or 2106, and it gives incorrect information when the system time
is updated using settimeofday().
Using montonic timestamps with ktime_get_seconds() solves both thes
problems.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Richard Weinberger <richard@nod.at>
Both vfs and the on-disk inode structures can deal with fine-grained
timestamps now, so this is the last missing piece to make ubifs
y2038-safe on 32-bit architectures.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Richard Weinberger <richard@nod.at>
xattr operations can race with unlink and the following assert triggers:
UBIFS assert failed in ubifs_jnl_change_xattr at 1606 (pid 6256)
Fix this by checking i_nlink before working on the host inode.
Cc: <stable@vger.kernel.org>
Fixes: 1e51764a3c ("UBIFS: add new flash file system")
Signed-off-by: Richard Weinberger <richard@nod.at>
Allocate the buffer after we return early.
Otherwise memory is being leaked.
Cc: <stable@vger.kernel.org>
Fixes: 1e51764a3c ("UBIFS: add new flash file system")
Signed-off-by: Richard Weinberger <richard@nod.at>
In ubifs_jnl_update() we sync parent and child inodes to the flash,
in case of xattrs, the parent inode (AKA host inode) has a non-zero
data_len. Therefore we need to adjust synced_i_size too.
This issue was reported by ubifs self tests unter a xattr related work
load.
UBIFS error (ubi0:0 pid 1896): dbg_check_synced_i_size: ui_size is 4, synced_i_size is 0, but inode is clean
UBIFS error (ubi0:0 pid 1896): dbg_check_synced_i_size: i_ino 65, i_mode 0x81a4, i_size 4
Cc: <stable@vger.kernel.org>
Fixes: 1e51764a3c ("UBIFS: add new flash file system")
Signed-off-by: Richard Weinberger <richard@nod.at>
We have to account the name of the symlink and not the target length.
Fixes: ca7f85be8d ("ubifs: Add support for encrypted symlinks")
Cc: <stable@vger.kernel.org>
Signed-off-by: Richard Weinberger <richard@nod.at>
- simplify the cide by using kvasprintf (Bart Van Assche).
- fix a gcc 8 string truncation warning by making the code simpler
(Guenter Roeck).
- fix a bug in rmdir() handling (Mike Christie).
-----BEGIN PGP SIGNATURE-----
iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAltxL44LHGhjaEBsc3Qu
ZGUACgkQD55TZVIEUYMjKxAAy9M8gy87QhrkOTVLher1A3DY48Qbvx5AEHM2jnZQ
MekqGIcgAGYLlS3ifu54JJBB6QoM+UjKkvkjJiFaJbg685yFepkNkUUfRTy2tVqF
2TcsxUrhZu5gz2gm7tBll321Jz34w++PWzuVoI82YefeobYGu+XxWf0JnJQYJVL3
Q2W73AZBLwZQ1s0F5UNAFa7s+pznIpJLTMUD3iFmzoXx0Qihy7VGxcKxn6+Bzg2E
UfojBH//K9cxk+dMZEiVaA+bXY39fd7bj+yDzcF8ThR+SPJwXjP9yLDDFqM7qsrK
BnhcPYNV3Qjk5WCwvn/Lx22hFHgB44URjyvxaP6Bntp7CVk4IPVJqmqjUaov0fsZ
zHNvdjk3cy6JCOTvcUECikM3qkfgpabNUbN/QSK+nKN6JT22wGrf0ojKpSYfdBb3
nhh8lV7YFkbmVFmigN9qAw2PH/l8C8XM0P/VezRUz9gba1jW8TcLhcvL1OKs4NFb
LSS07pHfdhipm5go1kxcCLDb93Gg0MwYEFCX3xA6YoYU7bjg7uLUa8IR7dFVNdCC
uipcxY9fCfcu2G2AV6WUVT+TQNYjC8tgUgSzM3LI+t12NTJ5hB7MFqSQUmCZSKmJ
nSDjoHHFbY56H0rrelEdhzjXLNGTc8UmkIRDqd3oZk0wnvxZlq8fPRkmU7DLFEh6
tAU=
=5AGn
-----END PGP SIGNATURE-----
Merge tag 'configfs-for-4.19' of git://git.infradead.org/users/hch/configfs
Pull configfs updates from Christoph Hellwig:
- simplify the cide by using kvasprintf (Bart Van Assche)
- fix a gcc 8 string truncation warning by making the code simpler
(Guenter Roeck)
- fix a bug in rmdir() handling (Mike Christie)
* tag 'configfs-for-4.19' of git://git.infradead.org/users/hch/configfs:
configfs: fix registered group removal
configfs: replace strncpy with memcpy
configfs: use kvasprintf() instead of open-coding it
- Support 64-bit timestamps
MTD changes:
Core changes:
- Support sub-partitions
- Clarify mtd_oob_ops documentation
- Make Kconfig formatting consistent
- Fix potential overflows in mtdchar_{write,read}()
- Fallback to ->_{read,write}() when ->_{read,write}_oob() is missing
and no OOB data were requested
- Remove VLA usage in the bch lib
Driver changes:
- Use mtd_device_register() instead of mtd_device_parse_register()
where applicable
- Use proper printk format to print physical addresses in the
solutionengine driver
- Add missing mtd_set_of_node() call in the powernv driver
- Remove unneeded variables in a few drivers
- Plug the TRX part parser to the DT partition parsers logic
- Check ioremap_cache() return code in the gpio-addr-flash driver
- Stop using VMLINUX_SYMBOL_STR() in gen_probe.c
SPI NOR changes:
Core changes:
- Apply reset hacks only when reset is explicitly marked as broken in
the DT
Driver changes:
- Minor cleanup/fixes in the m25p80 driver
- Release flash_np in the nxp-spifi driver
- Add suspend/resume hooks to the atmel-quadspi driver
- Include gpio/consumer.h instead of gpio.h in the atmel-quadspi
driver
- Use %pK instead of %p in the stm32-quadspi driver
- Improve timeout handling in the cadence-quadspi driver
- Use mtd_device_register() instead of mtd_device_parse_register()
in the intel-spi driver
NAND changes:
Core changes:
- Add the SPI-NAND framework.
- Create a helper to find the best ECC configuration.
- Create NAND controller operations.
- Allocate dynamically ONFI parameters structure.
- Add defines for ONFI version bits.
- Add manufacturer fixup for ONFI parameter page.
- Add an option to specify NAND chip as a boot device.
- Add Reed-Solomon error correction algorithm.
- Better name for the controller structure.
- Remove unused caller_is_module() definition.
- Make subop helpers return unsigned values.
- Expose _notsupp() helpers for raw page accessors.
- Add default values for dynamic timings.
- Kill the chip->scan_bbt() hook.
- Rename nand_default_bbt() into nand_create_bbt().
- Start to clean the nand_chip structure.
- Remove stale prototype from rawnand.h.
Raw NAND controllers drivers changes:
- Qcom: structuring cleanup.
- Denali: use core helper to find the best ECC configuration.
- Possible build of almost all drivers by adding a dependency on
COMPILE_TEST for almost all of them in Kconfig, implies various
fixes, Kconfig cleanup, GPIO headers inclusion cleanup, and even
changes in sparc64 and ia64 architectures.
- Clean the ->probe() functions error path of a lot of drivers.
- Migrate all drivers to use nand_scan() instead of
nand_scan_ident()/nand_scan_tail() pair.
- Use mtd_device_register() where applicable to simplify the code.
- Marvell:
* Handle on-die ECC.
* Better clocks handling.
* Remove bogus comment.
* Add suspend and resume support.
- Tegra: add NAND controller driver.
- Atmel:
* Add module param to avoid using dma.
* Drop Wenyou Yang from MAINTAINERS.
- Denali: optimize timings handling.
- FSMC: Stop using chip->read_buf().
- FSL:
* Switch to SPDX license tag identifiers.
* Fix qualifiers in MXC init functions.
Raw NAND chip drivers changes:
- Micron:
* Add fixup for ONFI revision.
* Update ecc_stats.corrected.
* Make ECC activation stateful.
* Avoid enabling/disabling ECC when it can't be disabled.
* Get the actual number of bitflips.
* Allow forced on-die ECC.
* Support 8/512 on-die ECC.
* Fix on-die ECC detection logic.
- Hynix:
* Fix decoding the OOB size on H27UCG8T2BTR.
* Use ->exec_op() in hynix_nand_reg_write_op().
-----BEGIN PGP SIGNATURE-----
iQI5BAABCAAjBQJbcSYdHBxib3Jpcy5icmV6aWxsb25AYm9vdGxpbi5jb20ACgkQ
Ze02AX4ItwA85Q//ViUNiWoluaj7q3Kpv44NZZClPjVfirDtIHEfyzob7HbkrqjT
ivjNUu5cjKU2aqptR24UzIV2lwmaSGicIzvyANN5bNedHy8tb4BWXo1iTEflXc1P
lYihrRz7AeHQqdrOYAiHxq0fcqgIm2NjCDxfdtkP1HJsMQH+LgvoosOkd8RyLF8E
jV5IyreTfg10AiVS44yvnqdsYo7GDVZRgnx1dBRbdJ+WwZS05m6LSeNu0ivgfLMM
NmISrn7pzlMYnuDkzckXwVka7STFhV9befoSCmAHedqNEHtzaGDaNMLEilaC4WzV
1yvsdRGI3hDkdO4s44nP/Vs2XlVmCevuTpOOQ1dEK94Bbek+us7NEPjxOIAf22jE
5MVZR6vpNN7JXJ+nvEZjjL8X9zysqZgk2r6os35Wi2zJK7iHIrFpjlMYwTJfVPzv
JKabOQZuOUqRACbL0isTTdG0+Tzr1BxaQqFiFUteQ9QpG4nSifoHUmI+9XUCJmil
iGTOy6rEkaMd6qu3xapPtZbxGbUbvcLiJvNrZ02ZfsELDAELowz66O0PX6rSwTV9
NIWeVNXK9Bsp0oxbb8P8oJIB5K4F6vqncPBls34J7E/xrLO9nLk1VvFQrGqzuwwP
HmEdgcRwKeXrvJITZz2u0u6lrKpgdDA3FTqfGCNec51PCgb1FwxcpKF/Xz0=
=yIkF
-----END PGP SIGNATURE-----
Merge tag 'mtd/for-4.19' of git://git.infradead.org/linux-mtd
Pull mtd updates from Boris Brezillon:
"JFFS2 changes:
- Support 64-bit timestamps
MTD core changes:
- Support sub-partitions
- Clarify mtd_oob_ops documentation
- Make Kconfig formatting consistent
- Fix potential overflows in mtdchar_{write,read}()
- Fallback to ->_{read,write}() when ->_{read,write}_oob() is missing
and no OOB data were requested
- Remove VLA usage in the bch lib
MTD driver changes:
- Use mtd_device_register() instead of mtd_device_parse_register()
where applicable
- Use proper printk format to print physical addresses in the
solutionengine driver
- Add missing mtd_set_of_node() call in the powernv driver
- Remove unneeded variables in a few drivers
- Plug the TRX part parser to the DT partition parsers logic
- Check ioremap_cache() return code in the gpio-addr-flash driver
- Stop using VMLINUX_SYMBOL_STR() in gen_probe.c
SPI NOR core changes:
- Apply reset hacks only when reset is explicitly marked as broken in
the DT
SPI NOR driver changes:
- Minor cleanup/fixes in the m25p80 driver
- Release flash_np in the nxp-spifi driver
- Add suspend/resume hooks to the atmel-quadspi driver
- Include gpio/consumer.h instead of gpio.h in the atmel-quadspi
driver
- Use %pK instead of %p in the stm32-quadspi driver
- Improve timeout handling in the cadence-quadspi driver
- Use mtd_device_register() instead of mtd_device_parse_register() in
the intel-spi driver
NAND core changes:
- Add the SPI-NAND framework.
- Create a helper to find the best ECC configuration.
- Create NAND controller operations.
- Allocate dynamically ONFI parameters structure.
- Add defines for ONFI version bits.
- Add manufacturer fixup for ONFI parameter page.
- Add an option to specify NAND chip as a boot device.
- Add Reed-Solomon error correction algorithm.
- Better name for the controller structure.
- Remove unused caller_is_module() definition.
- Make subop helpers return unsigned values.
- Expose _notsupp() helpers for raw page accessors.
- Add default values for dynamic timings.
- Kill the chip->scan_bbt() hook.
- Rename nand_default_bbt() into nand_create_bbt().
- Start to clean the nand_chip structure.
- Remove stale prototype from rawnand.h.
Raw NAND controllers drivers changes:
- Qcom: structuring cleanup.
- Denali: use core helper to find the best ECC configuration.
- Possible build of almost all drivers by adding a dependency on
COMPILE_TEST for almost all of them in Kconfig, implies various
fixes, Kconfig cleanup, GPIO headers inclusion cleanup, and even
changes in sparc64 and ia64 architectures.
- Clean the ->probe() functions error path of a lot of drivers.
- Migrate all drivers to use nand_scan() instead of
nand_scan_ident()/nand_scan_tail() pair.
- Use mtd_device_register() where applicable to simplify the code.
- Marvell:
* Handle on-die ECC.
* Better clocks handling.
* Remove bogus comment.
* Add suspend and resume support.
- Tegra: add NAND controller driver.
- Atmel:
* Add module param to avoid using dma.
* Drop Wenyou Yang from MAINTAINERS.
- Denali: optimize timings handling.
- FSMC: Stop using chip->read_buf().
- FSL:
* Switch to SPDX license tag identifiers.
* Fix qualifiers in MXC init functions.
Raw NAND chip drivers changes:
- Micron:
* Add fixup for ONFI revision.
* Update ecc_stats.corrected.
* Make ECC activation stateful.
* Avoid enabling/disabling ECC when it can't be disabled.
* Get the actual number of bitflips.
* Allow forced on-die ECC.
* Support 8/512 on-die ECC.
* Fix on-die ECC detection logic.
- Hynix:
* Fix decoding the OOB size on H27UCG8T2BTR.
* Use ->exec_op() in hynix_nand_reg_write_op()"
* tag 'mtd/for-4.19' of git://git.infradead.org/linux-mtd: (188 commits)
mtd: rawnand: atmel: Select GENERIC_ALLOCATOR
MAINTAINERS: drop Wenyou Yang from Atmel NAND driver support
mtd: rawnand: allocate dynamically ONFI parameters during detection
mtd: spi-nor: only apply reset hacks to broken hardware
mtd: spi-nor: cadence-quadspi: fix timeout handling
mtd: spi-nor: atmel-quadspi: Include gpio/consumer.h instead of gpio.h
mtd: spi-nor: intel-spi: use mtd_device_register()
mtd: spi-nor: stm32-quadspi: replace "%p" with "%pK"
mtd: spi-nor: atmel-quadspi: add suspend/resume hooks
mtd: rawnand: allocate model parameter dynamically
mtd: rawnand: do not export nand_scan_[ident|tail]() anymore
mtd: rawnand: txx9ndfmc: convert driver to nand_scan()
mtd: rawnand: txx9ndfmc: clarify ECC parameters assignation
mtd: rawnand: tegra: convert driver to nand_scan()
mtd: rawnand: jz4740: convert driver to nand_scan()
mtd: rawnand: jz4740: group nand_scan_{ident, tail} calls
mtd: rawnand: jz4740: fix probe function error path
mtd: rawnand: docg4: convert driver to nand_scan()
mtd: rawnand: do not execute nand_scan_ident() if maxchips is zero
mtd: rawnand: atmel: convert driver to nand_scan()
...
generic/184 1s ... [failed, exit status 1]- output mismatch
--- tests/generic/184.out 2015-01-11 16:52:27.643681072 +0800
QA output created by 184 - silence is golden
+rm: cannot remove '/mnt/f2fs/null': Bad address
+mknod: '/mnt/f2fs/null': Bad address
+chmod: cannot access '/mnt/f2fs/null': Bad address
+./tests/generic/184: line 36: /mnt/f2fs/null: Bad address
...
F2FS-fs (zram0): access invalid blkaddr:259
EIP: f2fs_is_valid_blkaddr+0x14b/0x1b0 [f2fs]
f2fs_iget+0x927/0x1010 [f2fs]
f2fs_lookup+0x26e/0x630 [f2fs]
__lookup_slow+0xb3/0x140
lookup_slow+0x31/0x50
walk_component+0x185/0x1f0
path_lookupat+0x51/0x190
filename_lookup+0x7f/0x140
user_path_at_empty+0x36/0x40
vfs_statx+0x61/0xc0
__do_sys_stat64+0x29/0x40
sys_stat64+0x13/0x20
do_fast_syscall_32+0xaa/0x22c
entry_SYSENTER_32+0x53/0x86
In f2fs_iget(), we will check inode's first block address, if it is valid,
we will set FI_FIRST_BLOCK_WRITTEN flag in inode.
But we should only do this for regular inode, otherwise, like special
inode, i_addr[0] is used for storing device info instead of block address,
it will fail checking flow obviously.
So for non-regular inode, let's skip verifying address and setting flag.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAltwvasQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpv65EACTq5gSLnJBI6ZPr1RAHruVDnjfzO2Veitl
tUtjm0XfWmnEiwQ3dYvnyhy99xbyaG3900d9BClCTlH6xaUdSiQkDpcKG/R2F36J
5mZitYukQcpFAQJWF8YKsTTE7JPl4VglCIDqYiC4+C3rOSVi8lrKn2qp4J4MMCFn
thRg3jCcq7c5s9Eigsop1pXWQSasubkXfk55Krcp4oybKYpYRKXXf74Mj14QAbwJ
QHN3VisyAUWoBRg7UQZo1Npe2oPk6bbnJypnjf8M0M2EnlvddEkIlHob91sodka8
6p4APOEu5cbyXOBCAQsw/koff14mb8aEadqeQA68WvXfIdX9ZjfxCX0OoC3sBEXk
yqJhZ0C980AM13zIBD8ejv4uasGcPca8W+47mE5P8sRiI++5kBsFWDZPCtUBna0X
2Kh24NsmEya9XRR5vsB84dsIPQ3tLMkxg/IgQRVDaSnfJz0c/+zm54xDyKRaFT4l
5iERk2WSkm9+8jNfVmWG0edrv6nRAXjpGwFfOCPh6/LCSCi4xQRULYN7sVzsX8ZK
FRjt24HftBI8mJbh4BtweJvg+ppVe1gAk3IO3HvxAQhv29Hz+uvFYe9kL+3N8LJA
Qosr9n9O4+wKYizJcDnw+5iPqCHfAwOm9th4pyedR+R7SmNcP3yNC8AbbheNBiF5
Zolos5H+JA==
=b9ib
-----END PGP SIGNATURE-----
Merge tag 'for-4.19/block-20180812' of git://git.kernel.dk/linux-block
Pull block updates from Jens Axboe:
"First pull request for this merge window, there will also be a
followup request with some stragglers.
This pull request contains:
- Fix for a thundering heard issue in the wbt block code (Anchal
Agarwal)
- A few NVMe pull requests:
* Improved tracepoints (Keith)
* Larger inline data support for RDMA (Steve Wise)
* RDMA setup/teardown fixes (Sagi)
* Effects log suppor for NVMe target (Chaitanya Kulkarni)
* Buffered IO suppor for NVMe target (Chaitanya Kulkarni)
* TP4004 (ANA) support (Christoph)
* Various NVMe fixes
- Block io-latency controller support. Much needed support for
properly containing block devices. (Josef)
- Series improving how we handle sense information on the stack
(Kees)
- Lightnvm fixes and updates/improvements (Mathias/Javier et al)
- Zoned device support for null_blk (Matias)
- AIX partition fixes (Mauricio Faria de Oliveira)
- DIF checksum code made generic (Max Gurtovoy)
- Add support for discard in iostats (Michael Callahan / Tejun)
- Set of updates for BFQ (Paolo)
- Removal of async write support for bsg (Christoph)
- Bio page dirtying and clone fixups (Christoph)
- Set of bcache fix/changes (via Coly)
- Series improving blk-mq queue setup/teardown speed (Ming)
- Series improving merging performance on blk-mq (Ming)
- Lots of other fixes and cleanups from a slew of folks"
* tag 'for-4.19/block-20180812' of git://git.kernel.dk/linux-block: (190 commits)
blkcg: Make blkg_root_lookup() work for queues in bypass mode
bcache: fix error setting writeback_rate through sysfs interface
null_blk: add lock drop/acquire annotation
Blk-throttle: reduce tail io latency when iops limit is enforced
block: paride: pd: mark expected switch fall-throughs
block: Ensure that a request queue is dissociated from the cgroup controller
block: Introduce blk_exit_queue()
blkcg: Introduce blkg_root_lookup()
block: Remove two superfluous #include directives
blk-mq: count the hctx as active before allocating tag
block: bvec_nr_vecs() returns value for wrong slab
bcache: trivial - remove tailing backslash in macro BTREE_FLAG
bcache: make the pr_err statement used for ENOENT only in sysfs_attatch section
bcache: set max writeback rate when I/O request is idle
bcache: add code comments for bset.c
bcache: fix mistaken comments in request.c
bcache: fix mistaken code comments in bcache.h
bcache: add a comment in super.c
bcache: avoid unncessary cache prefetch bch_btree_node_get()
bcache: display rate debug parameters to 0 when writeback is not running
...
When CONFIG_F2FS_FAULT_INJECTION is disabled, we get a warning about an
unused label:
fs/f2fs/segment.c: In function '__submit_discard_cmd':
fs/f2fs/segment.c:1059:1: error: label 'submit' defined but not used [-Werror=unused-label]
This could be fixed by adding another #ifdef around it, but the more
reliable way of doing this seems to be to remove the other #ifdefs
where that is easily possible.
By defining time_to_inject() as a trivial stub, most of the checks for
CONFIG_F2FS_FAULT_INJECTION can go away. This also leads to nicer
formatting of the code.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Pointer orangefs_inode is being assigned but is never used hence it is
redundant and can be removed.
Cleans up clang warning:
warning: variable 'orangefs_inode' set but not used [-Wunused-but-set-variable]
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
Use new return type vm_fault_t for fault handler. For now,
this is just documenting that the function returns a VM_FAULT
value rather than an errno. Once all instances are converted,
vm_fault_t will become a distinct type.
See the following
commit 1c8f422059 ("mm: change return type to vm_fault_t")
Fixed checkpatch.pl warning.
Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
- Use extent maps to track pagecache page status instead of bufferhead
state.
- Refactor pagecache read and write paths to use the new iomap library
functions, which enable us to drop the old bufferhead code for
pagesize == blocksize filesystems.
- Set up parallel per-block-per-page metadata to track subpage
information that was tracked by buffer heads, which enables us to drop
the old bufferhead code for pagesize > blocksize filesystems.
- Tie a deferred ops control structure to a transaction so that we can
take advantage of an upper-level dfops without having to plumb pointer
passing through the code.
- Refactor the deferred ops code to track deferred ops as part of the
transaction structure (instead of as a separate data structure) so
that we can simplify the scoping rules around defer_ops.
- Refactor twisty delwri buffer submission code to avoid deadlocks.
- Shorten and fix indenting problems in the scrub code.
- Detect obviously bad summary counts at mount and fix them.
- Directly associate deferred ops control structure with a transaction
so that callers no longer have to manage it themselves.
- Remove a couple of IRIX-era inode macros.
- Remove the long-deprecated 'barrier' and 'nobarrier' mount options.
- Clean up the inode fork structure a bit.
- Check for bad fs summary counter values in the superblock.
- Reduce COW fork lookups during writeback.
- Refactor the deferred ops control structures into the transaction
structure, thereby eliminating the need for transaction users to
handle the deferred ops as a separate data structure.
- Add the ability to repair AG headers online.
- Fix a crash due to insufficient return value checking.
- Various fixes and cleanups.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAltwVGoACgkQ+H93GTRK
tOt3bw//WaG1mR44Oo/mhf27MaEJK74LXeViqCH4Sdk10gujClTVnl6h33ChAyEi
7BT4x1JtwM6xOh7nPsXQy/besVxWadjQcTtAz/3U2wJoFyOX2+I27SAawrmX6jfR
Hi1DxXFFK7z/8YvuZqYl3vTgxMNb7bLAUybe2sYX8q+vrQaUvl9eLQlHSaT3sxrc
/lBkog1dYmbw3yjLnWYpQtC0I6Pa3ZuG/S2vpeJ2H5MADtzrRNjuC9MHZJW7tIGm
+rCLm0agk8yFkEA84VvS5Afee3TppY/JBaYlsvG1rp3bs0fELAJFnzS4g/QDbbsX
HAKPcMICJksF4C9y0Xb7wXPz/4PKur5/OSuGXN4QtOivOEoAdWfh2PLInqAjo/Le
mO92PdkBucfVqJzfEC2q2QAnGIaJlG8txhAz87wZ1YfZDQQlJDy385Z9GQXfUpy5
/1xH7V0cze1ZBSxWSddSFg0gCtaWSerfp0CmAG3A+HWKIN6c/ZNSCrqdq0DBC99D
qOn6ThjckZWGvz/KV5xBr/KvUYOpSeEyREtgcAN008TiUaNy4nOhWV2xgLGuPY/J
ed4V2B9qVbq+l+sZyzukB8cmOXmcCey6omwJ7LqZzoTWTAtTQtM2MwhaQFUWtQG8
mCqPXJp1XyL24sn0bI1t2NuKgQcs6QEQWX3zN4DA6I+N9+sTDqo=
=2G+i
-----END PGP SIGNATURE-----
Merge tag 'xfs-4.19-merge-6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs updates from Darrick Wong:
"This is the second part of the XFS changes for 4.19.
The biggest changes are the removal of buffer heads frm XFS, a massive
reworking of the deferred transaction operations handling code, the
removal of the long defunct barrier/nobarrier mount options, and the
addition of a few more online repair functions.
Summary:
- Use extent maps to track pagecache page status instead of
bufferhead state.
- Refactor pagecache read and write paths to use the new iomap
library functions, which enable us to drop the old bufferhead code
for pagesize == blocksize filesystems.
- Set up parallel per-block-per-page metadata to track subpage
information that was tracked by buffer heads, which enables us to
drop the old bufferhead code for pagesize > blocksize filesystems.
- Tie a deferred ops control structure to a transaction so that we
can take advantage of an upper-level dfops without having to plumb
pointer passing through the code.
- Refactor the deferred ops code to track deferred ops as part of the
transaction structure (instead of as a separate data structure) so
that we can simplify the scoping rules around defer_ops.
- Refactor twisty delwri buffer submission code to avoid deadlocks.
- Shorten and fix indenting problems in the scrub code.
- Detect obviously bad summary counts at mount and fix them.
- Directly associate deferred ops control structure with a
transaction so that callers no longer have to manage it themselves.
- Remove a couple of IRIX-era inode macros.
- Remove the long-deprecated 'barrier' and 'nobarrier' mount options.
- Clean up the inode fork structure a bit.
- Check for bad fs summary counter values in the superblock.
- Reduce COW fork lookups during writeback.
- Refactor the deferred ops control structures into the transaction
structure, thereby eliminating the need for transaction users to
handle the deferred ops as a separate data structure.
- Add the ability to repair AG headers online.
- Fix a crash due to insufficient return value checking.
- Various fixes and cleanups"
* tag 'xfs-4.19-merge-6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (155 commits)
xfs: fix a null pointer dereference in xfs_bmap_extents_to_btree
xfs: remove b_last_holder & associated macros
iomap: Switch to offset_in_page for clarity
xfs: Close race between direct IO and xfs_break_layouts()
xfs: repair the AGI
xfs: repair the AGFL
xfs: repair the AGF
xfs: remove dead error handling code in xfs_dquot_disk_alloc()
xfs: use WRITE_ONCE to update if_seq
xfs: fix a comment in xfs_log_reserve
xfs: only validate summary counts on primary superblock
xfs: substitute spaces with tabs
xfs: fold dfops into the transaction
xfs: always defer agfl block frees
xfs: pass transaction to xfs_defer_add()
xfs: replace xfs_defer_ops ->dop_pending with on-stack list
xfs: cancel dfops on xfs_defer_finish() error
xfs: clean out superfluous dfops dop params/vars
xfs: drop dop param from xfs_defer_op_type ->finish_item() callback
xfs: automatic dfops inode relogging
...
In commit 9dc55f1389 ("iomap: add support for sub-pagesize buffered
I/O without buffer heads") we moved the initialization of poff (it's
computed from pos) into a separate helper function. Inline data only
ever deals with pos == 0, hence the WARN_ON_ONCE, but now we're testing
an uninitialized variable.
Therefore, change the test to check the parameter directly.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Check the values we read in from the AG headers when calculating the
block reservations for a repair transaction. If they're obviously
wrong, substitute worst case assumptions (rather than ENOSPC on a bogus
reservation request).
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
more likely to be updated as we add new features to ext4. Add 64-bit
timestamp support to ext4's superblock fields. And the usual bug
fixes and cleanups, including a Spectre gadget fixup and some
hardening against maliciously corrupted file systems.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAltx3CAACgkQ8vlZVpUN
gaOfkAgApkNU46yPWWFqJpeTR76Go4zW00x9Piaita+KpQcTCUokK0BDXXeydOEo
zhtNjCRl00vbhnYJ90T75Y3ce9oGQDkWpXHFZBdl+kmMBaIHQJYEECCLR2zrvxVk
LhsixvLozQd/VFenwzmXLqYshiZjhjBZOKo7/xsoy4waer1vNidA9/SR3hGWTMOk
9IsEVqw1h6pZt7ZB/leZmZ348H9QZZbYOUWl2ArvdmNeg2t8pWH0MD0bos+mj5lW
xihyJEIS/OjWZRJhM2WEPMO7U4jD/aX3zmDMGePp/vC5KSqA4FGwk671xIyD+LXR
Hd8Rn8chRe4/iyhBqHpJNkqHQRUV6g==
=SA3p
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 updates from Ted Ts'o:
- Convert content from the ext4 wiki to Documentation rst files so it
is more likely to be updated as we add new features to ext4.
- Add 64-bit timestamp support to ext4's superblock fields.
- ... and the usual bug fixes and cleanups, including a Spectre gadget
fixup and some hardening against maliciously corrupted file systems.
* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (34 commits)
ext4: remove unneeded variable "err" in ext4_mb_release_inode_pa()
ext4: improve code readability in ext4_iget()
ext4: fix spectre gadget in ext4_mb_regular_allocator()
ext4: check for NUL characters in extended attribute's name
ext4: use ext4_warning() for sb_getblk failure
ext4: fix race when setting the bitmap corrupted flag
ext4: reset error code in ext4_find_entry in fallback
ext4: handle layout changes to pinned DAX mappings
dax: dax_layout_busy_page() warn on !exceptional
docs: fix up the obviously obsolete bits in the new ext4 documentation
docs: add new ext4 superblock time extension fields
docs: create filesystem internal section
ext4: use swap macro in mext_page_double_lock
ext4: check allocation failure when duplicating "data" in ext4_remount()
ext4: fix warning message in ext4_enable_quotas()
ext4: super: extend timestamps to 40 bits
jbd2: replace current_kernel_time64 with ktime equivalent
ext4: use timespec64 for all inode times
ext4: use ktime_get_real_seconds for i_dtime
ext4: use 64-bit timestamps for mmp_time
...
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAltwwVkACgkQiiy9cAdy
T1H06gv/d+b52w2X05YlLBB1GboJjv1dl3sVrFisJrMrCCr6Hotk+4+RtC8CJHh6
/9Joq6iY8B/XLl7HrWr61vJhn8vlWOGhOQaUmHCBGeiIMX93RbaP8KvkPfHAKQiJ
T6+7v70z/WL7mUQA2TEQ3Iz6kpCuP2y/XT6eQglTFXasqsWHy08TbsbnmP9yX2i4
E1B6zMwpn6m4PFxEURg14eBJTrQomb/UBSLHNwxwwOQjnKzsmeH3pyiFCJSPJNFF
Xv/lyBibgvQAsuWtLTi82fqei+c60as2LVrsFLdoWXeD6JMlx1O+SHSz0NNJL/Lk
oxMNx9NI+LsYmekIBHgoX01PzVgCpcn6ThfoQv3JLHtqiZuk3Wlai7bPvJMIApoM
8fgHGR9xvhRCfHgYvxx4MmP3e7xewajSe7fh8/y8BqreO+fLeoyqho5G9zDpn6bU
wGFPtVYW/fgTEM5E3NOjsuC1zE35bjwU7U40jqjGHP1VsTs3hkLFy9fZNNq4EP1M
ytfJPlTG
=gzZK
-----END PGP SIGNATURE-----
Merge tag '4.19-smb3' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs updates from Steve French:
"smb3/cifs fixes (including 8 for stable).
Other improvements include:
- improved tracing, improved stats
- snapshots (previous version mounts work now over SMB3)
- performance (compounding enabled for statfs, ~40% faster).
- security (make it possible to build cifs.ko with insecure vers=1.0
disabled in Kconfig)"
* tag '4.19-smb3' of git://git.samba.org/sfrench/cifs-2.6: (43 commits)
smb3: create smb3 equivalent alias for cifs pseudo-xattrs
smb3: allow previous versions to be mounted with snapshot= mount parm
cifs: don't show domain= in mount output when domain is empty
cifs: add missing support for ACLs in SMB 3.11
smb3: enumerating snapshots was leaving part of the data off end
cifs: update smb2_queryfs() to use compounding
cifs: update receive_encrypted_standard to handle compounded responses
cifs: create SMB2_open_init()/SMB2_open_free() helpers.
cifs: add SMB2_query_info_[init|free]()
cifs: add SMB2_close_init()/SMB2_close_free()
smb3: display stats counters for number of slow commands
CIFS: fix uninitialized ptr deref in smb2 signing
smb3: Do not send SMB3 SET_INFO if nothing changed
smb3: fix minor debug output for CONFIG_CIFS_STATS
smb3: add tracepoint for slow responses
cifs: add compound_send_recv()
cifs: make smb_send_rqst take an array of requests
cifs: update init_sg, crypt_message to take an array of rqst
smb3: update readme to correct information about /proc/fs/cifs/Stats
smb3: fix reset of bytes read and written stats
...
Pull fs iomap refactoring from Darrick Wong:
"This is the first part of the XFS changes for 4.19.
Christoph and Andreas coordinated some refactoring work on the iomap
code in preparation for removing buffer heads from XFS and porting
gfs2 to iomap. I'm sending this small pull request ahead of the main
XFS merge to avoid holding up gfs2 unnecessarily"
* 'iomap-4.19-merge' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
iomap: add inline data support to iomap_readpage_actor
iomap: support direct I/O to inline data
iomap: refactor iomap_dio_actor
iomap: add initial support for writes without buffer heads
iomap: add an iomap-based readpage and readpages implementation
iomap: add private pointer to struct iomap
iomap: add a page_done callback
iomap: generic inline data handling
iomap: complete partial direct I/O writes synchronously
iomap: mark newly allocated buffer heads as new
fs: factor out a __generic_write_end helper
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAltxe7QACgkQxWXV+ddt
WDswMA//QlRO+Ln5CH+RlT4fyf1RQUQZblWss2zxrmlo1GRI3Ljf2DNsBE3rD7P4
NSiXfHmgkdjcQP6poPLJwHxwkNd4NFXglYg64wWO10RjHGhKglmH6ztU88wsPfr2
2RZv271/NvYIEkEi6kdyy8ilKeWMshOfyj3+PaeapQn67uJfimyiUvDgUgbvwH3c
yj0nVRLP1C7snNj4Atti/rjXMhG+m1UWfjRkZsmqlBp52k2UAcrtiwQK+DS5b9mL
aWLSaGmIcJtSMkNJPQBST9GTWbJfKTpceoCzkT0o3irvQpN2e2flAJ4ireL8q4mN
MvqJ7giPBFHNDcHEzN6VERvsaA1Rx9Vq20ieQl8JAMd4p/bi5ehN3ww+9vau5zCw
Pc8WeKEILKrLYEAgHOnUO1wxHw994Iv5CA26roTQ0HNXQJjyEZ4m40Ch6LzmfKPm
WKcHX14Uw22GKaFEXHTOpRZ0U0d1cMTcn5zaAajGsB9LwcaiLM+OiFSPtDkwUOB9
QGJHklZVXAD1IH9HFPuq85uUtXTLXbxsw1g8phEJGbmaVxxCOAUAXwEk3qxuZNbz
CHL3G5+l3JEXxfoJSbDW60kr8xic7teqQDszqqP2qlqtP15ty2xc9d5Q8MZajSTZ
H1z9+0gfjYYHrGuAp69MtCbdQhhDSqLyivjJJm0HBaKfVNGW2Xg=
=jBaz
-----END PGP SIGNATURE-----
Merge tag 'for-4.19-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"Mostly fixes and cleanups, nothing big, though the notable thing is
the inserted/deleted lines delta -1124.
User visible changes:
- allow defrag on opened read-only files that have rw permissions;
similar to what dedupe will allow on such files
Core changes:
- tree checker improvements, reported by fuzzing:
* more checks for: block group items, essential trees
* chunk type validation
* mount time cross-checks that physical and logical chunks match
* switch more error codes to EUCLEAN aka EFSCORRUPTED
Fixes:
- fsync corner case fixes
- fix send failure when root has deleted files still open
- send, fix incorrect file layout after hole punching beyond eof
- fix races between mount and deice scan ioctl, found by fuzzing
- fix deadlock when delayed iput is called from writeback on the same
inode; rare but has been observed in practice, also removes code
- fix pinned byte accounting, using the right percpu helpers; this
should avoid some write IO inefficiency during low space conditions
- don't remove block group that still has pinned bytes
- reset on-disk device stats value after replace, otherwise this
would report stale values for the new device
Cleanups:
- time64_t/timespec64 cleanups
- remove remaining dead code in scrub handling NOCOW extents after
disabling it in previous cycle
- simplify fsync regarding ordered extents logic and remove all the
related code
- remove redundant arguments in order to reduce stack space
consumption
- remove support for V0 type of extents, not in use since 2.6.30
- remove several unused structure members
- fewer indirect function calls by inlining some callbacks
- qgroup rescan timing fixes
- vfs: iget cleanups"
* tag 'for-4.19-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (182 commits)
btrfs: revert fs_devices state on error of btrfs_init_new_device
btrfs: Exit gracefully when chunk map cannot be inserted to the tree
btrfs: Introduce mount time chunk <-> dev extent mapping check
btrfs: Verify that every chunk has corresponding block group at mount time
btrfs: Check that each block group has corresponding chunk at mount time
Btrfs: send, fix incorrect file layout after hole punching beyond eof
btrfs: Use wrapper macro for rcu string to remove duplicate code
btrfs: simplify btrfs_iget
btrfs: lift make_bad_inode into btrfs_iget
btrfs: simplify IS_ERR/PTR_ERR checks
btrfs: btrfs_iget never returns an is_bad_inode inode
btrfs: replace: Reset on-disk dev stats value after replace
btrfs: extent-tree: Remove unused __btrfs_free_block_rsv
btrfs: backref: Use ERR_CAST to return error code
btrfs: Remove redundant btrfs_release_path from btrfs_unlink_subvol
btrfs: Remove root parameter from btrfs_unlink_subvol
btrfs: Remove fs_info from btrfs_add_root_ref
btrfs: Remove fs_info from btrfs_del_root_ref
btrfs: Remove fs_info from btrfs_del_root
btrfs: Remove fs_info from btrfs_delete_delayed_dir_index
...
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJbcDltAAoJEAAOaEEZVoIV9YUP/ioYw3+kIXwRa05ec8Twbn6m
O+1HOX9UAuDWX+97P2kCJSqi9a/TkpfLQxoGxZowDv97ZkVeHdzVaJVN/0sbH1uP
Oj86b2Y5zGyrLk3D4c5VcyHkvogr16DUvhYeRtcjPyufSaKqefftCeetTntRvr4Z
rZ8HLKYxa4AZlP/FkgVZ/S6PmvG7tH8rVLMFCfY+rASbSU4gBxmrS0nB2UgU4ycb
b2SVZo4Zxycw0KiP14T5AgN4puid7VEofFezBpMKjNPjCUk1wJ3mtUyLeS8jRUEx
M6qnl7DU4BWPRtGxiXNebN85M8iaJnlaeQglowklrubrFtqnVeDQM0rz42NiY2/S
2jk6q8b9XKvoDEev2VeKfmFztu4gGXypNR6xAolvgBnUZziKKFUJutZwJe95pZV1
kO6CvdchuCQdLMLdHPji2kk2AA8Go5webONfioFhoWYfeSvgjRELmOFTDjP22eNv
s1/vfCsjuU6vdyaXXXZpN/7VMenoNegSQUXFoxhmHO7BtNuTqk+ZVFNgaWsOrXcx
I0ZIjCAmOsU+DKLn+DTT8kDtU3ihZz/OCXXZJi9JPeLFZ6gSghDRnFThwYqcZvA+
syIn1XGGtn15Q7A5FZNvwGqYir1GtyOslrxxIKxGlhleovhT94HuKXkL1T+k74zl
FXItKgRs0TPQfTsI8q+C
=IMcz
-----END PGP SIGNATURE-----
Merge tag 'locks-v4.19-1' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux
Pull file locking updates from Jeff Layton:
"Just a couple of patches from Konstantin to fix /proc/locks when the
process that set the lock has exited, and a new tracepoint for the
flock() codepath. Also threw in mailmap entries for my addresses and a
comment cleanup"
* tag 'locks-v4.19-1' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
locks: remove misleading obsolete comment
mailmap: remap some of my email addresses to kernel.org address
locks: add tracepoint in flock codepath
fs/lock: show locks taken by processes from another pidns
fs/lock: skip lock owner pid translation in case we are in init_pid_ns
Pull misc vfs updates from Al Viro:
"Misc cleanups from various folks all over the place
I expected more fs/dcache.c cleanups this cycle, so that went into a
separate branch. Said cleanups have missed the window, so in the
hindsight it could've gone into work.misc instead. Decided not to
cherry-pick, thus the 'work.dcache' branch"
* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fs: dcache: Use true and false for boolean values
fold generic_readlink() into its only caller
fs: shave 8 bytes off of struct inode
fs: Add more kernel-doc to the produced documentation
fs: Fix attr.c kernel-doc
removed extra extern file_fdatawait_range
* 'work.dcache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
kill dentry_update_name_case()
Pull vfs aio updates from Al Viro:
"Christoph's aio poll, saner this time around.
This time it's pretty much local to fs/aio.c. Hopefully race-free..."
* 'work.aio' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
aio: allow direct aio poll comletions for keyed wakeups
aio: implement IOCB_CMD_POLL
aio: add a iocb refcount
timerfd: add support for keyed wakeups
Pull vfs lookup() updates from Al Viro:
"More conversions of ->lookup() to d_splice_alias().
Should be reasonably complete now - the only leftovers are in ceph"
* 'work.lookup' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
afs_try_auto_mntpt(): return NULL instead of ERR_PTR(-ENOENT)
afs_lookup(): switch to d_splice_alias()
afs: switch dynroot lookups to d_splice_alias()
hpfs: fix an inode leak in lookup, switch to d_splice_alias()
hostfs_lookup: switch to d_splice_alias()
Pull vfs icache updates from Al Viro:
- NFS mkdir/open_by_handle race fix
- analogous solution for FUSE, replacing the one currently in mainline
- new primitive to be used when discarding halfway set up inodes on
failed object creation; gives sane warranties re icache lookups not
returning such doomed by still not freed inodes. A bunch of
filesystems switched to that animal.
- Miklos' fix for last cycle regression in iget5_locked(); -stable will
need a slightly different variant, unfortunately.
- misc bits and pieces around things icache-related (in adfs and jfs).
* 'work.mkdir' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
jfs: don't bother with make_bad_inode() in ialloc()
adfs: don't put inodes into icache
new helper: inode_fake_hash()
vfs: don't evict uninitialized inode
jfs: switch to discard_new_inode()
ext2: make sure that partially set up inodes won't be returned by ext2_iget()
udf: switch to discard_new_inode()
ufs: switch to discard_new_inode()
btrfs: switch to discard_new_inode()
new primitive: discard_new_inode()
kill d_instantiate_no_diralias()
nfs_instantiate(): prevent multiple aliases for directory inode
Pull vfs open-related updates from Al Viro:
- "do we need fput() or put_filp()" rules are gone - it's always fput()
now. We keep track of that state where it belongs - in ->f_mode.
- int *opened mess killed - in finish_open(), in ->atomic_open()
instances and in fs/namei.c code around do_last()/lookup_open()/atomic_open().
- alloc_file() wrappers with saner calling conventions are introduced
(alloc_file_clone() and alloc_file_pseudo()); callers converted, with
much simplification.
- while we are at it, saner calling conventions for path_init() and
link_path_walk(), simplifying things inside fs/namei.c (both on
open-related paths and elsewhere).
* 'work.open3' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (40 commits)
few more cleanups of link_path_walk() callers
allow link_path_walk() to take ERR_PTR()
make path_init() unconditionally paired with terminate_walk()
document alloc_file() changes
make alloc_file() static
do_shmat(): grab shp->shm_file earlier, switch to alloc_file_clone()
new helper: alloc_file_clone()
create_pipe_files(): switch the first allocation to alloc_file_pseudo()
anon_inode_getfile(): switch to alloc_file_pseudo()
hugetlb_file_setup(): switch to alloc_file_pseudo()
ocxlflash_getfile(): switch to alloc_file_pseudo()
cxl_getfile(): switch to alloc_file_pseudo()
... and switch shmem_file_setup() to alloc_file_pseudo()
__shmem_file_setup(): reorder allocations
new wrapper: alloc_file_pseudo()
kill FILE_{CREATED,OPENED}
switch atomic_open() and lookup_open() to returning 0 in all success cases
document ->atomic_open() changes
->atomic_open(): return 0 in all success cases
get rid of 'opened' in path_openat() and the helpers downstream
...
An overview of the general architecture changes:
- Massive DMA ops refactoring from Christoph Hellwig (huzzah for
deleting crufty code!).
- We introduce NT_MIPS_DSP & NT_MIPS_FP_MODE ELF notes & corresponding
regsets to expose DSP ASE & floating point mode state respectively,
both for live debugging & core dumps.
- We better optimize our code by hard-coding cpu_has_* macros at
compile time where their values are known due to the ISA revision
that the kernel build is targeting.
- The EJTAG exception handler now better handles SMP systems, where it
was previously possible for CPUs to clobber a register value saved
by another CPU.
- Our implementation of memset() gained a couple of fixes for MIPSr6
systems to return correct values in some cases where stores fault.
- We now implement ioremap_wc() using the uncached-accelerated cache
coherency attribute where supported, which is detected during boot,
and fall back to plain uncached access where necessary. The
MIPS-specific (and unused in tree) ioremap_uncached_accelerated() &
ioremap_cacheable_cow() are removed.
- The prctl(PR_SET_FP_MODE, ...) syscall is better supported for SMP
systems by reworking the way we ensure remote CPUs that may be
running threads within the affected process switch mode.
- Systems using the MIPS Coherence Manager will now set the
MIPS_IC_SNOOPS_REMOTE flag to avoid some unnecessary cache
maintenance overhead when flushing the icache.
- A few fixes were made for building with clang/LLVM, which
now sucessfully builds kernels for many of our platforms.
- Miscellaneous cleanups all over.
And some platform-specific changes:
- ar7 gained stubs for a few clock API functions to fix build failures
for some drivers.
- ath79 gained support for a few new SoCs, a few fixes & better
gpio-keys support.
- Ci20 now exposes its SPI bus using the spi-gpio driver.
- The generic platform can now auto-detect a suitable value for
PHYS_OFFSET based upon the memory map described by the device tree,
allowing us to avoid wasting memory on page book-keeping for systems
where RAM starts at a non-zero physical address.
- Ingenic systems using the jz4740 platform code now link their
vmlinuz higher to allow for kernels of a realistic size.
- Loongson32 now builds the kernel targeting MIPSr1 rather than MIPSr2
to avoid CPU errata.
- Loongson64 gains a couple of fixes, a workaround for a write
buffering issue & support for the Loongson 3A R3.1 CPU.
- Malta now uses the piix4-poweroff driver to handle powering down.
- Microsemi Ocelot gained support for its SPI bus & NOR flash, its
second MDIO bus and can now be supported by a FIT/.itb image.
- Octeon saw a bunch of header cleanups which remove a lot of
duplicate or unused code.
-----BEGIN PGP SIGNATURE-----
iIsEABYIADMWIQRgLjeFAZEXQzy86/s+p5+stXUA3QUCW3G6JxUccGF1bC5idXJ0
b25AbWlwcy5jb20ACgkQPqefrLV1AN0n/gD/Rpdgay31G/4eTTKBmBrcaju6Shjt
/2Iu6WC5Sj4hDHUBAJSbuI+B9YjcNsjekBYxB/LLD7ImcLBl6nLMIvKmXLAL
=cUiF
-----END PGP SIGNATURE-----
Merge tag 'mips_4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux
Pull MIPS updates from Paul Burton:
"Here are the main MIPS changes for 4.19.
An overview of the general architecture changes:
- Massive DMA ops refactoring from Christoph Hellwig (huzzah for
deleting crufty code!).
- We introduce NT_MIPS_DSP & NT_MIPS_FP_MODE ELF notes &
corresponding regsets to expose DSP ASE & floating point mode state
respectively, both for live debugging & core dumps.
- We better optimize our code by hard-coding cpu_has_* macros at
compile time where their values are known due to the ISA revision
that the kernel build is targeting.
- The EJTAG exception handler now better handles SMP systems, where
it was previously possible for CPUs to clobber a register value
saved by another CPU.
- Our implementation of memset() gained a couple of fixes for MIPSr6
systems to return correct values in some cases where stores fault.
- We now implement ioremap_wc() using the uncached-accelerated cache
coherency attribute where supported, which is detected during boot,
and fall back to plain uncached access where necessary. The
MIPS-specific (and unused in tree) ioremap_uncached_accelerated() &
ioremap_cacheable_cow() are removed.
- The prctl(PR_SET_FP_MODE, ...) syscall is better supported for SMP
systems by reworking the way we ensure remote CPUs that may be
running threads within the affected process switch mode.
- Systems using the MIPS Coherence Manager will now set the
MIPS_IC_SNOOPS_REMOTE flag to avoid some unnecessary cache
maintenance overhead when flushing the icache.
- A few fixes were made for building with clang/LLVM, which now
sucessfully builds kernels for many of our platforms.
- Miscellaneous cleanups all over.
And some platform-specific changes:
- ar7 gained stubs for a few clock API functions to fix build
failures for some drivers.
- ath79 gained support for a few new SoCs, a few fixes & better
gpio-keys support.
- Ci20 now exposes its SPI bus using the spi-gpio driver.
- The generic platform can now auto-detect a suitable value for
PHYS_OFFSET based upon the memory map described by the device tree,
allowing us to avoid wasting memory on page book-keeping for
systems where RAM starts at a non-zero physical address.
- Ingenic systems using the jz4740 platform code now link their
vmlinuz higher to allow for kernels of a realistic size.
- Loongson32 now builds the kernel targeting MIPSr1 rather than
MIPSr2 to avoid CPU errata.
- Loongson64 gains a couple of fixes, a workaround for a write
buffering issue & support for the Loongson 3A R3.1 CPU.
- Malta now uses the piix4-poweroff driver to handle powering down.
- Microsemi Ocelot gained support for its SPI bus & NOR flash, its
second MDIO bus and can now be supported by a FIT/.itb image.
- Octeon saw a bunch of header cleanups which remove a lot of
duplicate or unused code"
* tag 'mips_4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux: (123 commits)
MIPS: Remove remnants of UASM_ISA
MIPS: netlogic: xlr: Remove erroneous check in nlm_fmn_send()
MIPS: VDSO: Force link endianness
MIPS: Always specify -EB or -EL when using clang
MIPS: Use dins to simplify __write_64bit_c0_split()
MIPS: Use read-write output operand in __write_64bit_c0_split()
MIPS: Avoid using array as parameter to write_c0_kpgd()
MIPS: vdso: Allow clang's --target flag in VDSO cflags
MIPS: genvdso: Remove GOT checks
MIPS: Remove obsolete MIPS checks for DST node "chosen@0"
MIPS: generic: Remove input symbols from defconfig
MIPS: Delete unused code in linux32.c
MIPS: Remove unused sys_32_mmap2
MIPS: Remove nabi_no_regargs
mips: dts: mscc: enable spi and NOR flash support on ocelot PCB123
mips: dts: mscc: Add spi on Ocelot
MIPS: Loongson: Merge load addresses
MIPS: Loongson: Set Loongson32 to MIPS32R1
MIPS: mscc: ocelot: add interrupt controller properties to GPIO controller
MIPS: generic: Select MIPS_AUTO_PFN_OFFSET
...
The back channel size is allowed to be 1 or greater.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Catch the ERESTARTSYS error so that it can be processed by the callers.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Mark the destination state to indicate a server-side copy is
happening. On detecting a reboot and recovering open state check
if any state is engaged in a server-side copy, if so, find the
copy and mark it and then signal the waiting thread. Upon wakeup,
if copy was marked then propage EAGAIN to the nfsd_copy_file_range
and restart the copy from scratch.
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Pull timer updates from Thomas Gleixner:
"The timers departement more or less proudly presents:
- More Y2038 timekeeping work mostly in the core code. The work is
slowly, but steadily targeting the actuall syscalls.
- Enhanced timekeeping suspend/resume support by utilizing
clocksources which do not stop during suspend, but are otherwise
not the main timekeeping clocksources.
- Make NTP adjustmets more accurate and immediate when the frequency
is set directly and not incrementally.
- Sanitize the overrung handing of posix timers
- A new timer driver for Mediatek SoCs
- The usual pile of fixes and updates all over the place"
* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (32 commits)
clockevents: Warn if cpu_all_mask is used as cpumask
tick/broadcast-hrtimer: Use cpu_possible_mask for ce_broadcast_hrtimer
clocksource/drivers/arm_arch_timer: Fix bogus cpu_all_mask usage
clocksource: ti-32k: Remove CLOCK_SOURCE_SUSPEND_NONSTOP flag
timers: Clear timer_base::must_forward_clk with timer_base::lock held
clocksource/drivers/sprd: Register one always-on timer to compensate suspend time
clocksource/drivers/timer-mediatek: Add support for system timer
clocksource/drivers/timer-mediatek: Convert the driver to timer-of
clocksource/drivers/timer-mediatek: Use specific prefix for GPT
clocksource/drivers/timer-mediatek: Rename mtk_timer to timer-mediatek
clocksource/drivers/timer-mediatek: Add system timer bindings
clocksource/drivers: Set clockevent device cpumask to cpu_possible_mask
time: Introduce one suspend clocksource to compensate the suspend time
time: Fix extra sleeptime injection when suspend fails
timekeeping/ntp: Constify some function arguments
ntp: Use kstrtos64 for s64 variable
ntp: Remove redundant arguments
timer: Fix coding style
ktime: Provide typesafe ktime_to_ns()
hrtimer: Improve kernel message printing
...
Pull locking/atomics update from Thomas Gleixner:
"The locking, atomics and memory model brains delivered:
- A larger update to the atomics code which reworks the ordering
barriers, consolidates the atomic primitives, provides the new
atomic64_fetch_add_unless() primitive and cleans up the include
hell.
- Simplify cmpxchg() instrumentation and add instrumentation for
xchg() and cmpxchg_double().
- Updates to the memory model and documentation"
* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (48 commits)
locking/atomics: Rework ordering barriers
locking/atomics: Instrument cmpxchg_double*()
locking/atomics: Instrument xchg()
locking/atomics: Simplify cmpxchg() instrumentation
locking/atomics/x86: Reduce arch_cmpxchg64*() instrumentation
tools/memory-model: Rename litmus tests to comply to norm7
tools/memory-model/Documentation: Fix typo, smb->smp
sched/Documentation: Update wake_up() & co. memory-barrier guarantees
locking/spinlock, sched/core: Clarify requirements for smp_mb__after_spinlock()
sched/core: Use smp_mb() in wake_woken_function()
tools/memory-model: Add informal LKMM documentation to MAINTAINERS
locking/atomics/Documentation: Describe atomic_set() as a write operation
tools/memory-model: Make scripts executable
tools/memory-model: Remove ACCESS_ONCE() from model
tools/memory-model: Remove ACCESS_ONCE() from recipes
locking/memory-barriers.txt/kokr: Update Korean translation to fix broken DMA vs. MMIO ordering example
MAINTAINERS: Add Daniel Lustig as an LKMM reviewer
tools/memory-model: Fix ISA2+pooncelock+pooncelock+pombonce name
tools/memory-model: Add litmus test for full multicopy atomicity
locking/refcount: Always allow checked forms
...
Previously, once fault injection is on, by default, all kind of faults
will be injected to f2fs, if we want to trigger single or specified
combined type during the test, we need to configure sysfs entry, it will
be a little inconvenient to integrate sysfs configuring into testsuit,
such as xfstest.
So this patch introduces a new mount option 'fault_type' to assist old
option 'fault_injection', with these two mount options, we can specify
any fault rate/type at mount-time.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
As Dan Carpenter reported:
The patch 20ee438232: "f2fs: issue small discard by LBA order" from
Jul 8, 2018, leads to the following Smatch warning:
fs/f2fs/segment.c:1277 __issue_discard_cmd_orderly()
warn: 'dc' was already freed.
See also:
fs/f2fs/segment.c:2550 __issue_discard_cmd_range() warn: 'dc' was already freed.
In order to fix this issue, let's get error from __submit_discard_cmd(),
and release current discard command after we referenced next one.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch adds to support discard submission error injection for testing
error handling of __submit_discard_cmd().
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Some devices has small max_{hw,}discard_sectors, so that in
__blkdev_issue_discard(), one big size discard bio can be split
into multiple small size discard bios, result in heavy load in IO
scheduler and device, which can hang other sync IO for long time.
Now, f2fs is trying to control discard commands more elaboratively,
in order to make less conflict in between discard IO and user IO
to enhance application's performance, so in this patch, we will
split discard bio in f2fs in prior to in block layer to reduce
issuing multiple discard bios in a short time.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
generic/260 reported below error:
[+] Default length with start set (should succeed)
[+] Length beyond the end of fs (should succeed)
[+] Length beyond the end of fs with start set (should succeed)
+./tests/generic/260: line 94: [: 18446744073709551615: integer expression expected
+./tests/generic/260: line 104: [: 18446744073709551615: integer expression expected
Test done
...
In f2fs_trim_fs(), if there is no discard being trimmed, we need to correct
range->len before return.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Introduce nat_list_lock to protect nm_i->nat_entries list, and manage
it as a LRU list, refresh location for therein recent accessed entries
in the list.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Thread A Background GC
- f2fs_setattr isize to 0
- truncate_setsize
- gc_data_segment
- f2fs_get_read_data_page page #0
- set_page_dirty
- set_cold_data
- f2fs_truncate
- f2fs_setattr isize to 4k
- read 4k <--- hit data in cached page #0
Above race condition can cause read out invalid data in a truncated
page, fix it by i_gc_rwsem[WRITE] lock.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Thread A Background GC
- f2fs_zero_range
- truncate_pagecache_range
- gc_data_segment
- get_read_data_page
- move_data_page
- set_page_dirty
- set_cold_data
- f2fs_do_zero_range
- dn->data_blkaddr = NEW_ADDR;
- f2fs_set_data_blkaddr
Actually, we don't need to set dirty & checked flag on the page, since
all valid data in the page should be zeroed by zero_range().
Use i_gc_rwsem[WRITE] to avoid such race condition.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Pull EFI updates from Thomas Gleixner:
"The EFI pile:
- Make mixed mode UEFI runtime service invocations mutually
exclusive, as mandated by the UEFI spec
- Perform UEFI runtime services calls from a work queue so the calls
into the firmware occur from a kernel thread
- Honor the UEFI memory map attributes for live memory regions
configured by UEFI as a framebuffer. This works around a coherency
problem with KVM guests running on ARM.
- Cleanups, improvements and fixes all over the place"
* 'efi-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
efivars: Call guid_parse() against guid_t type of variable
efi/cper: Use consistent types for UUIDs
efi/x86: Replace references to efi_early->is64 with efi_is_64bit()
efi: Deduplicate efi_open_volume()
efi/x86: Add missing NULL initialization in UGA draw protocol discovery
efi/x86: Merge 32-bit and 64-bit UGA draw protocol setup routines
efi/x86: Align efi_uga_draw_protocol typedef names to convention
efi/x86: Merge the setup_efi_pci32() and setup_efi_pci64() routines
efi/x86: Prevent reentrant firmware calls in mixed mode
efi/esrt: Only call efi_mem_reserve() for boot services memory
fbdev/efifb: Honour UEFI memory map attributes when mapping the FB
efi: Drop type and attribute checks in efi_mem_desc_lookup()
efi/libstub/arm: Add opt-in Kconfig option for the DTB loader
efi: Remove the declaration of efi_late_init() as the function is unused
efi/cper: Avoid using get_seconds()
efi: Use a work queue to invoke EFI Runtime Services
efi/x86: Use non-blocking SetVariable() for efi_delete_dummy_variable()
efi/x86: Clean up the eboot code
The code of ceph_unreserve_caps() and error handling in
ceph_reserve_caps() are duplicated, so introduce a helper
__ceph_unreserve_caps() to reduce duplicated code.
Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
We do not check return code for __do_request() in all callers,
so change to void return type.
Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
In ceph_llseek(), we compare fsc->max_file_size and inode->i_size to
choose max file size limit.
Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
ceph_setattr() finally calls vfs function inode_newsize_ok()
to do offset validation and that is based on sb->s_maxbytes.
Because we set sb->s_maxbytes to MAX_LFS_FILESIZE to through
VFS check and do proper offset validation in cephfs level,
we need adding proper offset validation before calling
inode_newsize_ok().
Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Since the sb write verifier trips on bad icounts, we should also force a
mount time recalculation of the summary counters if the icount is bad.
This helps us avoid blowing up at freeze/unmount time when the bad
counter gets written back out.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
In my testing, v9fs_fid_xattr_set will return successfully even if the
backend ext4 filesystem has no space to store xattr key-value. That will
cause inconsistent behavior between front end and back end. The reason is
that lsetxattr will be triggered by p9_client_clunk, and unfortunately we
did not catch the error. This patch will catch the error to notify upper
caller.
p9_client_clunk (in 9p)
p9_client_rpc(clnt, P9_TCLUNK, "d", fid->fid);
v9fs_clunk (in qemu)
put_fid
free_fid
v9fs_xattr_fid_clunk
v9fs_co_lsetxattr
s->ops->lsetxattr
ext4_xattr_user_set (in host ext4 filesystem)
Link: http://lkml.kernel.org/r/5B57EACC.2060900@huawei.com
Signed-off-by: Jun Piao <piaojun@huawei.com>
Cc: Eric Van Hensbergen <ericvh@gmail.com>
Cc: Ron Minnich <rminnich@sandia.gov>
Cc: Latchesar Ionkov <lucho@ionkov.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: stable@vger.kernel.org
Signed-off-by: Dominique Martinet <dominique.martinet@cea.fr>
fix spelling mistake in pr_info message text
Link: http://lkml.kernel.org/r/20180526150650.10562-1-colin.king@canonical.com
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Cc: Eric Van Hensbergen <ericvh@gmail.com>
Cc: Ron Minnich <rminnich@sandia.gov>
Cc: Latchesar Ionkov <lucho@ionkov.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dominique Martinet <dominique.martinet@cea.fr>
Use new return type vm_fault_t for page_mkwrite handler.
See 1c8f422059 ("mm: change return type to vm_fault_t") for reference.
Link: http://lkml.kernel.org/r/20180702154928.GA3964@jordon-HP-15-Notebook-PC
Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Acked-by: Jun Piao <piaojun@huawei.com>
Cc: Eric Van Hensbergen <ericvh@gmail.com>
Cc: Ron Minnich <rminnich@sandia.gov>
Cc: Latchesar Ionkov <lucho@ionkov.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dominique Martinet <dominique.martinet@cea.fr>
Pull vfs fixes from Al Viro:
"A bunch of race fixes, mostly around lazy pathwalk.
All of it is -stable fodder, a large part going back to 2013"
* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
make sure that __dentry_kill() always invalidates d_seq, unhashed or not
fix __legitimize_mnt()/mntput() race
fix mntput/mntput race
root dentries need RCU-delayed freeing
Fuzzing tool reports a write to null pointer error in the
xfs_bmap_extents_to_btree, fix it by bailing out on encountering
a null pointer.
Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The old lock tracking infrastructure in xfs using the b_last_holder
field seems to only be useful if you can get into the system with a
debugger; it seems that the existing tracepoints would be the way to
go these days, and this old infrastructure can be removed.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Instead of open-coding pos & (PAGE_SIZE - 1) and pos & ~PAGE_MASK, use
the offset_in_page macro.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
This patch is the duplicate of ross's fix for ext4 for xfs.
If the refcount of a page is lowered between the time that it is returned
by dax_busy_page() and when the refcount is again checked in
xfs_break_layouts() => ___wait_var_event(), the waiting function
xfs_wait_dax_page() will never be called. This means that
xfs_break_layouts() will still have 'retry' set to false, so we'll stop
looping and never check the refcount of other pages in this inode.
Instead, always continue looping as long as dax_layout_busy_page() gives us
a page which it found with an elevated refcount.
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
- Add the SPI-NAND framework.
- Create a helper to find the best ECC configuration.
- Create NAND controller operations.
- Allocate dynamically ONFI parameters structure.
- Add defines for ONFI version bits.
- Add manufacturer fixup for ONFI parameter page.
- Add an option to specify NAND chip as a boot device.
- Add Reed-Solomon error correction algorithm.
- Better name for the controller structure.
- Remove unused caller_is_module() definition.
- Make subop helpers return unsigned values.
- Expose _notsupp() helpers for raw page accessors.
- Add default values for dynamic timings.
- Kill the chip->scan_bbt() hook.
- Rename nand_default_bbt() into nand_create_bbt().
- Start to clean the nand_chip structure.
- Remove stale prototype from rawnand.h.
Raw NAND controllers drivers changes:
- Qcom: structuring cleanup.
- Denali: use core helper to find the best ECC configuration.
- Possible build of almost all drivers by adding a dependency on
COMPILE_TEST for almost all of them in Kconfig, implies various
fixes, Kconfig cleanup, GPIO headers inclusion cleanup, and even
changes in sparc64 and ia64 architectures.
- Clean the ->probe() functions error path of a lot of drivers.
- Migrate all drivers to use nand_scan() instead of
nand_scan_ident()/nand_scan_tail() pair.
- Use mtd_device_register() where applicable to simplify the code.
- Marvell:
* Handle on-die ECC.
* Better clocks handling.
* Remove bogus comment.
* Add suspend and resume support.
- Tegra: add NAND controller driver.
- Atmel:
* Add module param to avoid using dma.
* Drop Wenyou Yang from MAINTAINERS.
- Denali: optimize timings handling.
- FSMC: Stop using chip->read_buf().
- FSL:
* Switch to SPDX license tag identifiers.
* Fix qualifiers in MXC init functions.
Raw NAND chip drivers changes:
- Micron:
* Add fixup for ONFI revision.
* Update ecc_stats.corrected.
* Make ECC activation stateful.
* Avoid enabling/disabling ECC when it can't be disabled.
* Get the actual number of bitflips.
* Allow forced on-die ECC.
* Support 8/512 on-die ECC.
* Fix on-die ECC detection logic.
- Hynix:
* Fix decoding the OOB size on H27UCG8T2BTR.
* Use ->exec_op() in hynix_nand_reg_write_op().
-----BEGIN PGP SIGNATURE-----
iQEcBAABCAAGBQJbYYGVAAoJECVq6hhHvVaE0poIAJy+VpZl0jTPQ/oO8TQui9hE
IZbc8LwohCvegYYhiY1cNESyMYamDfoK6M93i/0zTJF2AJAxPl25ldT8N5Wr16DO
5Vfsdjv75V8l0JEY2SvWYmC6glOAYs0UEDdcFNJRMPqUnQz+VvBIafJOCQqzo4ZH
SDnLx3XzOxO4PAPnztWEg50WvaqMPt7ThcqoxThHMcQaLrNjgJUsV0mN+vNEv16Q
6gH6hl1C019k+Kj2Zu0vAifHw1K7gIYT4HvqKwstQ6HYUX2IzIzuEpRIcIze0S5z
XKzZ57USItb3l+Y3YwFBLjgP4N+VTT5X59LxdtCOXJ+YvzgxwtKElRvalNcryYI=
=zkEf
-----END PGP SIGNATURE-----
Merge tag 'nand/for-4.19' of git://git.infradead.org/linux-mtd into mtd/next
Pull NAND updates from Miquel Raynal:
"
NAND core changes:
- Add the SPI-NAND framework.
- Create a helper to find the best ECC configuration.
- Create NAND controller operations.
- Allocate dynamically ONFI parameters structure.
- Add defines for ONFI version bits.
- Add manufacturer fixup for ONFI parameter page.
- Add an option to specify NAND chip as a boot device.
- Add Reed-Solomon error correction algorithm.
- Better name for the controller structure.
- Remove unused caller_is_module() definition.
- Make subop helpers return unsigned values.
- Expose _notsupp() helpers for raw page accessors.
- Add default values for dynamic timings.
- Kill the chip->scan_bbt() hook.
- Rename nand_default_bbt() into nand_create_bbt().
- Start to clean the nand_chip structure.
- Remove stale prototype from rawnand.h.
Raw NAND controllers drivers changes:
- Qcom: structuring cleanup.
- Denali: use core helper to find the best ECC configuration.
- Possible build of almost all drivers by adding a dependency on
COMPILE_TEST for almost all of them in Kconfig, implies various
fixes, Kconfig cleanup, GPIO headers inclusion cleanup, and even
changes in sparc64 and ia64 architectures.
- Clean the ->probe() functions error path of a lot of drivers.
- Migrate all drivers to use nand_scan() instead of
nand_scan_ident()/nand_scan_tail() pair.
- Use mtd_device_register() where applicable to simplify the code.
- Marvell:
* Handle on-die ECC.
* Better clocks handling.
* Remove bogus comment.
* Add suspend and resume support.
- Tegra: add NAND controller driver.
- Atmel:
* Add module param to avoid using dma.
* Drop Wenyou Yang from MAINTAINERS.
- Denali: optimize timings handling.
- FSMC: Stop using chip->read_buf().
- FSL:
* Switch to SPDX license tag identifiers.
* Fix qualifiers in MXC init functions.
Raw NAND chip drivers changes:
- Micron:
* Add fixup for ONFI revision.
* Update ecc_stats.corrected.
* Make ECC activation stateful.
* Avoid enabling/disabling ECC when it can't be disabled.
* Get the actual number of bitflips.
* Allow forced on-die ECC.
* Support 8/512 on-die ECC.
* Fix on-die ECC detection logic.
- Hynix:
* Fix decoding the OOB size on H27UCG8T2BTR.
* Use ->exec_op() in hynix_nand_reg_write_op().
"
We really, really don't want to be encouraging people to use
cifs (the dialect) since it is insecure, so to avoid confusion
we want to move them to names which include 'smb3' instead of
'cifs' - so this simply creates an alias for the pseudo-xattrs
e.g. can now do:
getfattr -n user.smb3.creationtime /mnt1/file
and
getfattr -n user.smb3.dosattrib /mnt1/file
and
getfattr -n system.smb3_acl /mnt1/file
instead of forcing you to use the string 'cifs' in
these (e.g. getfattr -n system.cifs_acl /mnt1/file)
Signed-off-by: Steve French <stfrench@microsoft.com>
Let's reset i_gc_failures to zero when we unset pinned state for file.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
f2fs recovery flow is relying on dnode block link list, it means fsynced
file recovery depends on previous dnode's persistence in the list, so
during fsync() we should wait on all regular inode's dnode writebacked
before issuing flush.
By this way, we can avoid dnode block list being broken by out-of-order
IO submission due to IO scheduler or driver.
Sheng Yong helps to do the test with this patch:
Target:/data (f2fs, -)
64MB / 32768KB / 4KB / 8
1 / PERSIST / Index
Base:
SEQ-RD(MB/s) SEQ-WR(MB/s) RND-RD(IOPS) RND-WR(IOPS) Insert(TPS) Update(TPS) Delete(TPS)
1 867.82 204.15 41440.03 41370.54 680.8 1025.94 1031.08
2 871.87 205.87 41370.3 40275.2 791.14 1065.84 1101.7
3 866.52 205.69 41795.67 40596.16 694.69 1037.16 1031.48
Avg 868.7366667 205.2366667 41535.33333 40747.3 722.21 1042.98 1054.753333
After:
SEQ-RD(MB/s) SEQ-WR(MB/s) RND-RD(IOPS) RND-WR(IOPS) Insert(TPS) Update(TPS) Delete(TPS)
1 798.81 202.5 41143 40613.87 602.71 838.08 913.83
2 805.79 206.47 40297.2 41291.46 604.44 840.75 924.27
3 814.83 206.17 41209.57 40453.62 602.85 834.66 927.91
Avg 806.4766667 205.0466667 40883.25667 40786.31667 603.3333333 837.83 922.0033333
Patched/Original:
0.928332713 0.999074239 0.984300676 1.000957528 0.835398753 0.803303994 0.874141189
It looks like atomic write will suffer performance regression.
I suspect that the criminal is that we forcing to wait all dnode being in
storage cache before we issue PREFLUSH+FUA.
BTW, will commit ("f2fs: don't need to wait for node writes for atomic write")
cause the problem: we will lose data of last transaction after SPO, even if
atomic write return no error:
- atomic_open();
- write() P1, P2, P3;
- atomic_commit();
- writeback data: P1, P2, P3;
- writeback node: N1, N2, N3; <--- If N1, N2 is not writebacked, N3 with fsync_mark is
writebacked, In SPOR, we won't find N3 since node chain is broken, turns out that losing
last transaction.
- preflush + fua;
- power-cut
If we don't wait dnode writeback for atomic_write:
SEQ-RD(MB/s) SEQ-WR(MB/s) RND-RD(IOPS) RND-WR(IOPS) Insert(TPS) Update(TPS) Delete(TPS)
1 779.91 206.03 41621.5 40333.16 716.9 1038.21 1034.85
2 848.51 204.35 40082.44 39486.17 791.83 1119.96 1083.77
3 772.12 206.27 41335.25 41599.65 723.29 1055.07 971.92
Avg 800.18 205.55 41013.06333 40472.99333 744.0066667 1071.08 1030.18
Patched/Original:
0.92108464 1.001526693 0.987425886 0.993268102 1.030180511 1.026942031 0.976702294
SQLite's performance recovers.
Jaegeuk:
"Practically, I don't see db corruption becase of this. We can excuse to lose
the last transaction."
Finally, we decide to keep original implementation of atomic write interface
sematics that we don't wait all dnode writeback before preflush+fua submission.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Return statements in functions returning bool should use true or false
instead of an integer value.
This issue was detected with the help of Coccinelle.
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
There is a subtle race condition to invoke f2fs_bug_on() in shutdown tests. I've
confirmed that the last checkpoint is preserved in consistent state, so it'd be
fine to just return error at this moment.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
PG_checked flag will be set on data page during GC, later, we can
recognize such page by the flag and migrate page to cold segment.
But previously, we don't clear this flag when invalidating data page,
after page redirtying, we will write it into wrong log.
Let's clear PG_checked flag in set_page_dirty() to avoid this.
Signed-off-by: Weichao Guo <guoweichao@huawei.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Rebuild the AGI header items with some help from the rmapbt.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
mounting with the "snapshots=" mount parm allows a read-only
view of a previous version of a file system (see MS-SMB2
and "timewarp" tokens, section 2.2.13.2.6) based on the timestamp
passed in on the snapshots mount parm.
Add processing to optionally send this create context.
Example output:
/mnt1 is mounted with "snapshots=..." and will see an earlier
version of the directory, with three fewer files than /mnt2
the current version of the directory.
root@Ubuntu-17-Virtual-Machine:~/cifs-2.6# cat /proc/mounts | grep cifs
//172.22.149.186/public /mnt1 cifs
ro,relatime,vers=default,cache=strict,username=smfrench,uid=0,noforceuid,gid=0,noforcegid,addr=172.22.149.186,file_mode=0755,dir_mode=0755,soft,nounix,mapposix,rsize=1048576,wsize=1048576,echo_interval=60,snapshot=131748608570000000,actimeo=1
//172.22.149.186/public /mnt2 cifs
rw,relatime,vers=default,cache=strict,username=smfrench,uid=0,noforceuid,gid=0,noforcegid,addr=172.22.149.186,file_mode=0755,dir_mode=0755,soft,nounix,mapposix,rsize=1048576,wsize=1048576,echo_interval=60,actimeo=1
root@Ubuntu-17-Virtual-Machine:~/cifs-2.6# ls /mnt1
EmptyDir newerdir
root@Ubuntu-17-Virtual-Machine:~/cifs-2.6# ls /mnt1/newerdir
root@Ubuntu-17-Virtual-Machine:~/cifs-2.6# ls /mnt2
EmptyDir file newerdir newestdir timestamp-trace.cap
root@Ubuntu-17-Virtual-Machine:~/cifs-2.6# ls /mnt2/newerdir
new-file-not-in-snapshot
Snapshots are extremely useful for comparing previous versions of files or directories,
and recovering from data corruptions or mistakes.
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Reported-by: Xiaoli Feng <xifeng@redhat.com>
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
We were missing the methods for get_acl and friends for the 3.11
dialect.
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
CC: Stable <stable@vger.kernel.org>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
When enumerating snapshots, the last few bytes of the final
snapshot could be left off since we were miscalculating the
length returned (leaving off the sizeof struct SRV_SNAPSHOT_ARRAY)
See MS-SMB2 section 2.2.32.2. In addition fixup the length used
to allow smaller buffer to be passed in, in order to allow
returning the size of the whole snapshot array more easily.
Sample userspace output with a kernel patched with this
(mounted to a Windows volume with two snapshots).
Before this patch, the second snapshot would be missing a
few bytes at the end.
~/cifs-2.6# ~/enum-snapshots /mnt/file
press enter to issue the ioctl to retrieve snapshot information ...
size of snapshot array = 102
Num snapshots: 2 Num returned: 2 Array Size: 102
Snapshot 0:@GMT-2018.06.30-19.34.17
Snapshot 1:@GMT-2018.06.30-19.33.37
CC: Stable <stable@vger.kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Change smb2_queryfs() to use a Create/QueryInfo/Close compound request.
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Paulo Alcantara <palcantara@suse.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Paulo Alcantara <palcantara@suse.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
RCU pathwalk relies upon the assumption that anything that changes
->d_inode of a dentry will invalidate its ->d_seq. That's almost
true - the one exception is that the final dput() of already unhashed
dentry does *not* touch ->d_seq at all. Unhashing does, though,
so for anything we'd found by RCU dcache lookup we are fine.
Unfortunately, we can *start* with an unhashed dentry or jump into
it.
We could try and be careful in the (few) places where that could
happen. Or we could just make the final dput() invalidate the damn
thing, unhashed or not. The latter is much simpler and easier to
backport, so let's do it that way.
Reported-by: "Dae R. Jeong" <threeearcat@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
__legitimize_mnt() has two problems - one is that in case of success
the check of mount_lock is not ordered wrt preceding increment of
refcount, making it possible to have successful __legitimize_mnt()
on one CPU just before the otherwise final mntpu() on another,
with __legitimize_mnt() not seeing mntput() taking the lock and
mntput() not seeing the increment done by __legitimize_mnt().
Solved by a pair of barriers.
Another is that failure of __legitimize_mnt() on the second
read_seqretry() leaves us with reference that'll need to be
dropped by caller; however, if that races with final mntput()
we can end up with caller dropping rcu_read_lock() and doing
mntput() to release that reference - with the first mntput()
having freed the damn thing just as rcu_read_lock() had been
dropped. Solution: in "do mntput() yourself" failure case
grab mount_lock, check if MNT_DOOMED has been set by racing
final mntput() that has missed our increment and if it has -
undo the increment and treat that as "failure, caller doesn't
need to drop anything" case.
It's not easy to hit - the final mntput() has to come right
after the first read_seqretry() in __legitimize_mnt() *and*
manage to miss the increment done by __legitimize_mnt() before
the second read_seqretry() in there. The things that are almost
impossible to hit on bare hardware are not impossible on SMP
KVM, though...
Reported-by: Oleg Nesterov <oleg@redhat.com>
Fixes: 48a066e72d ("RCU'd vsfmounts")
Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
mntput_no_expire() does the calculation of total refcount under mount_lock;
unfortunately, the decrement (as well as all increments) are done outside
of it, leading to false positives in the "are we dropping the last reference"
test. Consider the following situation:
* mnt is a lazy-umounted mount, kept alive by two opened files. One
of those files gets closed. Total refcount of mnt is 2. On CPU 42
mntput(mnt) (called from __fput()) drops one reference, decrementing component
* After it has looked at component #0, the process on CPU 0 does
mntget(), incrementing component #0, gets preempted and gets to run again -
on CPU 69. There it does mntput(), which drops the reference (component #69)
and proceeds to spin on mount_lock.
* On CPU 42 our first mntput() finishes counting. It observes the
decrement of component #69, but not the increment of component #0. As the
result, the total it gets is not 1 as it should've been - it's 0. At which
point we decide that vfsmount needs to be killed and proceed to free it and
shut the filesystem down. However, there's still another opened file
on that filesystem, with reference to (now freed) vfsmount, etc. and we are
screwed.
It's not a wide race, but it can be reproduced with artificial slowdown of
the mnt_get_count() loop, and it should be easier to hit on SMP KVM setups.
Fix consists of moving the refcount decrement under mount_lock; the tricky
part is that we want (and can) keep the fast case (i.e. mount that still
has non-NULL ->mnt_ns) entirely out of mount_lock. All places that zero
mnt->mnt_ns are dropping some reference to mnt and they call synchronize_rcu()
before that mntput(). IOW, if mntput() observes (under rcu_read_lock())
a non-NULL ->mnt_ns, it is guaranteed that there is another reference yet to
be dropped.
Reported-by: Jann Horn <jannh@google.com>
Tested-by: Jann Horn <jannh@google.com>
Fixes: 48a066e72d ("RCU'd vsfmounts")
Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Return statements in functions returning bool should use true or false
instead of an integer value.
This issue was detected with the help of Coccinelle.
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
write_op[] is never modified, so make it 'const'.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
READ_BUF(8);
dummy = be32_to_cpup(p++);
dummy = be32_to_cpup(p++);
...
READ_BUF(4);
dummy = be32_to_cpup(p++);
Assigning value to "dummy" here, but that stored value
is overwritten before it can be used.
At the same time READ_BUF() will re-update the pointer p.
delete invalid assignment statements
Signed-off-by: nixiaoming <nixiaoming@huawei.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trondmy@hammerspace.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
I've given up on the idea of zero-copy handling of SYMLINK on the
server side. This is because the Linux VFS symlink API requires the
symlink pathname to be in a NUL-terminated kmalloc'd buffer. The
NUL-termination is going to be problematic (watching out for
landing on a page boundary and dealing with a 4096-byte pathname).
I don't believe that SYMLINK creation is on a performance path or is
requested frequently enough that it will cause noticeable CPU cache
pollution due to data copies.
There will be two places where a transport callout will be necessary
to fill in the rqstp: one will be in the svc_fill_symlink_pathname()
helper that is used by NFSv2 and NFSv3, and the other will be in
nfsd4_decode_create().
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
fill_in_write_vector() is nearly the same logic as
svc_fill_write_vector(), but there are a few differences so that
the former can handle multiple WRITE payloads in a single COMPOUND.
svc_fill_write_vector() can be adjusted so that it can be used in
the NFSv4 WRITE code path too. Instead of assuming the pages are
coming from rq_args.pages, have the caller pass in the page list.
The immediate benefit is a reduction of code duplication. It also
prevents the NFSv4 WRITE decoder from passing an empty vector
element when the transport has provided the payload in the xdr_buf's
page array.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
In preparation to enabling -Wimplicit-fallthrough, mark switch cases
where we are expecting to fall through.
Warning level 2 was used: -Wimplicit-fallthrough=2
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
nfsd and lockd call vfs_lock_file() to lock/unlock the inode
returned by locks_inode(file).
Many places in nfsd/lockd code use the inode returned by
file_inode(file) for lock manipulation. With Overlayfs, file_inode()
(the underlying inode) is not the same object as locks_inode() (the
overlay inode). This can result in "Leaked POSIX lock" messages
and eventually to a kernel crash as reported by Eddie Horng:
https://marc.info/?l=linux-unionfs&m=153086643202072&w=2
Fix all the call sites in nfsd/lockd that should use locks_inode().
This is a correctness bug that manifested when overlayfs gained
NFS export support in v4.16.
Reported-by: Eddie Horng <eddiehorng.tw@gmail.com>
Tested-by: Eddie Horng <eddiehorng.tw@gmail.com>
Cc: Jeff Layton <jlayton@kernel.org>
Fixes: 8383f17488 ("ovl: wire up NFS export operations")
Cc: stable@vger.kernel.org
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
A COPY with unstable write data needs a simple sync commit.
Filehandle value is gotten as a part of the inner loop so in
case of a reboot retry it should get the new value.
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
If client sent async COPY and server replied with
ERR_OFFLOAD_NO_REQS, client should retry with a synchronous copy.
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
When COPY is killed by the user send OFFLOAD_CANCEL to server
processing the copy.
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Make this function available to nfs42proc.c
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
It's possible that server replies back with CB_OFFLOAD call and
COPY reply at the same time such that client will process
CB_OFFLOAD before reply to COPY. For that keep a list of pending
callback stateids received and then before waiting on completion
check the pending list.
Cleanup any pending copies on the client shutdown.
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Change xdr to always send COPY asynchronously.
Keep the list copies send in a list under a server structure.
Once copy is sent, it waits on a completion structure that will
be signalled by the callback thread that receives CB_OFFLOAD.
If CB_OFFLOAD returned an error and even if it returned partial
bytes, ignore them (as we can't commit without a verifier to
match) and return an error.
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
If server returns async reply, it must include a callback stateid,
wr_callback_id in the write_response4.
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Paulo Alcantara <palcantara@suse.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Reviewed-by: Paulo Alcantara <palcantara@suse.com>
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Reviewed-by: Paulo Alcantara <palcantara@suse.com>
When a direct-write completes, a work_struct is schedule to handle
the completion.
When NFS is being used for swap, the direct write might be a swap-out,
so memory allocation can block until the write completes.
The work queue currently used is not WQ_MEM_RECLAIM, so tasks
can block waiting for memory - this leads to deadlock.
So use nfsiod_workqueue instead. This will always have a running
thread, and work items should never block waiting for memory.
Signed-off-by: Neil Brown <neilb@suse.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Error code is set in the error handling cases but never used. Fix it.
Fixes: 937e3133cd ("NFSv4.1: Ensure we clear the SP4_MACH_CRED flags in nfs4_sp4_select_mode()")
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Return statements in functions returning bool should use true or false
instead of an integer value.
This issue was detected with the help of Coccinelle.
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Layout segment validity is determined only by the NFS_LSEG_VALID flag. If
it is set, the layout segment is finable. As it is, when the flexfiles
driver sets NFS_LSEG_LAYOUTRETURN to indicate that we cannot discard
the layout segment, but that it must be returned, then this can result
in an unnecessary layoutget storm.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
In preparation to enabling -Wimplicit-fallthrough, mark switch cases
where we are expecting to fall through.
Warning level 2 was used: -Wimplicit-fallthrough=2
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
When we update the change attribute, we should also clear the flag that
says it is out of date.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
If the object being renamed from one directory to another is also
a directory, then 'nlink' will change for both directories.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Ensure that we always bump or drop the nlink count on the parent directory
when we do a mkdir or a rmdir(). This needs to be done by hand as we don't
have pre/post op attributes.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
If the server tells us that out layoutreturn raced with another layout
update, then we must ensure that the new layout segments are not in use
before we resend with an updated layout stateid.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
The spinlock handling in this file has changed significantly since this
comment was written, and the file_lock_lock is no more. In addition,
this overall comment no longer applies. Deleting an entry now requires
both locks.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Function update_rgrp_lvb_unlinked used to do the same thing as
be32_add_cpu. This patch removes it in favor of using be32_add_cpu
directly.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Andrew Price <anprice@redhat.com>
When CONFIG_CIFS_STATS2 is enabled keep counters for slow
commands (ie server took longer than 1 second to respond)
by SMB2/SMB3 command code. This can help in diagnosing
whether performance problems are on server (instead of
client) and which commands are causing the problem.
Sample output (the new lines contain words "slow responses ...")
$ cat /proc/fs/cifs/Stats
Resources in use
CIFS Session: 1
Share (unique mount targets): 2
SMB Request/Response Buffer: 1 Pool size: 5
SMB Small Req/Resp Buffer: 1 Pool size: 30
Total Large 10 Small 490 Allocations
Operations (MIDs): 0
0 session 0 share reconnects
Total vfs operations: 67 maximum at one time: 2
4 slow responses from localhost for command 5
1 slow responses from localhost for command 6
1 slow responses from localhost for command 14
1 slow responses from localhost for command 16
1) \\localhost\test
SMBs: 243
Bytes read: 1024000 Bytes written: 104857600
TreeConnects: 1 total 0 failed
TreeDisconnects: 0 total 0 failed
Creates: 40 total 0 failed
Closes: 39 total 0 failed
...
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
server->secmech.sdeschmacsha256 is not properly initialized before
smb2_shash_allocate(), set shash after that call.
also fix typo in error message
Fixes: 8de8c4608f ("cifs: Fix validation of signed data in smb2")
Signed-off-by: Aurelien Aptel <aaptel@suse.com>
Reviewed-by: Paulo Alcantara <palcantara@suse.com>
Reported-by: Xiaoli Feng <xifeng@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
CC: Stable <stable@vger.kernel.org>
An earlier commit had a typo which prevented the
optimization from working:
commit 18dd8e1a65 ("Do not send SMB3 SET_INFO request if nothing is changing")
Thank you to Metze for noticing this. Also clear a
reserved field in the FILE_BASIC_INFO struct we send
that should be zero (all the other fields in that
struct were set or cleared explicitly already in
cifs_set_file_info).
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
CC: Stable <stable@vger.kernel.org> # 4.9.x+
Reported-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
CONFIG_CIFS_STATS is now always enabled (to simplify the
code and since the STATS are important for some common
customer use cases and also debugging), but needed one
minor change so that STATS shows as enabled in the debug
output in /proc/fs/cifs/DebugData, otherwise it could
get confusing with STATS no longer showing up in the
"Features" list in /proc/fs/cifs/DebugData when basic
stats were in fact available.
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
If responses take longer than one second from the server,
we can optionally log them to dmesg in current cifs.ko code
(CONFIG_CIFS_STATS2 must be configured and a
/proc/fs/cifs/cifsFYI flag must be set), but can be more useful
to log these via ftrace (tracepoint is smb3_slow_rsp) which
is easier and more granular (still requires CONFIG_CIFS_STATS2
to be configured in the build though).
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
These are used for SMB3 encryption and compounded requests.
Update these functions and the other functions related to SMB3 encryption to
take an array of requests.
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
echo 0 > /proc/fs/cifs/Stats is supposed to reset the stats
but there were four (see example below) that were not reset
(bytes read and witten, total vfs ops and max ops
at one time).
...
0 session 0 share reconnects
Total vfs operations: 100 maximum at one time: 2
1) \\localhost\test
SMBs: 0
Bytes read: 502092 Bytes written: 31457286
TreeConnects: 0 total 0 failed
TreeDisconnects: 0 total 0 failed
...
This patch fixes cifs_stats_proc_write to properly reset
those four.
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
We were only displaying bytes_read and bytes_written in cifs
stats, fix smb3 stats to also display them. Sample output
with this patch:
cat /proc/fs/cifs/Stats:
CIFS Session: 1
Share (unique mount targets): 2
SMB Request/Response Buffer: 1 Pool size: 5
SMB Small Req/Resp Buffer: 1 Pool size: 30
Operations (MIDs): 0
0 session 0 share reconnects
Total vfs operations: 94 maximum at one time: 2
1) \\localhost\test
SMBs: 214
Bytes read: 502092 Bytes written: 31457286
TreeConnects: 1 total 0 failed
TreeDisconnects: 0 total 0 failed
Creates: 52 total 3 failed
Closes: 48 total 0 failed
Flushes: 0 total 0 failed
Reads: 17 total 0 failed
Writes: 31 total 0 failed
...
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
CONFIG_CIFS_STATS should always be enabled as Pavel recently
noted. Simple statistics are not a significant performance hit,
and removing the ifdef simplifies the code slightly.
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Cc: <stable@vger.kernel.org>
Add tracepoints for reconnecting an smb3 session
Example output (from trace-cmd) with the patch
(showing the session marked for reconnect, the stat failing, and then
the subsequent SMB3 commands after the server comes back up).
The "smb3_reconnect" event is the new one.
cifsd-25993 [000] .... 29635.368265: smb3_reconnect: server=localhost current_mid=0x1e
stat-26200 [001] .... 29638.516403: smb3_enter: cifs_revalidate_dentry_attr: xid=22
stat-26200 [001] .... 29648.723296: smb3_exit_err: cifs_revalidate_dentry_attr: xid=22 rc=-112
kworker/0:1-22830 [000] .... 29653.850947: smb3_cmd_done: sid=0x0 tid=0x0 cmd=0 mid=0
kworker/0:1-22830 [000] .... 29653.851191: smb3_cmd_err: sid=0x8ae4683c tid=0x0 cmd=1 mid=1 status=0xc0000016 rc=-5
kworker/0:1-22830 [000] .... 29653.855254: smb3_cmd_done: sid=0x8ae4683c tid=0x0 cmd=1 mid=2
kworker/0:1-22830 [000] .... 29653.855482: smb3_cmd_done: sid=0x8ae4683c tid=0x8084f30d cmd=3 mid=3
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
In debugging reconnection problems, want to be able to more easily
trace cases in which the server has marked the SMB3 session
expired or deleted (to distinguish from timeout cases).
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
These timers were a good idea but weren't used in current code,
and the idea was cifs specific. Future patch will add similar timers
for SMB2/SMB3, but no sense using memory for cifs timers that
aren't used in current code.
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Fixes problem pointed out by Pavel in discussions about commit
729c0c9dd5
Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
CC: Stable <stable@vger.kernel.org> # 3.18.x+
Remove counters from the per-tree connection /proc/fs/cifs/Stats
output that will always be zero (since they are not per-tcon ops)
ie SMB3 Negotiate, SessionSetup, Logoff, Echo, Cancel.
Also clarify "sent" to be "total" per-Pavel's suggestion
(since this "total" includes total for all operations that we try to
send whether or not succesffully sent). Sample output below:
Resources in use
CIFS Session: 1
Share (unique mount targets): 2
SMB Request/Response Buffer: 1 Pool size: 5
SMB Small Req/Resp Buffer: 1 Pool size: 30
Operations (MIDs): 0
1 session 2 share reconnects
Total vfs operations: 23 maximum at one time: 2
1) \\localhost\test
SMBs: 45
TreeConnects: 2 total 0 failed
TreeDisconnects: 0 total 0 failed
Creates: 13 total 2 failed
Closes: 9 total 0 failed
Flushes: 0 total 0 failed
Reads: 0 total 0 failed
Writes: 1 total 0 failed
Locks: 0 total 0 failed
IOCTLs: 3 total 1 failed
QueryDirectories: 4 total 2 failed
ChangeNotifies: 0 total 0 failed
QueryInfos: 10 total 0 failed
SetInfos: 3 total 0 failed
OplockBreaks: 0 sent 0 failed
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
For SMB2/SMB3 the number of requests sent was not displayed
in /proc/fs/cifs/Stats unless CONFIG_CIFS_STATS2 was
enabled (only number of failed requests displayed). As
with earlier dialects, we should be displaying these
counters if CONFIG_CIFS_STATS is enabled. They
are important for debugging.
e.g. when you cat /proc/fs/cifs/Stats (before the patch)
Resources in use
CIFS Session: 1
Share (unique mount targets): 2
SMB Request/Response Buffer: 1 Pool size: 5
SMB Small Req/Resp Buffer: 1 Pool size: 30
Operations (MIDs): 0
0 session 0 share reconnects
Total vfs operations: 690 maximum at one time: 2
1) \\localhost\test
SMBs: 975
Negotiates: 0 sent 0 failed
SessionSetups: 0 sent 0 failed
Logoffs: 0 sent 0 failed
TreeConnects: 0 sent 0 failed
TreeDisconnects: 0 sent 0 failed
Creates: 0 sent 2 failed
Closes: 0 sent 0 failed
Flushes: 0 sent 0 failed
Reads: 0 sent 0 failed
Writes: 0 sent 0 failed
Locks: 0 sent 0 failed
IOCTLs: 0 sent 1 failed
Cancels: 0 sent 0 failed
Echos: 0 sent 0 failed
QueryDirectories: 0 sent 63 failed
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
snapshot mounts were not marked as read-only and did not display the snapshot
time (in /proc/mounts) specified on mount
With this patch - note that can not write to the snapshot mount (see "ro" in
/proc/mounts line) and also the missing snapshot timewarp token time is
dumped. Sample line from /proc/mounts with the patch:
//127.0.0.1/scratch /mnt2 smb3 ro,relatime,vers=default,cache=strict,username=testuser,domain=,uid=0,noforceuid,gid=0,noforcegid,addr=127.0.0.1,file_mode=0755,dir_mode=0755,soft,nounix,serverino,mapposix,noperm,rsize=1048576,wsize=1048576,echo_interval=60,snapshot=1234567,actimeo=1 0 0
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Paulo Alcantara <palcantara@suse.de>
Some servers, like Samba, don't support the fsctl for
query_network_interface_info so don't log a noisy warning
message on mount for this by default unless the error is more serious.
Lower the error to an FYI level so it does not get logged by
default.
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
We really, really want to be encouraging use of secure dialects,
and SMB3.1.1 offers useful security features, and will soon
be the recommended dialect for many use cases. Simplify the code
by removing the CONFIG_CIFS_SMB311 ifdef so users don't disable
it in the build, and create compatibility and/or security issues
with modern servers - many of which have been supporting this
dialect for multiple years.
Also clarify some of the Kconfig text for cifs.ko about
SMB3.1.1 and current supported features in the module.
Signed-off-by: Steve French <stfrench@microsoft.com>
Acked-by: Aurelien Aptel <aaptel@suse.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
/proc/fs/cifs/DebugData displays the features (Kconfig options)
used to build cifs.ko but it was missing some, and needed comma
separator. These can be useful in debugging certain problems
so we know which optional features were enabled in the user's build.
Also clarify them, by making them more closely match the
corresponding CONFIG_CIFS_* parm.
Old format:
Features: dfs fscache posix spnego xattr acl
New format:
Features: DFS,FSCACHE,SMB_DIRECT,STATS,DEBUG2,ALLOW_INSECURE_LEGACY,CIFS_POSIX,UPCALL(SPNEGO),XATTR,ACL
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Reviewed-by: Paulo Alcantara <palcantara@suse.de>
CC: Stable <stable@vger.kernel.org>
Output now matches expected stat -f output for all fields
except for Namelen and ID which were addressed in a companion
patch (which retrieves them from existing SMB3 mechanisms
and works whether POSIX enabled or not)
Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Fil in the correct namelen (typically 255 not 4096) in the
statfs response and also fill in a reasonably unique fsid
(in this case taken from the volume id, and the creation time
of the volume).
In the case of the POSIX statfs all fields are now filled in,
and in the case of non-POSIX mounts, all fields are filled
in which can be.
Signed-off-by: Steve French <stfrench@gmail.com>
CC: Stable <stable@vger.kernel.org>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Check if every data page is signed correctly in sigining helper.
Signed-off-by: Paulo Alcantara <palcantara@suse.de>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
also fixes error code in smb311_posix_mkdir() (where
the error assignment needs to go before the goto)
a typo that Dan Carpenter and Paulo and Gustavo
pointed out.
Signed-off-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Reviewed-by: Paulo Alcantara <palcantara@suse.de>
Signed-off-by: Steve French <stfrench@microsoft.com>
allow disabling cifs (SMB1 ie vers=1.0) and vers=2.0 in the
config for the build of cifs.ko if want to always prevent mounting
with these less secure dialects.
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Reviewed-by: Jeremy Allison <jra@samba.org>
If user specifies "posix" on an SMB3.11 mount, then fail the mount
if server does not return the POSIX negotiate context indicating
support for posix.
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
In the fscache, we just need the timestamps as cookies to check for
changes, so we don't really care about the overflow, but it's better
to stop using the deprecated timespec so we don't have to go through
explicit conversion functions.
To avoid comparing uninitialized padding values that are copied
while assigning the timespec values, this rearranges the members of
cifs_fscache_inode_auxdata to avoid padding, and assigns them
individually.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Paulo Alcantara <palcantara@suse.de>
Signed-off-by: Steve French <stfrench@microsoft.com>
In cifs, the timestamps are stored in memory in the cifs_fattr structure,
which uses the deprecated 'timespec' structure. Now that the VFS code
has moved on to 'timespec64', the next step is to change over the fattr
as well.
This also makes 32-bit and 64-bit systems behave the same way, and
no longer overflow the 32-bit time_t in year 2038.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Paulo Alcantara <palcantara@suse.de>
Signed-off-by: Steve French <stfrench@microsoft.com>
This is not really a runtime issue but Smatch complains that:
fs/cifs/smb2ops.c:1740 smb2_query_symlink()
error: uninitialized symbol 'resp_buftype'.
The warning is right that it can be uninitialized... Also "err_buf"
would be NULL at this point and we're not supposed to pass NULLs to
free_rsp_buf() or it might trigger some extra output if we turn on
debugging.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Paulo Alcantara <palcantara@suse.de>
Signed-off-by: Steve French <stfrench@microsoft.com>
Colin Ian King reports that commit 82ff27bc52 ("xfs: automatic dfops
buffer relogging") leaves around some dead error handling code in
xfs_dquot_disk_alloc(). This was discovered via Coverity scan.
Since the associated commit eliminates the act of joining a buffer
to a dfops, this intermediate error state is no longer possible and
the error handling code can be removed. Since the caller cancels the
transaction on error, which cancels the dfops, eliminate the
unnecessary xfs_defer_cancel() call and error handling labels.
Fixes: 82ff27bc52 ("xfs: automatic dfops buffer relogging")
Reported-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
This adds ordering of the updates and makes sure we always see the if_seq
update before the extent tree is modified.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Function gfs2_testbit is called in three places. Two of those places,
gfs2_alloc_extent and gfs2_unaligned_extlen, should be using the clone
bitmaps, not the "real" bitmaps. Function gfs2_unaligned_extlen is used
by the block reservations scheme to determine the length of an extent of
free blocks. Before this patch, it wasn't using the clone bitmap, which
means recently-freed blocks were treated as free blocks for the purposes
of an allocation.
This patch adds a new parameter to gfs2_testbit to indicate whether or
not the clone bitmaps should be used (if available).
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
Since mountpoint crossing can happen without leaving lazy mode,
root dentries do need the same protection against having their
memory freed without RCU delay as everything else in the tree.
It's partially hidden by RCU delay between detaching from the
mount tree and dropping the vfsmount reference, but the starting
point of pathwalk can be on an already detached mount, in which
case umount-caused RCU delay has already passed by the time the
lazy pathwalk grabs rcu_read_lock(). If the starting point
happens to be at the root of that vfsmount *and* that vfsmount
covers the entire filesystem, we get trouble.
Fixes: 48a066e72d ("RCU'd vsfmounts")
Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
When btrfs hits error after modifying fs_devices in
btrfs_init_new_device() (such as btrfs_add_dev_item() returns error), it
leaves everything as is, but frees allocated btrfs_device. As a result,
fs_devices->devices and fs_devices->alloc_list contain already freed
btrfs_device, leading to later use-after-free bug.
Error path also messes the things like ->num_devices. While they go back
to the original value by unscanning btrfs devices, it is safe to revert
them here.
Fixes: 79787eaab4 ("btrfs: replace many BUG_ONs with proper error handling")
Signed-off-by: Naohiro Aota <naota@elisp.net>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It's entirely possible that a crafted btrfs image contains overlapping
chunks.
Although we can't detect such problem by tree-checker, it's not a
catastrophic problem, current extent map can already detect such problem
and return -EEXIST.
We just only need to exit gracefully and fail the mount.
Reported-by: Xu Wen <wen.xu@gatech.edu>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=200409
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This patch will introduce chunk <-> dev extent mapping check, to protect
us against invalid dev extents or chunks.
Since chunk mapping is the fundamental infrastructure of btrfs, extra
check at mount time could prevent a lot of unexpected behavior (BUG_ON).
Reported-by: Xu Wen <wen.xu@gatech.edu>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=200403
Link: https://bugzilla.kernel.org/show_bug.cgi?id=200407
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Su Yue <suy.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If a crafted image has missing block group items, it could cause
unexpected behavior and breaks the assumption of 1:1 chunk<->block group
mapping.
Although we have the block group -> chunk mapping check, we still need
chunk -> block group mapping check.
This patch will do extra check to ensure each chunk has its
corresponding block group.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=199847
Reported-by: Xu Wen <wen.xu@gatech.edu>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Gu Jinxiang <gujx@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
A crafted btrfs image with incorrect chunk<->block group mapping will
trigger a lot of unexpected things as the mapping is essential.
Although the problem can be caught by block group item checker
added in "btrfs: tree-checker: Verify block_group_item", it's still not
sufficient. A sufficiently valid block group item can pass the check
added by the mentioned patch but could fail to match the existing chunk.
This patch will add extra block group -> chunk mapping check, to ensure
we have a completely matching (start, len, flags) chunk for each block
group at mount time.
Here we reuse the original helper find_first_block_group(), which is
already doing the basic bg -> chunk checks, adding further checks of the
start/len and type flags.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=199837
Reported-by: Xu Wen <wen.xu@gatech.edu>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Su Yue <suy.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When doing an incremental send, if we have a file in the parent snapshot
that has prealloc extents beyond EOF and in the send snapshot it got a
hole punch that partially covers the prealloc extents, the send stream,
when replayed by a receiver, can result in a file that has a size bigger
than it should and filled with zeroes past the correct EOF.
For example:
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ xfs_io -f -c "falloc -k 0 4M" /mnt/foobar
$ xfs_io -c "pwrite -S 0xea 0 1M" /mnt/foobar
$ btrfs subvolume snapshot -r /mnt /mnt/snap1
$ btrfs send -f /tmp/1.send /mnt/snap1
$ xfs_io -c "fpunch 1M 2M" /mnt/foobar
$ btrfs subvolume snapshot -r /mnt /mnt/snap2
$ btrfs send -f /tmp/2.send -p /mnt/snap1 /mnt/snap2
$ stat --format %s /mnt/snap2/foobar
1048576
$ md5sum /mnt/snap2/foobar
d31659e82e87798acd4669a1e0a19d4f /mnt/snap2/foobar
$ umount /mnt
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt
$ btrfs receive -f /mnt/1.snap /mnt
$ btrfs receive -f /mnt/2.snap /mnt
$ stat --format %s /mnt/snap2/foobar
3145728
# --> should be 1Mb and not 3Mb (which was the end offset of hole
# punch operation)
$ md5sum /mnt/snap2/foobar
117baf295297c2a995f92da725b0b651 /mnt/snap2/foobar
# --> should be d31659e82e87798acd4669a1e0a19d4f as in the original fs
This issue actually happens only since commit ffa7c4296e ("Btrfs: send,
do not issue unnecessary truncate operations"), but before that commit we
were issuing a write operation full of zeroes (to "punch" a hole) which
was extending the file size beyond the correct value and then immediately
issue a truncate operation to the correct size and undoing the previous
write operation. Since the send protocol does not support fallocate, for
extent preallocation and hole punching, fix this by not even attempting
to send a "hole" (regular write full of zeroes) if it starts at an offset
greater then or equals to the file's size. This approach, besides being
much more simple then making send issue the truncate operation, adds the
benefit of avoiding the useless pair of write of zeroes and truncate
operations, saving time and IO at the receiver and reducing the size of
the send stream.
A test case for fstests follows soon.
Fixes: ffa7c4296e ("Btrfs: send, do not issue unnecessary truncate operations")
CC: stable@vger.kernel.org # 4.17+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Don't open-code iget_failed(), don't bother with btrfs_free_path(NULL),
move handling of positive return values of btrfs_lookup_inode() from
btrfs_read_locked_inode() to btrfs_iget() and kill now obviously
pointless ASSERT() in there.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We don't need to check is_bad_inode() after the call of
btrfs_read_locked_inode() - it's exactly the same as checking return
value for being non-zero.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Reviewed-by: David Sterba <dsterba@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
IS_ERR(p) && PTR_ERR(p) == n is a weird way to spell p == ERR_PTR(n).
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Reviewed-by: David Sterba <dsterba@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
Just get rid of pointless checks.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
on-disk devs stats value is updated in btrfs_run_dev_stats(),
which is called during commit transaction, if device->dev_stats_ccnt
is not zero.
Since current replace operation does not touch dev_stats_ccnt,
on-disk dev stats value is not updated. Therefore "btrfs device stats"
may return old device's value after umount/mount
(Example: See "btrfs ins dump-t -t DEV $DEV" after btrfs/100 finish).
Fix this by just incrementing dev_stats_ccnt in
btrfs_dev_replace_finishing() when replace is succeeded and this will
update the values.
Signed-off-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is no user of this function anymore.
This was forgotten to be removed in commit a575ceeb13
("Btrfs: get rid of unused orphan infrastructure").
Signed-off-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Use ERR_CAST() instead of void * to make meaning clear.
Signed-off-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Although it is safe to call this on already released paths with no locks
held or extent buffers, removing the redundant btrfs_release_path is
reasonable.
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All callers pass the root tree of dir, we can push that down to the
function itself.
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It can be referenced from the passed transaction handle.
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It can be referenced from the passed transaction handle.
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It can be referenced from the passed transaction handle.
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It can be referenced from the passed transaction handle.
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It can be referenced from the passed transaction handle.
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Leftover after fix e339a6b097 ("Btrfs: __btrfs_mod_ref should always
use no_quota"), that removed it from the function calls but not the
structure.
Signed-off-by: David Sterba <dsterba@suse.com>
The more common use case of send involves creating a RO snapshot and then
use it for a send operation. In this case it's not possible to have inodes
in the snapshot that have a link count of zero (inode with an orphan item)
since during snapshot creation we do the orphan cleanup. However, other
less common use cases for send can end up seeing inodes with a link count
of zero and in this case the send operation fails with a ENOENT error
because any attempt to generate a path for the inode, with the purpose
of creating it or updating it at the receiver, fails since there are no
inode reference items. One use case it to use a regular subvolume for
a send operation after turning it to RO mode or turning a RW snapshot
into RO mode and then using it for a send operation. In both cases, if a
file gets all its hard links deleted while there is an open file
descriptor before turning the subvolume/snapshot into RO mode, the send
operation will encounter an inode with a link count of zero and then
fail with errno ENOENT.
Example using a full send with a subvolume:
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ btrfs subvolume create /mnt/sv1
$ touch /mnt/sv1/foo
$ touch /mnt/sv1/bar
# keep an open file descriptor on file bar
$ exec 73</mnt/sv1/bar
$ unlink /mnt/sv1/bar
# Turn the subvolume to RO mode and use it for a full send, while
# holding the open file descriptor.
$ btrfs property set /mnt/sv1 ro true
$ btrfs send -f /tmp/full.send /mnt/sv1
At subvol /mnt/sv1
ERROR: send ioctl failed with -2: No such file or directory
Example using an incremental send with snapshots:
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ btrfs subvolume create /mnt/sv1
$ touch /mnt/sv1/foo
$ touch /mnt/sv1/bar
$ btrfs subvolume snapshot -r /mnt/sv1 /mnt/snap1
$ echo "hello world" >> /mnt/sv1/bar
$ btrfs subvolume snapshot -r /mnt/sv1 /mnt/snap2
# Turn the second snapshot to RW mode and delete file foo while
# holding an open file descriptor on it.
$ btrfs property set /mnt/snap2 ro false
$ exec 73</mnt/snap2/foo
$ unlink /mnt/snap2/foo
# Set the second snapshot back to RO mode and do an incremental send.
$ btrfs property set /mnt/snap2 ro true
$ btrfs send -f /tmp/inc.send -p /mnt/snap1 /mnt/snap2
At subvol /mnt/snap2
ERROR: send ioctl failed with -2: No such file or directory
So fix this by ignoring inodes with a link count of zero if we are either
doing a full send or if they do not exist in the parent snapshot (they
are new in the send snapshot), and unlink all paths found in the parent
snapshot when doing an incremental send (and ignoring all other inode
items, such as xattrs and extents).
A test case for fstests follows soon.
CC: stable@vger.kernel.org # 4.4+
Reported-by: Martin Wilck <martin.wilck@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we end up with logging an inode reference item which has the same name
but different index from the one we have persisted, we end up failing when
replaying the log with an errno value of -EEXIST. The error comes from
btrfs_add_link(), which is called from add_inode_ref(), when we are
replaying an inode reference item.
Example scenario where this happens:
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ touch /mnt/foo
$ ln /mnt/foo /mnt/bar
$ sync
# Rename the first hard link (foo) to a new name and rename the second
# hard link (bar) to the old name of the first hard link (foo).
$ mv /mnt/foo /mnt/qwerty
$ mv /mnt/bar /mnt/foo
# Create a new file, in the same parent directory, with the old name of
# the second hard link (bar) and fsync this new file.
# We do this instead of calling fsync on foo/qwerty because if we did
# that the fsync resulted in a full transaction commit, not triggering
# the problem.
$ touch /mnt/bar
$ xfs_io -c "fsync" /mnt/bar
<power fail>
$ mount /dev/sdb /mnt
mount: mount /dev/sdb on /mnt failed: File exists
So fix this by checking if a conflicting inode reference exists (same
name, same parent but different index), removing it (and the associated
dir index entries from the parent inode) if it exists, before attempting
to add the new reference.
A test case for fstests follows soon.
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we're trying to make a data reservation and we have to allocate a
data chunk we could leak ret == 1, as do_chunk_alloc() will return 1 if
it allocated a chunk. Since the end of the function is the success path
just return 0.
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The exported helper just calls the static one. There's no obvious reason
to have them separate eg. for performance reasons where the static one
could be better optimized in the same unit. There's a slight decrease in
code size and stack consumption.
Signed-off-by: David Sterba <dsterba@suse.com>
Lock owner and nesting level have been unused since day 1, probably
copy&pasted from the extent_buffer locking scheme without much thinking.
The locking of device replace is simpler and does not need any lock
nesting.
Signed-off-by: David Sterba <dsterba@suse.com>
Added in 58176a9604 ("Btrfs: Add per-root block accounting and sysfs
entries") in 2007, the roots had names exported in sysfs. The code
was commented out in 4df27c4d5c ("Btrfs: change how subvolumes
are organized") and cleaned by 182608c829 ("btrfs: remove old
unused commented out code").
Signed-off-by: David Sterba <dsterba@suse.com>
Requiring a read-write descriptor conflicts both ways with exec,
returning ETXTBSY whenever you try to defrag a program that's currently
being run, or causing intermittent exec failures on a live system being
defragged.
As defrag doesn't change the file's contents in any way, there's no
reason to consider it a rw operation. Thus, let's check only whether
the file could have been opened rw. Such access control is still needed
as currently defrag can use extra disk space, and might trigger bugs.
We return EINVAL when the request is invalid; here it's ok but merely
the user has insufficient privileges. Thus, the EPERM return value
reflects the error better -- as discussed in the identical case for
dedupe.
According to codesearch.debian.net, no userspace program distinguishes
these values beyond strerror().
Signed-off-by: Adam Borowski <kilobyte@angband.pl>
Reviewed-by: David Sterba <dsterba@suse.com>
[ fold the EPERM patch from Adam ]
Signed-off-by: David Sterba <dsterba@suse.com>
We recently ran into the following deadlock involving
btrfs_write_inode():
[ +0.005066] __schedule+0x38e/0x8c0
[ +0.007144] schedule+0x36/0x80
[ +0.006447] bit_wait+0x11/0x60
[ +0.006446] __wait_on_bit+0xbe/0x110
[ +0.007487] ? bit_wait_io+0x60/0x60
[ +0.007319] __inode_wait_for_writeback+0x96/0xc0
[ +0.009568] ? autoremove_wake_function+0x40/0x40
[ +0.009565] inode_wait_for_writeback+0x21/0x30
[ +0.009224] evict+0xb0/0x190
[ +0.006099] iput+0x1a8/0x210
[ +0.006103] btrfs_run_delayed_iputs+0x73/0xc0
[ +0.009047] btrfs_commit_transaction+0x799/0x8c0
[ +0.009567] btrfs_write_inode+0x81/0xb0
[ +0.008008] __writeback_single_inode+0x267/0x320
[ +0.009569] writeback_sb_inodes+0x25b/0x4e0
[ +0.008702] wb_writeback+0x102/0x2d0
[ +0.007487] wb_workfn+0xa4/0x310
[ +0.006794] ? wb_workfn+0xa4/0x310
[ +0.007143] process_one_work+0x150/0x410
[ +0.008179] worker_thread+0x6d/0x520
[ +0.007490] kthread+0x12c/0x160
[ +0.006620] ? put_pwq_unlocked+0x80/0x80
[ +0.008185] ? kthread_park+0xa0/0xa0
[ +0.007484] ? do_syscall_64+0x53/0x150
[ +0.007837] ret_from_fork+0x29/0x40
Writeback calls:
btrfs_write_inode
btrfs_commit_transaction
btrfs_run_delayed_iputs
If iput() is called on that same inode, evict() will wait for writeback
forever.
btrfs_write_inode() was originally added way back in 4730a4bc5b
("btrfs_dirty_inode") to support O_SYNC writes. However, ->write_inode()
hasn't been used for O_SYNC since 148f948ba8 ("vfs: Introduce new
helpers for syncing after writing to O_SYNC file or IS_SYNC inode"), so
btrfs_write_inode() is actually unnecessary (and leads to a bunch of
unnecessary commits). Get rid of it, which also gets rid of the
deadlock.
CC: stable@vger.kernel.org # 3.2+
Signed-off-by: Josef Bacik <jbacik@fb.com>
[Omar: new commit message]
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It can be referenced from the passed transaction handle.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It can be referenced from the passed transaction handle.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function is always passed a well-formed tgtdevice so the fs_info
can be referenced from there.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It can be referenced from the passed 'device' argument which is always
a well-formed device.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It can be referenced from the passed transaction handle.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It can be referenced from the passed srcdev argument.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It can be referenced form the passed transaction handle.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In find_free_extent() under checks: label, we have the following code:
search_start = ALIGN(offset, fs_info->stripesize);
/* move on to the next group */
if (search_start + num_bytes >
block_group->key.objectid + block_group->key.offset) {
btrfs_add_free_space(block_group, offset, num_bytes);
goto loop;
}
if (offset < search_start)
btrfs_add_free_space(block_group, offset,
search_start - offset);
BUG_ON(offset > search_start);
However ALIGN() is rounding up, thus @search_start >= @offset and that
BUG_ON() will never be triggered.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At send.c:full_send_tree() we were setting the 'key' variable in the loop
while never using it later. We were also using two btrfs_key variables
to store the initial key for search and the key found in every iteration
of the loop. So remove this useless key assignment and use the same
btrfs_key variable to store the initial search key and the key found in
each iteration. This was introduced in the initial send commit but was
never used (commit 31db9f7c23 ("Btrfs: introduce BTRFS_IOC_SEND for
btrfs send/receive").
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The data and metadata callback implementation both use the same
function. We can remove the call indirection and intermediate helper
completely.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The data and metadata callback implementation both use the same
function. We can remove the call indirection completely.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All implementations of the callback are trivial and do the same and
there's only one user. Merge everything together.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The end_io callbacks passed to btrfs_wq_submit_bio
(btrfs_submit_bio_done and btree_submit_bio_done) are effectively the
same code, there's no point to do the indirection. Export
btrfs_submit_bio_done and call it directly.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
After splitting the start and end hooks in a758781d4b ("btrfs:
separate types for submit_bio_start and submit_bio_done"), some of
the function arguments were dropped but not removed from the structure.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Introduced by c6100a4b4e ("Btrfs: replace tree->mapping with
tree->private_data") to be used in run_one_async_done where it got
unused after 736cd52e0c ("Btrfs: remove nr_async_submits and
async_submit_draining").
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Reported in https://bugzilla.kernel.org/show_bug.cgi?id=199839, with an
image that has an invalid chunk type but does not return an error.
Add chunk type check in btrfs_check_chunk_valid, to detect the wrong
type combinations.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=199839
Reported-by: Xu Wen <wen.xu@gatech.edu>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Gu Jinxiang <gujx@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
EXTENT_BUFFER_DUMMY is an awful name for this flag. Buffers which have
this flag set are not in any way dummy. Rather, they are private in the
sense that are not mapped and linked to the global buffer tree. This
flag has subtle implications to the way free_extent_buffer works for
example, as well as controls whether page->mapping->private_lock is held
during extent_buffer release. Pages for an unmapped buffer cannot be
under io, nor can they be written by a 3rd party so taking the lock is
unnecessary.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ EXTENT_BUFFER_UNMAPPED, update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
Remove stale comment since there is no longer an eb->eb_lock and
document the locking expectation with a lockdep_assert_held statement.
No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function used to release one page (and always the first one), but
not anymore since a50924e3a4 ("btrfs: drop constant param
from btrfs_release_extent_buffer_page"). Update the name and comment.
Signed-off-by: David Sterba <dsterba@suse.com>
The purpose of the function is to free all the pages comprising an
extent buffer. This can be achieved with a simple for loop rather than
the slightly more involved 'do {} while' construct. So rewrite the
loop using a 'for' construct. Additionally we can never have an
extent_buffer that has 0 pages so remove the check for index == 0. No
functional changes.
The reversed order used to have a meaning in the past where the first
page served as a blocking point for several callers. See eg
4f2de97ace ("Btrfs: set page->private to the eb").
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit eb14ab8ed2 ("Btrfs: fix page->private races") fixed a genuine
race between extent buffer initialisation and btree_releasepage.
Unfortunately as the code has evolved the comments weren't changed which
made them slightly wrong and they weren't very clear in the fist place.
Fix this by (hopefully) rewording them in a more approachable manner.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Current version of the page unlocking code was added in
727011e07c ("Btrfs: allow metadata blocks larger than the page size")
but even in this commit that particular flag was never used per-se. In
fact, btrfs only uses PageChecked for data pages to identify pages
which have been dirtied but don't have ORDERED bit set. For more
information see 247e743cbe ("Btrfs: Use async helpers to deal with
pages that have been improperly dirtied").
However, this doesn't apply to extent buffer pages. The important bit
here is that the pages are unlocked AFTER the extent buffer has been
properly recorded in the radix tree to avoid races with
btree_releasepage. Let's exploit this fact and simplify the page
unlocking sequence by unlocking the pages in-order and removing the
redundant PageChecked flag setting/clearing.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Remove the remaining code that misused the page cache pages during
device replace and could cause data corruption for compressed nodatasum
extents. Such files do not normally exist but there's a bug that allows
this combination and the corruption was exposed by device replace fixup
code.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There are many places that open code the duplicity factor of the block
group profiles, create a common helper. This can be easily extended for
more copies.
Signed-off-by: David Sterba <dsterba@suse.com>
We have assigned the %fs_info->fs_devices in %fs_devices as its not
modified just use it for the mutex_lock().
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The fs_info can be fetched from the transaction handle directly.
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It can be fetched from the transaction handle. In addition, remove the
WARN_ON(trans == NULL) because it's not possible to hit this condition.
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Rename btrfs_parse_early_options() to btrfs_parse_device_options(). As
btrfs_parse_early_options() parses the -o device options and scan the
device provided. So this rename specifies its action. Also the function
name is in line with btrfs_parse_subvol_options().
No functional changes.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since commit 0b246afa62 ("btrfs: root->fs_info cleanup, add fs_info
convenience variables"), the srcroot is no longer used to get
fs_info::nodesize. In fact, it can be dropped after commit 707e8a0715
("btrfs: use nodesize everywhere, kill leafsize").
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Introduce a small helper, btrfs_mark_bg_unused(), to acquire locks and
add a block group to unused_bgs list.
No functional modification, and only 3 callers are involved.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
do_chunk_alloc implements logic to detect whether there is currently
pending chunk allocation (by means of space_info->chunk_alloc being
set) and if so it loops around to the 'again' label. Additionally,
based on the state of the space_info (e.g. whether it's full or not)
and the return value of should_alloc_chunk() it decides whether this
is a "hard" error (ENOSPC) or we can just return 0.
This patch refactors all of this:
1. Put order to the scattered ifs handling the various cases in an
easy-to-read if {} else if{} branches. This makes clear the various
cases we are interested in handling.
2. Call should_alloc_chunk only once and use the result in the
if/else if constructs. All of this is done under space_info->lock, so
even before multiple calls of should_alloc_chunk were unnecessary.
3. Rewrite the "do {} while()" loop currently implemented via label
into an explicit loop construct.
4. Move the mutex locking for the case where the caller is the one doing
the allocation. For the case where the caller needs to wait a concurrent
allocation, introduce a pair of mutex_lock/mutex_unlock to act as a
barrier and reword the comment.
5. Switch local vars to bool type where pertinent.
All in all this shouldn't introduce any functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In commit b150a4f10d ("Btrfs: use a percpu to keep track of possibly
pinned bytes") we use total_bytes_pinned to track how many bytes we are
going to free in this transaction. When we are close to ENOSPC, we check it
and know if we can make the allocation by commit the current transaction.
For every data/metadata extent we are going to free, we add
total_bytes_pinned in btrfs_free_extent() and btrfs_free_tree_block(), and
release it in unpin_extent_range() when we finish the transaction. So this
is a variable we frequently update but rarely read - just the suitable
use of percpu_counter. But in previous commit we update total_bytes_pinned
by default 32 batch size, making every update essentially a spin lock
protected update. Since every spin lock/unlock operation involves syncing
a globally used variable and some kind of barrier in a SMP system, this is
more expensive than using total_bytes_pinned as a simple atomic64_t.
So fix this by using a customized batch size. Since we only read
total_bytes_pinned when we are close to ENOSPC and fail to allocate new
chunk, we can use a really large batch size and have nearly no penalty
in most cases.
[Test]
We tested the patch on a 4-cores x86 machine:
1. fallocate a 16GiB size test file
2. take snapshot (so all following writes will be COW)
3. run a 180 sec, 4 jobs, 4K random write fio on test file
We also added a temporary lockdep class on percpu_counter's spin lock
used by total_bytes_pinned to track it by lock_stat.
[Results]
unpatched:
lock_stat version 0.4
-----------------------------------------------------------------------
class name con-bounces contentions
waittime-min waittime-max waittime-total waittime-avg acq-bounces
acquisitions holdtime-min holdtime-max holdtime-total holdtime-avg
total_bytes_pinned_percpu: 82 82
0.21 0.61 29.46 0.36 298340
635973 0.09 11.01 173476.25 0.27
patched:
lock_stat version 0.4
-----------------------------------------------------------------------
class name con-bounces contentions
waittime-min waittime-max waittime-total waittime-avg acq-bounces
acquisitions holdtime-min holdtime-max holdtime-total holdtime-avg
total_bytes_pinned_percpu: 1 1
0.62 0.62 0.62 0.62 13601
31542 0.14 9.61 11016.90 0.35
[Analysis]
Since the spin lock only protects a single in-memory variable, the
contentions (number of lock acquisitions that had to wait) in both
unpatched and patched version are low. But when we see acquisitions and
acq-bounces, we get much lower counts in patched version. Here the most
important metric is acq-bounces. It means how many times the lock gets
transferred between different cpus, so the patch can really reduce
cacheline bouncing of spin lock (also the global counter of percpu_counter)
in a SMP system.
Fixes: b150a4f10d ("Btrfs: use a percpu to keep track of possibly pinned bytes")
Signed-off-by: Ethan Lien <ethanlien@synology.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We use customized, nodesize batch value to update dirty_metadata_bytes.
We should also use batch version of compare function or we will easily
goto fast path and get false result from percpu_counter_compare().
Fixes: e2d845211e ("Btrfs: use percpu counter for dirty metadata count")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Ethan Lien <ethanlien@synology.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Return device pointer (with the IS_ERR semantics) from
btrfs_scan_one_device so we don't have to return in through pointer.
And since btrfs_fs_devices can be obtained from btrfs_device, return that.
Signed-off-by: Gu Jinxiang <gujx@cn.fujitsu.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ fixed conflics after recent changes to btrfs_scan_one_device ]
Signed-off-by: David Sterba <dsterba@suse.com>
fs_devices is always passed to btrfs_scan_one_device which overrides it.
In the call stack below fs_devices is passed to btrfs_scan_one_device
from btrfs_mount_root. In btrfs_mount_root the output fs_devices of
this call stack is not used.
btrfs_mount_root
btrfs_parse_early_options
btrfs_scan_one_device
So, it is not necessary to pass fs_devices from btrfs_mount_root, using
a local variable in btrfs_parse_early_options is enough.
Signed-off-by: Gu Jinxiang <gujx@cn.fujitsu.com>
Reviewed-by: Anand Jain <Anand.Jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Technically this extends the critical section covered by uuid_mutex to:
- parse early mount options -- here we can call device scan on paths
that can be passed as 'device=/dev/...'
- scan the device passed to mount
- open the devices related to the fs_devices -- this increases
fs_devices::opened
The race can happen when mount calls one of the scans and there's
another one called eg. by mkfs or 'btrfs dev scan':
Mount Scan
----- ----
scan_one_device (dev1, fsid1)
scan_one_device (dev2, fsid1)
add the device
free stale devices
fsid1 fs_devices::opened == 0
find fsid1:dev1
free fsid1:dev1
if it's the last one,
free fs_devices of fsid1
too
open_devices (dev1, fsid1)
dev1 not found
When fixed, the uuid mutex will make sure that mount will increase
fs_devices::opened and this will not be touched by the racing scan
ioctl.
Reported-and-tested-by: syzbot+909a5177749d7990ffa4@syzkaller.appspotmail.com
Reported-and-tested-by: syzbot+ceb2606025ec1cc3479c@syzkaller.appspotmail.com
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In preparation to take a big lock, move resource initialization before
the critical section. It's not obvious from the diff, the desired order
is:
- initialize mount security options
- allocate temporary fs_info
- allocate superblock buffers
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Prepartory work to fix race between mount and device scan.
btrfs_parse_early_options calls the device scan from mount and we'll
need to let mount completely manage the critical section.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Prepartory work to fix race between mount and device scan.
The callers will have to manage the critical section, eg. mount wants to
scan and then call btrfs_open_devices without the ioctl scan walking in
and modifying the fs devices in the meantime.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Prepartory work to fix race between mount and device scan.
The callers will have to manage the critical section, eg. mount wants to
scan and then call btrfs_open_devices without the ioctl scan walking in
and modifying the fs devices in the meantime.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_free_stale_devices() finds a stale (not opened) device matching
path in the fs_uuid list. We are already under uuid_mutex so when we
check for each fs_devices, hold the device_list_mutex too.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Over the years we named %fs_devices and %devices to represent the
struct btrfs_fs_devices and the struct btrfs_device. So follow the same
scheme here too. No functional changes.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Make sure the device_list_lock is held the whole time:
* when the device is being looked up
* new device is initialized and put to the list
* the list counters are updated (fs_devices::opened, fs_devices::total_devices)
Signed-off-by: Anand Jain <anand.jain@oracle.com>
[ update changelog ]
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_free_stale_devices() looks for device path reused for another
filesystem, and deletes the older fs_devices::device entry.
In preparation to handle locking in device_list_add, move
btrfs_free_stale_devices outside as these two functions serve a
different purpose.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since commit 88c14590cd ("btrfs: use RCU in btrfs_show_devname for
device list traversal") btrfs_show_devname no longer takes
device_list_mutex. As such the deadlock that 0ccd05285e ("btrfs: fix a
possible umount deadlock") aimed to fix no longer exists, we can free
the devices immediatelly and remove the code that does the pending work.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
[ update changelog ]
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function is not used since the alloc_start parameter has been
obsoleted in commit 0d0c71b317 ("btrfs: obsolete and remove
mount option alloc_start").
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since parameter flags is no more used since commit d740760656 ("btrfs:
split parse_early_options() in two"), remove it.
Signed-off-by: Gu Jinxiang <gujx@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In case of deleting the seed device the %cur_devices (seed) and the
%fs_devices (parent) are different. Now, as the parent
fs_devices::total_devices also maintains the total number of devices
including the seed device, so decrement its in-memory value for the
successful seed delete. We are already updating its corresponding
on-disk btrfs_super_block::number_devices value.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit 5d23515be6 ("btrfs: Move qgroup rescan on quota enable to
btrfs_quota_enable") not only resulted in an easier to follow code but
it also introduced a subtle bug. It changed the timing when the initial
transaction rescan was happening:
- before the commit: it would happen after transaction commit had occured
- after the commit: it might happen before the transaction was committed
This results in failure to correctly rescan the quota since there could
be data which is still not committed on disk.
This patch aims to fix this by moving the transaction creation/commit
inside btrfs_quota_enable, which allows to schedule the quota commit
after the transaction has been committed.
Fixes: 5d23515be6 ("btrfs: Move qgroup rescan on quota enable to btrfs_quota_enable")
Reported-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
Link: https://marc.info/?l=linux-btrfs&m=152999289017582
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add fall-back code to catch failure of full_stripe_write. Proper error
handling from inside run_plug would need more code restructuring as it's
called at arbitrary points by io scheduler.
Signed-off-by: David Sterba <dsterba@suse.com>
Add helper that schedules a given function to run on the rmw workqueue.
This will replace several standalone helpers.
Signed-off-by: David Sterba <dsterba@suse.com>
The loops iterating eb pages use unsigned long, that's an overkill as
we know that there are at most 16 pages (64k / 4k), and 4 by default
(with nodesize 16k).
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Almost all callers pass the start and len as 2 arguments but this is not
necessary, all the information is provided by the eb. By reordering the
calls to num_extent_pages, we don't need the local variables with
start/len.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Functions that get btrfs inode can simply reach the fs_info by
dereferencing the root and this looks a bit more straightforward
compared to the btrfs_sb(...) indirection.
If the transaction handle is available and not NULL it's used instead.
Signed-off-by: David Sterba <dsterba@suse.com>
There are several places when the btrfs inode is converted to the
generic inode, back to btrfs and then passed to btrfs_ino. We can remove
the extra back and forth conversions.
Signed-off-by: David Sterba <dsterba@suse.com>
io_ctl_set_generation() assumes that the generation number shares
the same page with inline CRCs. Let's make sure this is always true.
Signed-off-by: Zhihui Zhang <zzhsuny@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is only usage of the declared devices variable, instead use its
value directly.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There are many instances of the %fs_info->fs_devices pointer
dereferences, use a temporary variable instead.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Invalid reloc tree can cause kernel NULL pointer dereference when btrfs
does some cleanup of the reloc roots.
It turns out that fs_info::reloc_ctl can be NULL in
btrfs_recover_relocation() as we allocate relocation control after all
reloc roots have been verified.
So when we hit: note, we haven't called set_reloc_control() thus
fs_info::reloc_ctl is still NULL.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=199833
Reported-by: Xu Wen <wen.xu@gatech.edu>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Tested-by: Gu Jinxiang <gujx@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
A crafted image has empty root tree block, which will later cause NULL
pointer dereference.
The following trees should never be empty:
1) Tree root
Must contain at least root items for extent tree, device tree and fs
tree
2) Chunk tree
Or we can't even bootstrap as it contains the mapping.
3) Fs tree
At least inode item for top level inode (.).
4) Device tree
Dev extents for chunks
5) Extent tree
Must have corresponding extent for each chunk.
If any of them is empty, we are sure the fs is corrupted and no need to
mount it.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=199847
Reported-by: Xu Wen <wen.xu@gatech.edu>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Tested-by: Gu Jinxiang <gujx@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
A crafted image with invalid block group items could make free space cache
code to cause panic.
We could detect such invalid block group item by checking:
1) Item size
Known fixed value.
2) Block group size (key.offset)
We have an upper limit on block group item (10G)
3) Chunk objectid
Known fixed value.
4) Type
Only 4 valid type values, DATA, METADATA, SYSTEM and DATA|METADATA.
No more than 1 bit set for profile type.
5) Used space
No more than the block group size.
This should allow btrfs to detect and refuse to mount the crafted image.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=199849
Reported-by: Xu Wen <wen.xu@gatech.edu>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Gu Jinxiang <gujx@cn.fujitsu.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Gu Jinxiang <gujx@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The v0 extent type checks are the right case for the unlikely
annotations as we don't expect to ever see them, so let's give the
compiler some hint.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Following the removal of the v0 handling code let's be courteous and
print an error message when such extents are handled. In the cases
where we have a transaction just abort it, otherwise just call
btrfs_handle_fs_error. Both cases result in the FS being re-mounted RO.
In case the error handling would be too intrusive, leave the BUG_ON in
place, like extent_data_ref_count, other proper handling would catch
that earlier.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The v0 compat code was introduced in commit 5d4f98a28c
("Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)") 9
years ago, which was merged in 2.6.31. This means that the code is
there to support filesystems which are _VERY_ old and if you are using
btrfs on such an old kernel, you have much bigger problems. This coupled
with the fact that no one is likely testing/maintining this code likely
means it has bugs lurking. All things considered I think 43 kernel
releases later it's high time this remnant of the past got removed.
This patch removes all code wrapped in #ifdefs but leaves the BUG_ONs in case
we have a v0 with no support intact as a sort of safety-net.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It's only coding style fix not functinal change. When if/else has only
one statement then the braces are not needed.
Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It's not good to override the error code when failing from
btrfs_getxattr() in btrfs_get_acl() because it hides the real reason of
the failure.
Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is no chance to get into -ERANGE error condition because we first
call btrfs_getxattr to get the length of the attribute, then we do a
subsequent call with the size from the first call. Between the 2 calls
the size shouldn't change. So remove the unnecessary -ERANGE error
check.
Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In btrfs_get_acl() the first call of btr_getxattr() is for getting the
length of attribute, the value buffer is never used in this case. So
it's better to replace empty string with NULL.
Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The caller of btrfs_get_acl() checks error condition so there is no
impact from this change. In practice there is no chance to get into
default case of switch statement because VFS has already checked the
type.
Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If type of extent_inline_ref found is not expected, filesystem may have
been corrupted, should return EUCLEAN instead of EINVAL.
Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
struct kiocb carries the ki_pos, so there is no need to pass it as
a separate function parameter.
generic_file_direct_write() increments ki_pos, so we now assign pos
after the function.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
[ rename to btrfs_buffered_write ]
Signed-off-by: David Sterba <dsterba@suse.com>
For easier debugging, print eb->start if level is invalid. Also make
clear if bytenr found is not expected.
Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently the function uses 2 goto labels to properly handle allocation
failures. This could be simplified by simply re-arranging the code so
that allocations are the in the beginning of the function. This allows
to use simple return statements. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Su Yue <suy.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
Under certain KVM load and LTP tests, it is possible to hit the
following calltrace if quota is enabled:
BTRFS critical (device vda2): unable to find logical 8820195328 length 4096
BTRFS critical (device vda2): unable to find logical 8820195328 length 4096
WARNING: CPU: 0 PID: 49 at ../block/blk-core.c:172 blk_status_to_errno+0x1a/0x30
CPU: 0 PID: 49 Comm: kworker/u2:1 Not tainted 4.12.14-15-default #1 SLE15 (unreleased)
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs]
task: ffff9f827b340bc0 task.stack: ffffb4f8c0304000
RIP: 0010:blk_status_to_errno+0x1a/0x30
Call Trace:
submit_extent_page+0x191/0x270 [btrfs]
? btrfs_create_repair_bio+0x130/0x130 [btrfs]
__do_readpage+0x2d2/0x810 [btrfs]
? btrfs_create_repair_bio+0x130/0x130 [btrfs]
? run_one_async_done+0xc0/0xc0 [btrfs]
__extent_read_full_page+0xe7/0x100 [btrfs]
? run_one_async_done+0xc0/0xc0 [btrfs]
read_extent_buffer_pages+0x1ab/0x2d0 [btrfs]
? run_one_async_done+0xc0/0xc0 [btrfs]
btree_read_extent_buffer_pages+0x94/0xf0 [btrfs]
read_tree_block+0x31/0x60 [btrfs]
read_block_for_search.isra.35+0xf0/0x2e0 [btrfs]
btrfs_search_slot+0x46b/0xa00 [btrfs]
? kmem_cache_alloc+0x1a8/0x510
? btrfs_get_token_32+0x5b/0x120 [btrfs]
find_parent_nodes+0x11d/0xeb0 [btrfs]
? leaf_space_used+0xb8/0xd0 [btrfs]
? btrfs_leaf_free_space+0x49/0x90 [btrfs]
? btrfs_find_all_roots_safe+0x93/0x100 [btrfs]
btrfs_find_all_roots_safe+0x93/0x100 [btrfs]
btrfs_find_all_roots+0x45/0x60 [btrfs]
btrfs_qgroup_trace_extent_post+0x20/0x40 [btrfs]
btrfs_add_delayed_data_ref+0x1a3/0x1d0 [btrfs]
btrfs_alloc_reserved_file_extent+0x38/0x40 [btrfs]
insert_reserved_file_extent.constprop.71+0x289/0x2e0 [btrfs]
btrfs_finish_ordered_io+0x2f4/0x7f0 [btrfs]
? pick_next_task_fair+0x2cd/0x530
? __switch_to+0x92/0x4b0
btrfs_worker_helper+0x81/0x300 [btrfs]
process_one_work+0x1da/0x3f0
worker_thread+0x2b/0x3f0
? process_one_work+0x3f0/0x3f0
kthread+0x11a/0x130
? kthread_create_on_node+0x40/0x40
ret_from_fork+0x35/0x40
BTRFS critical (device vda2): unable to find logical 8820195328 length 16384
BTRFS: error (device vda2) in btrfs_finish_ordered_io:3023: errno=-5 IO failure
BTRFS info (device vda2): forced readonly
BTRFS error (device vda2): pending csums is 2887680
[CAUSE]
It's caused by race with block group auto removal:
- There is a meta block group X, which has only one tree block
The tree block belongs to fs tree 257.
- In current transaction, some operation modified fs tree 257
The tree block gets COWed, so the block group X is empty, and marked
as unused, queued to be deleted.
- Some workload (like fsync) wakes up cleaner_kthread()
Which will call btrfs_delete_unused_bgs() to remove unused block
groups.
So block group X along its chunk map get removed.
- Some delalloc work finished for fs tree 257
Quota needs to get the original reference of the extent, which will
read tree blocks of commit root of 257.
Then since the chunk map gets removed, the above warning gets
triggered.
[FIX]
Just let btrfs_delete_unused_bgs() skip block group which still has
pinned bytes.
However there is a minor side effect: currently we only queue empty
blocks at update_block_group(), and such empty block group with pinned
bytes won't go through update_block_group() again, such block group
won't be removed, until it gets new extent allocated and removed.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
With gcc 4.1.2:
fs/btrfs/inode-map.c: In function ‘btrfs_unpin_free_ino’:
fs/btrfs/inode-map.c:241: warning: ‘count’ may be used uninitialized in this function
While this warning is a false-positive, it can easily be killed by
refactoring the code.
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
While the regular inode timestamps all use timespec64 now, the i_otime
field is btrfs specific and still needs to be converted to correctly
represent times beyond 2038.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The transaction times were changed to ktime_get_real_seconds to avoid
the y2038 overflow, but they still have a minor problem when they go
backwards or jump due to settimeofday() or leap seconds.
This changes the transaction handling to instead use ktime_get_seconds(),
which returns a CLOCK_MONOTONIC timestamp that has neither of those
problems.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We used to call btrfs_file_extent_inline_len() to get the uncompressed
data size of an inlined extent.
However this function is hiding evil, for compressed extent, it has no
choice but to directly read out ram_bytes from btrfs_file_extent_item.
While for uncompressed extent, it uses item size to calculate the real
data size, and ignoring ram_bytes completely.
In fact, for corrupted ram_bytes, due to above behavior kernel
btrfs_print_leaf() can't even print correct ram_bytes to expose the bug.
Since we have the tree-checker to verify all EXTENT_DATA, such mismatch
can be detected pretty easily, thus we can trust ram_bytes without the
evil btrfs_file_extent_inline_len().
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When a new extent buffer is allocated there are a few mandatory fields
which need to be set in order for the buffer to be sane: level,
generation, bytenr, backref_rev, owner and FSID/UUID. Currently this
is open coded in the callers of btrfs_alloc_tree_block, meaning it's
fairly high in the abstraction hierarchy of operations. This patch
solves this by simply moving this init code in btrfs_init_new_buffer,
since this is the function which initializes a newly allocated
extent buffer. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit f8f84b2dfd ("btrfs: index check-integrity state hash by a dev_t")
changed how btrfsic indexes device state.
Now we need to access device->bdev->bd_dev, while for degraded mount
it's completely possible to have device->bdev as NULL, thus it will
trigger a NULL pointer dereference at mount time.
Fix it by checking if the device is degraded before accessing
device->bdev->bd_dev.
There are a lot of other places accessing device->bdev->bd_dev, however
the other call sites have either checked device->bdev, or the
device->bdev is passed from btrfsic_map_block(), so it won't cause harm.
Fixes: f8f84b2dfd ("btrfs: index check-integrity state hash by a dev_t")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It can be referenced from the passed transaction handle.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It can be referenced from the passed bg cache.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It can be referenced from trans since the function is always called
within a valid transaction.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It can be referenced directly from the transaction handle since it's
always valid.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It can be referenced from the passed transaction handle, since it's
always valid.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It can be referenced from the passed transaction handle, since it's
always valid.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It can be referenced from the passed block group.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It can be referenced from the passed block group.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It can be referenced from trans since the function is always called
within a transaction.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It can be referenced from trans since the function is always called
within a transaction.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function is always called with a valid transaction handle from
where fs_info can be referenced. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It can always be referneced from the passed transaction handle since
it's always valid. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
fs_info can be refenreced from the transaction handle, since it's always
valid. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The argument is no longer used so remove it.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function is always called with a valid transaction handle so
fs_info can be referenced from there. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function is always called with a valid transaction from where
fs_info can be referenced. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function already takes a transaction which holds a reference to
the fs_info struct. Use that reference and remove the extra arg. No
functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
fs_info can be referenced from the transaction handle, which is always
valid. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function is always called with a valid transaction handle so we
can reference the fs_info from there. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function is always called with a valid transaction handle from
where we can reference fs_info. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function is always called with a valid transaction handle from
where we can reference the fs_info. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function is always called with a valid transaction handle from
where fs_info can be referenced. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function is always called with a valid transaction handle from
where fs_info can be referenced. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This argument is unused. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function is always called with a valid transaction handle from
where fs_info can be referenced. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function is always called with a valid transaction handle from
where the fs_info can be referenced. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function always uses the leaf's extent_buffer which already
contains a reference to the fs_info. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function is always called with a valid transaction handle from
where the fs_info can be referenced. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This argument is unused. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function is always called with a valid transaction from where the
fs_info can be referenced. No functional change.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function is always called with a valid transaction handle from
where fs_info can be referenced. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function is always called with a valid transaction handle from
where fs_info can be referenced. So remove the redundant argument.
No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function is always called with a valid transaction so there is no
need to duplicate the fs_info, we can reference it directly from the
trans handle. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The C programming language does not allow to use preprocessor statements
inside macro arguments (pr_info() is defined as a macro). Hence rework
the pr_info() statement in btrfs_print_mod_info() such that it becomes
compliant. This patch allows tools like sparse to analyze the BTRFS
source code.
Fixes: 62e855771d ("btrfs: convert printk(KERN_* to use pr_* calls")
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This patch avoids that the compiler complains that a fall-through
annotation is missing when building with W=1.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This patch avoids that building the BTRFS source code with smatch
triggers complaints about inconsistent indenting.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently this function takes the root as an argument only to get the
log_root from it. Simplify this by directly passing the log root from
the caller. Also eliminate the fs_info local variable, since it's used
only once, so directly reference it from the transaction handle.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The logic to check if the inode is already in the log can now be
simplified since we always wait for the ordered extents to complete
before deciding whether the inode needs to be logged. The big comment
about it can go away too.
CC: Filipe Manana <fdmanana@suse.com>
Suggested-by: Filipe Manana <fdmanana@suse.com>
[ code and changelog copied from mail discussion ]
Signed-off-by: David Sterba <dsterba@suse.com>
This is no longer used anywhere, remove all of it.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We no longer use this list we've passed around so remove it everywhere.
Also remove the extra checks for ordered/filemap errors as this is
handled higher up now that we're waiting on ordered_extents before
getting to the tree log code.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since we are waiting on all ordered extents at the start of the fsync()
path we don't need to wait on any logged ordered extents, and we don't
need to look up the checksums on the ordered extents as they will
already be on disk prior to getting here. Rework this so we're only
looking up and copying the on-disk checksums for the extent range we
care about.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's a priority inversion that exists currently with btrfs fsync. In
some cases we will collect outstanding ordered extents onto a list and
only wait on them at the very last second. However this "very last
second" falls inside of a transaction handle, so if we are in a lower
priority cgroup we can end up holding the transaction open for longer
than needed, so if a high priority cgroup is also trying to fsync()
it'll see latency.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The comment wrongfully states that the owner parameter is the level of
the parent block. In fact owner is the level of the current block and
by adding 1 to it we can eventually get to the parent/root.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Here is a doc-only patch which tires to deobfuscate the terra-incognita
that arguments for delayed refs are.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since commit ac0b4145d6 ("btrfs: scrub: Don't use inode pages
for device replace") the function is not used and we can remove all
functions down the call chain.
There was an optimization that reused inode pages to speed up device
replace, but broke when there was nodatasum and compressed page. The
potential performance gain is small so we don't loose much by removing
it and using scrub_pages same as the other pages.
Signed-off-by: Qu Wenruo <wqu@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
The get_seconds() function is deprecated as it truncates the timestamp
to 32 bits. Change it to or ktime_get_real_seconds().
Signed-off-by: Allen Pais <allen.lkml@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
If we get a keyed wakeup for a aio poll waitqueue and wake can acquire the
ctx_lock without spinning we can just complete the iocb straight from the
wakeup callback to avoid a context switch.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Avi Kivity <avi@scylladb.com>
Simple one-shot poll through the io_submit() interface. To poll for
a file descriptor the application should submit an iocb of type
IOCB_CMD_POLL. It will poll the fd for the events specified in the
the first 32 bits of the aio_buf field of the iocb.
Unlike poll or epoll without EPOLLONESHOT this interface always works
in one shot mode, that is once the iocb is completed, it will have to be
resubmitted.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Avi Kivity <avi@scylladb.com>
This is needed to prevent races caused by the way the ->poll API works.
To avoid introducing overhead for other users of the iocbs we initialize
it to zero and only do refcount operations if it is non-zero in the
completion path.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Avi Kivity <avi@scylladb.com>
-----BEGIN PGP SIGNATURE-----
iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAltU8z0eHHRvcnZhbGRz
QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiG5X8H/2fJr7m3k242+t76
sitwvx1eoPqTgryW59dRKm9IuXAGA+AjauvHzaz1QxomeQa50JghGWefD0eiJfkA
1AphQ/24EOiAbbVk084dAI/C2p122dE4D5Fy7CrfLnuouyrbFaZI5STbnrRct7sR
9deeYW0GDHO1Uenp4WDCj0baaqJqaevZ+7GG09DnWpya2nQtSkGBjqn6GpYmrfOU
mqFuxAX8mEOW6cwK16y/vYtnVjuuMAiZ63/OJ8AQ6d6ArGLwAsdn7f8Fn4I4tEr2
L0d3CRLUyegms4++Dmlu05k64buQu46WlPhjCZc5/Ts4kjrNxBuHejj2/jeSnUSt
vJJlibI=
=42a5
-----END PGP SIGNATURE-----
Merge tag 'v4.18-rc6' into for-4.19/block2
Pull in 4.18-rc6 to get the NVMe core AEN change to avoid a
merge conflict down the line.
Signed-of-by: Jens Axboe <axboe@kernel.dk>
Return statements in functions returning bool should use true or false
instead of an integer value.
This issue was detected with the help of Coccinelle.
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
->lookup() methods can (and should) use d_splice_alias() instead of
d_add(). Even if they are not going to be hit by open_by_handle(),
code does get copied around; besides, d_splice_alias() has better
calling conventions for use in ->lookup(), so the code gets simpler.
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
->lookup() methods can (and should) use d_splice_alias() instead of
d_add(). Even if they are not going to be hit by open_by_handle(),
code does get copied around...
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
"overflow" inline inode data.
-----BEGIN PGP SIGNATURE-----
Comment: Kees Cook <kees@outflux.net>
iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAltlv70WHGtlZXNjb29r
QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJj3eD/9+goWbs9U62tlO2cIqT5lgTFX9
3vvVpqNJ3atw/fU6SZoa0Z2nRa9TIpZEfEljgARGhCyd2p2MplLJalWo6bq/gUUA
5aWQbVxhyVXUp8kh8m3OnZsAZz658Y5geLMk8vakXdyQ//PF43wO0cZyOFYdG3ec
sYuWA318UKaIsxqB9tT/K8YBRjjBgJ8wjgtpoSAr+FUhgg9Qqp3NL4fjr6SOEQrv
XWXLVLLhyUo8kQJ29E/VyzIfLysgiA67O4ClW+DEyD6rr/ZK9XOeG6yv3vwLGSHI
06/4BXMJce23iGIYf57Jz7b5tfAaZntdzNqUiTW6up0TuvG09auwjv/hKkwfjQv2
fNf0TVnYOCN4ZWbCm4FTEU5q31u+pZDHiRxOTJG9EbuBVEsFiV5X9B9VTLoA+dWK
SqWmE2X9YdONt9q6K8TGpxxrdQnQXGKOOtEY23KoF04XGYQtutIfj7cdFCDDX/eC
AjcOIfV3u8dHWjc/wKIYS5XRUzbkdpEeNOaiCQ3RN79JN2UrA0/w7lAxZTICmYCs
HtMtaFTER5weKQGzTzg0SP3M95qKPdWhlIq5lspdhGedonZAKBNfGahMltTH2UoY
vIb1qGT8uz8tQSzGsu0uvrR92ZIJZ0qUVjB/nhE9HxTz26xqtDDlLXpdIXTGrnU4
hM+Omud//MNgNbsmrg==
=pwq8
-----END PGP SIGNATURE-----
Merge tag 'usercopy-fix-v4.18-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull usercopy whitelisting fix from Kees Cook:
"Bart Massey discovered that the usercopy whitelist for JFS was
incomplete: the inline inode data may intentionally "overflow" into
the neighboring "extended area", so the size of the whitelist needed
to be raised to include the neighboring field"
* tag 'usercopy-fix-v4.18-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
jfs: Fix usercopy whitelist for inline inode data
- Fix incorrect shifting in the iomap bmap functions.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAltj7lcACgkQ+H93GTRK
tOutzw//U9yOrUhuOTFrkmA2x+CB1ArncFZysUTXDFUsLRrMp0dwNaRV0WafyypO
tCr4nvqgwg2VTFf21zWblOSAajstgJzH+x3FdxTsU5Gktd/BH9gCHSDV0lORnFhp
jbqNOsBRPJjNIBzaIIAnyDGMuOF/MeWpGsbGF4eDcsgj58lr098rir7grEVgCkKM
+eCMtqJwUXUZ2bo/grBWvZXnLms+8LMPYeRQMJSnw59DVunwxRdSbWumJwkPt9DD
mz3Upa/qFJCZw2n9kluU1b/tnrpxkNWYuSjzp9iu0cMdo52HF+yDNHriUPCHa4PB
3KfyGrhOvg+iC2pUAJ5mI3Dpv+NNAk7j/+4mOVtlUYIf0mi5gszIjNHbLiONKVH5
5x8G59wM/ae6a1WcHe7tJRma7r984G+JLTIxAQnWaWhFq5AsYMOplk4oq6X7Lmir
wAR7laDN0Xgl3WCFj6SXaKcnuzZFsE4A3SnZILtlgO/WxAujUyFEnICx1cO4Q9OY
64txWii6ora9kdBtalolxjfGHwyScbhx6FiBQdKIznxGgBQR89X8hzxghdp7KTIx
kxkC1hAM3KXXtokArjad2hgd8QG23jUshyBwKpfHnEwL75GiZQ3qtPdDm/oJlVWV
ItnRSt+tGIT/fsT5szNPmLKgtIW4kQHflVuih0KR4IsGPMmZ8dU=
=3omy
-----END PGP SIGNATURE-----
Merge tag 'xfs-4.18-fixes-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs bugfix from Darrick Wong:
"One more patch for 4.18 to fix a coding error in the iomap_bmap()
function introduced in -rc1: fix incorrect shifting"
* tag 'xfs-4.18-fixes-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
fs: fix iomap_bmap position calculation
The err is not used after initalization. So just remove the variable.
Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Bart Massey reported what turned out to be a usercopy whitelist false
positive in JFS when symlink contents exceeded 128 bytes. The inline
inode data (i_inline) is actually designed to overflow into the "extended
area" following it (i_inline_ea) when needed. So the whitelist needed to
be expanded to include both i_inline and i_inline_ea (the whole size
of which is calculated internally using IDATASIZE, 256, instead of
sizeof(i_inline), 128).
$ cd /mnt/jfs
$ touch $(perl -e 'print "B" x 250')
$ ln -s B* b
$ ls -l >/dev/null
[ 249.436410] Bad or missing usercopy whitelist? Kernel memory exposure attempt detected from SLUB object 'jfs_ip' (offset 616, size 250)!
Reported-by: Bart Massey <bart.massey@gmail.com>
Fixes: 8d2704d382 ("jfs: Define usercopy region in jfs_ip slab cache")
Cc: Dave Kleikamp <shaggy@kernel.org>
Cc: jfs-discussion@lists.sourceforge.net
Cc: stable@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
This patch added the 6th compression algorithm support for pstore: zstd.
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
We hit that when inumber allocation has failed. In that case
the in-core inode is not hashed and since its ->i_nlink is 1
the only place where jfs checks is_bad_inode() won't be reached.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
We never look them up in there; inode_fake_hash() will make them appear
hashed for mark_inode_dirty() purposes. And don't leave them around
until memory pressure kicks them out - we never look them up again.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
iput() ends up calling ->evict() on new inode, which is not yet initialized
by owning fs. So use destroy_inode() instead.
Add to sb->s_inodes list only if inode is not in I_CREATING state (meaning
that it wasn't allocated with new_inode(), which already does the
insertion).
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: 80ea09a002 ("vfs: factor out inode_insert5()")
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
We don't want open-by-handle picking half-set-up in-core
struct inode from e.g. mkdir() having failed halfway through.
In other words, we don't want such inodes returned by iget_locked()
on their way to extinction. However, we can't just have them
unhashed - otherwise open-by-handle immediately *after* that would've
ended up creating a new in-core inode over the on-disk one that
is in process of being freed right under us.
Solution: new flag (I_CREATING) set by insert_inode_locked() and
removed by unlock_new_inode() and a new primitive (discard_new_inode())
to be used by such halfway-through-setup failure exits instead of
unlock_new_inode() / iput() combinations. That primitive unlocks new
inode, but leaves I_CREATING in place.
iget_locked() treats finding an I_CREATING inode as failure
(-ESTALE, once we sort out the error propagation).
insert_inode_locked() treats the same as instant -EBUSY.
ilookup() treats those as icache miss.
[Fix by Dan Carpenter <dan.carpenter@oracle.com> folded in]
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Push iov_iter up from rxrpc_kernel_recv_data() to its caller to allow
non-contiguous iovs to be passed down, thereby permitting file reading to
be simplified in the AFS filesystem in a future patch.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix the comment in xfs_log_reserve to avoid confusing.
Signed-of-by: Huang Chong <huang.chong@zte.com.cn>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Skip the summary counter checks for secondary superblocks and inprogress
primary superblocks because mkfs has always written those out with
zeroed summary counters.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Function gfs2_ea_strlen is only called from ea_list_i, so inline it
there. Remove the duplicate switch statement and the creative use of
memcpy to set a null byte.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Andrew Price <anprice@redhat.com>
Reviewed-by: Bob Peterson <rpeterso@redhat.com>
Inside xfs_attr_shortform_list removes spaces at the beginnig of the line
and replaces with tabs.
Issue found by checkpatch.
ERROR: code indent should use tabs where possible
Signed-off-by: Thomas Bianchi <thomas.bianchi8@gmail.com>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
struct xfs_defer_ops has now been reduced to a single list_head. The
external dfops mechanism is unused and thus everywhere a (permanent)
transaction is accessible the associated dfops structure is as well.
Remove the xfs_defer_ops structure and fold the list_head into the
transaction. Also remove the last remnant of external dfops in
xfs_trans_dup().
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The AGFL fixup code conditionally defers block frees from the free
list based on whether the current transaction has an associated
xfs_defer_ops structure. Now that dfops is embedded in the
transaction and the internal dfops is used unconditionally, this
invariant is always true.
Remove the now dead logic to check for ->t_dfops in
xfs_alloc_fix_freelist() and unconditionally defer AGFL block frees.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The majority of remaining references to struct xfs_defer_ops in XFS
are associated with xfs_defer_add(). At this point, there are no
more external xfs_defer_ops users left. All instances of
xfs_defer_ops are embedded in the transaction, which means we can
safely pass the transaction down to the dfops add interface.
Update xfs_defer_add() to receive the transaction as a parameter.
Various subsystems implement wrappers to allocate and construct the
context specific data structures for the associated deferred
operation type. Update these to also carry the transaction down as
needed and clean up unused dfops parameters along the way.
This removes most of the remaining references to struct
xfs_defer_ops throughout the code and facilitates removal of the
structure.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
[darrick: fix unused variable warnings with ftrace disabled]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The xfs_defer_ops ->dop_pending list is used to track active
deferred operations once intents are logged. These items must be
aborted in the event of an error. The list is populated as intents
are logged and items are removed as they complete (or are aborted).
Now that xfs_defer_finish() cancels on error, there is no need to
ever access ->dop_pending outside of xfs_defer_finish(). The list is
only ever populated after xfs_defer_finish() begins and is either
completed or cancelled before it returns.
Remove ->dop_pending from xfs_defer_ops and replace it with a local
list in the xfs_defer_finish() path. Pass the local list to the
various helpers now that it is not accessible via dfops. Note that
we have to check for NULL in the abort case as the final tx roll
occurs outside of the scope of the new local list (once the dfops
has completed and thus drained the list).
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The current semantics of xfs_defer_finish() require the caller to
call xfs_defer_cancel() on error. This is slightly inconsistent with
transaction commit error handling where a failed commit cleans up
the transaction before returning.
More significantly, the only requirement for exposure of
->dop_pending outside of xfs_defer_finish() is so that
xfs_defer_cancel() can drain it on error. Since the only recourse of
xfs_defer_finish() errors is cancellation, mirror the transaction
logic and cancel remaining dfops before returning from
xfs_defer_finish() with an error.
Beside simplifying xfs_defer_finish() semantics, this ensures that
xfs_defer_finish() always returns with an empty ->dop_pending and
thus facilitates removal of the list from xfs_defer_ops.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The dfops code still passes around the xfs_defer_ops pointer
superfluously in a few places. Clean this up wherever the
transaction will suffice.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The dfops infrastructure ->finish_item() callback passes the
transaction and dfops as separate parameters. Since dfops is always
part of a transaction, the latter parameter is no longer necessary.
Remove it from the various callbacks.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Inodes that are held across deferred operations are explicitly
joined to the dfops structure to ensure appropriate relogging.
While inodes are currently joined explicitly, we can detect the
conditions that require relogging at dfops finish time by inspecting
the transaction item list for inodes with ili_lock_flags == 0.
Replace the xfs_defer_ijoin() infrastructure with such detection and
automatic relogging of held inodes. This eliminates the need for the
per-dfops inode list, replaced by an on-stack variant in
xfs_defer_trans_roll().
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Buffers that are held across deferred operations are explicitly
joined to the dfops structure to ensure appropriate relogging.
While buffers are currently joined explicitly, we can detect the
conditions that require relogging at dfops finish time by inspecting
the transaction item list for held buffers.
Replace the xfs_defer_bjoin() infrastructure with such detection and
automatic relogging of held buffers. This eliminates the need for
the per-dfops buffer list, replaced by an on-stack variant in
xfs_defer_trans_roll().
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Log items that require relogging during deferred operations
processing are explicitly joined to the associated dfops via the
xfs_defer_*join() helpers. These calls imply that the associated
object is "held" by the transaction such that when rolled, the item
can be immediately joined to a follow up transaction. For buffers,
this means the buffer remains locked and held after each roll. For
inodes, this means that the inode remains locked.
Failure to join a held item to the dfops structure means the
associated object pins the tail of the log while dfops processing
completes, because the item never relogs and is not unlocked or
released until deferred processing completes.
Currently, all buffers that are held in transactions (XFS_BLI_HOLD)
with deferred operations are explicitly joined to the dfops. This is
not the case for inodes, however, as various contexts defer
operations to transactions with held inodes without explicit joins
to the associated dfops (and thus not relogging).
While this is not a catastrophic problem, it is not ideal. Given
that we want to eventually relog such items automatically during
dfops processing, start by explicitly adding these missing
xfs_defer_ijoin() calls. A call is added everywhere an inode is
joined to a transaction without transferring lock ownership and
said transaction runs deferred operations.
All xfs_defer_ijoin() calls will eventually be replaced by automatic
dfops inode relogging. This patch essentially implements the
behavior change that would otherwise occur due to automatic inode
dfops relogging.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The dop_low field enables the low free space allocation mode when a
previous allocation has detected difficulty allocating blocks. It
has historically been part of the xfs_defer_ops structure, which
means if enabled, it remains enabled across a set of transactions
until the deferred operations have completed and the dfops is reset.
Now that the dfops is embedded in the transaction, we can save a bit
more space by using a transaction flag rather than a standalone
boolean. Drop the ->dop_low field and replace it with a transaction
flag that is set at the same points, carried across rolling
transactions and cleared on completion of deferred operations. This
essentially emulates the behavior of ->dop_low and so should not
change behavior.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
All callers pass ->t_dfops of the associated transactions. Refactor
the helpers to receive the transactions and facilitate further
cleanups between xfs_defer_ops and xfs_trans.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
With no more external dfops users, there is no need for an
xfs_defer_ops cancel wrapper.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Log intent recovery is the last user of an external (on-stack)
dfops. The pattern exists because the dfops is used to collect
additional deferred operations queued during the whole recovery
sequence. The dfops is finished with a new transaction after intent
recovery completes.
We already have a mechanism to create an empty, container-like
transaction to support the scrub infrastructure. We can reuse that
mechanism here to drop the final user of external dfops. This
facilitates folding dfops state (i.e., dop_low) into the
transaction, the elimination of now unused external dfops support
and also eliminates the only caller of __xfs_defer_cancel().
Replace the on-stack dfops with an empty transaction and pass it
around to the various helpers that queue and finish deferred
operations during intent recovery.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The current transaction allocation code conditionally initializes
the ->t_dfops indirection pointer. Transaction commit/cancel check
the validity of the pointer to determine whether to finish/cancel
the internal dfops.
This disallows the ability to use the internal dfops list as a
temporary container (via xfs_trans_alloc_empty()). Refactor
transaction allocation to always initialize ->t_dfops and check
permanent reservation state on transaction commit/cancel.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The fix in commit 0cbb4b4f4c ("userfaultfd: clear the
vma->vm_userfaultfd_ctx if UFFD_EVENT_FORK fails") cleared the
vma->vm_userfaultfd_ctx but kept userfaultfd flags in vma->vm_flags
that were copied from the parent process VMA.
As the result, there is an inconsistency between the values of
vma->vm_userfaultfd_ctx.ctx and vma->vm_flags which triggers BUG_ON
in userfaultfd_release().
Clearing the uffd flags from vma->vm_flags in case of UFFD_EVENT_FORK
failure resolves the issue.
Link: http://lkml.kernel.org/r/1532931975-25473-1-git-send-email-rppt@linux.vnet.ibm.com
Fixes: 0cbb4b4f4c ("userfaultfd: clear the vma->vm_userfaultfd_ctx if UFFD_EVENT_FORK fails")
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Reported-by: syzbot+121be635a7a35ddb7dcb@syzkaller.appspotmail.com
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Eric Biggers <ebiggers3@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The position calculation in iomap_bmap() shifts bno the wrong way,
so we don't progress properly and end up re-mapping block zero
over and over, yielding an unchanging physical block range as the
logical block advances:
# filefrag -Be file
ext: logical_offset: physical_offset: length: expected: flags:
0: 0.. 0: 21.. 21: 1: merged
1: 1.. 1: 21.. 21: 1: 22: merged
Discontinuity: Block 1 is at 21 (was 22)
2: 2.. 2: 21.. 21: 1: 22: merged
Discontinuity: Block 2 is at 21 (was 22)
3: 3.. 3: 21.. 21: 1: 22: merged
This breaks the FIBMAP interface for anyone using it (XFS), which
in turn breaks LILO, zipl, etc.
Bug-actually-spotted-by: Darrick J. Wong <darrick.wong@oracle.com>
Fixes: 89eb1906a9 ("iomap: add an iomap-based bmap implementation")
Cc: stable@vger.kernel.org
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Introduce these two functions and export them such that the next patch
can add calls to these functions from the SCSI core.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
If the offset is larger or equal to both real file size and
max file size, then return -EFBIG.
Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
If the range is larger than both real file size and limit of
max file size, then return -EFBIG.
Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
In order to not bother to VFS and other specific filesystems,
we decided to do offset validation inside ceph kernel client,
so just simply set sb->s_maxbytes to MAX_LFS_FILESIZE so that
it can successfully pass VFS check. We add new field max_file_size
in ceph_fs_client to store real file size limit and doing proper
check based on it.
Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
When a client authenticates with a service, an authorizer is sent with
a nonce to the service (ceph_x_authorize_[ab]) and the service responds
with a mutation of that nonce (ceph_x_authorize_reply). This lets the
client verify the service is who it says it is but it doesn't protect
against a replay: someone can trivially capture the exchange and reuse
the same authorizer to authenticate themselves.
Allow the service to reject an initial authorizer with a random
challenge (ceph_x_authorize_challenge). The client then has to respond
with an updated authorizer proving they are able to decrypt the
service's challenge and that the new authorizer was produced for this
specific connection instance.
The accepting side requires this challenge and response unconditionally
if the client side advertises they have CEPHX_V2 feature bit.
This addresses CVE-2018-1128.
Link: http://tracker.ceph.com/issues/24836
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
Use new return type vm_fault_t for page_mkwrite
and fault handler.
Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
The ceph_mds_request stamp still uses the deprecated timespec structure,
this converts it over as well.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
The request mtime field is used all over ceph, and is currently
represented as a 'timespec' structure in Linux. This changes it to
timespec64 to allow times beyond 2038, modifying all users at the
same time.
[ Remove now redundant ts variable in writepage_nounlock(). ]
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Since the vfs structures are all using timespec64, we can now
change the internal representation, using ceph_encode_timespec64 and
ceph_decode_timespec64.
In case of ceph_aux_inode however, we need to avoid doing a memcmp()
on uninitialized padding data, so the members of the i_mtime field get
copied individually into 64-bit integers.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
ceph_mdsc_create_request() is one of the last callers of the
deprecated current_kernel_time() as well as timespec_trunc().
This changes it to use the timespec64 based interfaces instead,
though we still need to convert the result until we are ready to
change over req->r_stamp.
The output of the two functions, ktime_get_coarse_real_ts64() and
current_kernel_time() is the same coarse-granular timestamp,
the only difference here is that ktime_get_coarse_real_ts64()
doesn't overflow in 2038.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
When file num exceeds quota limit, should call d_drop to drop
dentry from cache as well.
Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
When file num exceeds quota limit or fails from ceph_per_init_acls()
should call d_drop to drop dentry from cache as well.
Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
In order to return correct error code should replace variable ret
using err in error case.
Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
GCC8 prints following warning:
fs/ceph/mds_client.c:3683:2: warning: ‘strncpy’ output may be truncated
copying 64 bytes from a string of length 64 [-Wstringop-truncation]
[ Change to strscpy() while at it. ]
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
It's better to restore ctime as well in the case of restoring old mode
in ceph_set_acl().
Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
When the size of acl extended attribution is larger than pre-allocated
value buffer size, we will hit error '-ERANGE' and it's probabaly caused
by concurrent get/set acl from different clients. In this case, current
logic just sets acl to NULL so that we cannot get proper information but
the operation looks successful.
This patch adds retry logic for error -ERANGE and return -EIO if fail
from the retry. Additionally, print real errno when failing from
__ceph_getxattr().
Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
The BTF conflicts were simple overlapping changes.
The virtio_net conflict was an overlap of a fix of statistics counter,
happening alongisde a move over to a bonafide statistics structure
rather than counting value on the stack.
Signed-off-by: David S. Miller <davem@davemloft.net>
Previously in squashfs_readpage() when copying data into the page
cache, it used the length of the datablock read from the filesystem
(after decompression). However, if the filesystem has been corrupted
this data block may be short, which will leave pages unfilled.
The fix for this is to compute the expected number of bytes to copy
from the inode size, and use this to detect if the block is short.
Signed-off-by: Phillip Lougher <phillip@squashfs.org.uk>
Tested-by: Willy Tarreau <w@1wt.eu>
Cc: Анатолий Тросиненко <anatoly.trosinenko@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The squashfs fragment reading code doesn't actually verify that the
fragment is inside the fragment table. The end result _is_ verified to
be inside the image when actually reading the fragment data, but before
that is done, we may end up taking a page fault because the fragment
table itself might not even exist.
Another report from Anatoly and his endless squashfs image fuzzing.
Reported-by: Анатолий Тросиненко <anatoly.trosinenko@gmail.com>
Acked-by:: Phillip Lougher <phillip.lougher@gmail.com>,
Cc: Willy Tarreau <w@1wt.eu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
'ac->ac_g_ex.fe_len' is a user-controlled value which is used in the
derivation of 'ac->ac_2order'. 'ac->ac_2order', in turn, is used to
index arrays which makes it a potential spectre gadget. Fix this by
sanitizing the value assigned to 'ac->ac2_order'. This covers the
following accesses found with the help of smatch:
* fs/ext4/mballoc.c:1896 ext4_mb_simple_scan_group() warn: potential
spectre issue 'grp->bb_counters' [w] (local cap)
* fs/ext4/mballoc.c:445 mb_find_buddy() warn: potential spectre issue
'EXT4_SB(e4b->bd_sb)->s_mb_offsets' [r] (local cap)
* fs/ext4/mballoc.c:446 mb_find_buddy() warn: potential spectre issue
'EXT4_SB(e4b->bd_sb)->s_mb_maxs' [r] (local cap)
Suggested-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Jeremy Cline <jcline@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
The only user is fuse_create_new_entry(), and there it's used to
mitigate the same mkdir/open-by-handle race as in nfs_mkdir().
The same solution applies - unhash the mkdir argument, then
call d_splice_alias() and if that returns a reference to preexisting
alias, dput() and report success. ->mkdir() argument left unhashed
negative with the preexisting alias moved in the right place is just
fine from the ->mkdir() callers point of view.
Cc: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The patch to fix the case where a lock request was interrupted ended up
changing default handling of errors such as NFS4ERR_DENIED and caused the
client to immediately resend the lock request. Let's do a partial revert
of that request so that the default is now to exit, but change the way
we handle resends to take into account the fact that the user may have
interrupted the request.
Reported-by: Kenneth Johansson <ken@kenjo.org>
Fixes: a3cf9bca2a ("NFSv4: Don't add a new lock on an interrupted wait..")
Cc: Benjamin Coddington <bcodding@redhat.com>
Cc: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
No need to have this in the top-level Kconfig.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Randy Dunlap <rdunlap@infradead.org>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
If config CONFIG_F2FS_FAULT_INJECTION is on, for both read or write path
we will call find_lock_page() to get the page, but for read path, it
missed to passing FGP_ACCESSED to allocator to active the page in LRU
list, result in being reclaimed in advance incorrectly, fix it.
Reported-by: Xianrong Zhou <zhouxianrong@huawei.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
For migration of encrypted inode's block, we load data of encrypted block
into meta inode's page cache, after checkpoint, those all intermediate
pages should be clean, and no one will read them again, so let's just
release them for more memory.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Like quota_ino feature, we need to reject mounting RDWR with image
which enables project_quota feature when there is no CONFIG_QUOTA
be set in kernel.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If quota feature is enabled, quota is on by default. However, if
CONFIG_QUOTA is not built in kernel, dquot entries will not get updated,
which leads to quota inconsistency.
Signed-off-by: Sheng Yong <shengyong1@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
According to fs/quota/dquot.c, `dq_data_lock' protects mem_dqinfo
structures and modifications of dquot pointers in the inode, and
`dquot->dq_dqb_lock' protects data from dq_dqb.
We should use dquot->dq_dqb_lock in statfs_project instead of
dq_dat_lock.
Signed-off-by: Sheng Yong <shengyong1@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch adds a new proc entry to show victim_secmap information in
more detail, which is very helpful to know the get_victim candidate
status clearly, and helpful to debug problems (e.g., some sections can
not gc all of its blocks, since some blocks belong to atomic file,
leaving victim_secmap with section bit setting, in extrem case, this
will lead all bytes of victim_secmap setting with 0xff).
Signed-off-by: Yunlong Song <yunlong.song@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Fsyncer will wait on all dnode pages of regular writeback before flushing,
if there are async dnode pages blocked by IO scheduler, it may decrease
fsync's performance.
In this patch, we choose to let f2fs_balance_fs_bg() to trigger checkpoint
to flush these dnode pages of regular, so async IO of dnode page can be
elimitnated, making fsyncer only need to wait for sync IO.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
For the case when sbi->segs_per_sec > 1 with lfs mode, take
section:segment = 5 for example, if the section prefree_map is
...previous section | current section (1 1 0 1 1) | next section...,
then the start = x, end = x + 1, after start = start_segno +
sbi->segs_per_sec, start = x + 5, then it will skip x + 3 and x + 4, but
their bitmap is still set, which will cause duplicated
f2fs_issue_discard of this same section in the next write_checkpoint:
round 1: section bitmap : 1 1 1 1 1, all valid, prefree_map: 0 0 0 0 0
then rm data block NO.2, block NO.2 becomes invalid, prefree_map: 0 0 1 0 0
write_checkpoint: section bitmap: 1 1 0 1 1, prefree_map: 0 0 0 0 0,
prefree of NO.2 is cleared, and no discard issued
round 2: rm data block NO.0, NO.1, NO.3, NO.4
all invalid, but prefree bit of NO.2 is set and cleared in round 1, then
prefree_map: 1 1 0 1 1
write_checkpoint: section bitmap: 0 0 0 0 0, prefree_map: 0 0 0 1 1, no
valid blocks of this section, so discard issued, but this time prefree
bit of NO.3 and NO.4 is skipped due to start = start_segno + sbi->segs_per_sec;
round 3:
write_checkpoint: section bitmap: 0 0 0 0 0, prefree_map: 0 0 0 1 1 ->
0 0 0 0 0, no valid blocks of this section, so discard issued,
this time prefree bit of NO.3 and NO.4 is cleared, but the discard of
this section is sent again...
To fix this problem, we can align the start and end value to section
boundary for fstrim and real-time discard operation, and decide to issue
discard only when the whole section is invalid, which can issue discard
aligned to section size as much as possible and avoid redundant discard.
Signed-off-by: Yunlong Song <yunlong.song@huawei.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In order to prevent abusing atomic writes by abnormal users, we've added a
threshold, 20% over memory footprint, which disallows further atomic writes.
Previously, however, SQLite doesn't know the files became normal, so that
it could write stale data and commit on revoked normal database file.
Once f2fs detects such the abnormal behavior, this patch tries to avoid further
writes in write_begin().
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In order to give advise to f2fs to recognize hot/cold file, it is possible
that we can set specific bit in inode.i_advise through setxattr(), but
there are several bits which are used internally, such as encrypt_bit,
keep_size_bit, they should never be changed through setxattr().
So that this patch 1) adds FADVISE_MODIFIABLE_BITS to filter modifiable
bits user given, 2) supports to clear {hot,cold}_file bits.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch fix wrong message info for recover fsync data
on readonly fs.
Signed-off-by: Yunlei He <heyunlei@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Romve redundant prefix 'f2fs_' in the middle of f2fs_ioc_f2fs_write_checkpoint().
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Since commit 201ef5e080 ("f2fs: improve shrink performance of extent nodes"),
there is no user of EXT_TREE_VEC_SIZE, just kill it for cleanup.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Because xattr_permission already checks CAP_SYS_ADMIN
capability, we don't need to check it.
Signed-off-by: Hyunchul Lee <cheol.lee@lge.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If caller of __get_meta_page() can handle error, let's propagate error
from __get_meta_page().
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Expand the blk_finish_plug action from blkzoned to normal lfs mode,
since plug will cause the out-of-order IO submission, which is not
friendly to flash in lfs mode.
Signed-off-by: Yunlong Song <yunlong.song@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
For the case when sbi->segs_per_sec > 1, take section:segment = 5 for
example, if segment 1 is just used and allocate new segment 2, and the
blocks of segment 1 is invalidated, at this time, the previous code will
use __set_test_and_free to free the free_secmap and free_sections++,
this is not correct since it is still a current section, so fix it.
Signed-off-by: Yunlong Song <yunlong.song@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If we attempt to request more blocks than we have room for, we try to
instead request as much as we can, however, alloc_valid_block_count
is not decremented to match the new value, allowing it to drift higher
until the next checkpoint. This always decrements it when the requested
amount cannot be fulfilled.
Signed-off-by: Daniel Rosenberg <drosen@google.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
For small granularity discard which size is smaller than 64KB, if we
issue those kind of discards orderly by size, their IOs will be spread
into entire logical address, so that in FTL, L2P table will be updated
randomly, result bad wear rate in the table.
In this patch, we choose to issue small discard by LBA order, by this
way, we can expect that L2P table updates from adjacent discard IOs can
be merged in the cache, so it can reduce lifetime wearing of flash.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
For background discard policy, even if there is queued user IO, still
we will check max_requests times for next discard entry, it is unneeded,
let's just stop this round submission immediately.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Fix build warnings in f2fs when CONFIG_PROC_FS is not enabled
by marking the unused functions as __maybe_unused.
../fs/f2fs/sysfs.c:519:12: warning: 'segment_info_seq_show' defined but not used [-Wunused-function]
../fs/f2fs/sysfs.c:546:12: warning: 'segment_bits_seq_show' defined but not used [-Wunused-function]
../fs/f2fs/sysfs.c:570:12: warning: 'iostat_info_seq_show' defined but not used [-Wunused-function]
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Chao Yu <yuchao0@huawei.com>
Cc: linux-f2fs-devel@lists.sourceforge.net
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
f2fs is focused on flash based storage, so let's enable real-time
discard by default, if user don't want to enable it, 'nodiscard'
mount option should be used on mount.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Below dmesg was printed when testing generic/388 of fstest:
F2FS-fs (zram1): find_fsync_dnodes: detect looped node chain, blkaddr:526615, next:526616
F2FS-fs (zram1): Cannot recover all fsync data errno=-22
F2FS-fs (zram1): Mounted with checkpoint version = 22300d0e
F2FS-fs (zram1): find_fsync_dnodes: detect looped node chain, blkaddr:526615, next:526616
F2FS-fs (zram1): Cannot recover all fsync data errno=-22
The reason is that we initialize free_blocks with free blocks of
filesystem, so if filesystem is full, free_blocks can be zero,
below condition will be true, so that, it will fail recovery.
if (++loop_cnt >= free_blocks ||
blkaddr == next_blkaddr_of_node(page))
To fix this issue, initialize free_blocks with correct value which
includes over-privision blocks.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Anatoly continues to find issues with fuzzed squashfs images.
This time, corrupt, missing, or undersized data for the page filling
wasn't checked for, because the squashfs_{copy,read}_cache() functions
did the squashfs_copy_data() call without checking the resulting data
size.
Which could result in the page cache pages being incompletely filled in,
and no error indication to the user space reading garbage data.
So make a helper function for the "fill in pages" case, because the
exact same incomplete sequence existed in two places.
[ I should have made a squashfs branch for these things, but I didn't
intend to start doing them in the first place.
My historical connection through cramfs is why I got into looking at
these issues at all, and every time I (continue to) think it's a
one-off.
Because _this_ time is always the last time. Right? - Linus ]
Reported-by: Anatoly Trosinenko <anatoly.trosinenko@gmail.com>
Tested-by: Willy Tarreau <w@1wt.eu>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Phillip Lougher <phillip@squashfs.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Extended attribute names are defined to be NUL-terminated, so the name
must not contain a NUL character. This is important because there are
places when remove extended attribute, the code uses strlen to
determine the length of the entry. That should probably be fixed at
some point, but code is currently really messy, so the simplest fix
for now is to simply validate that the extended attributes are sane.
https://bugzilla.kernel.org/show_bug.cgi?id=200401
Reported-by: Wen Xu <wen.xu@gatech.edu>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
Out of memory should not be considered as critical errors; so replace
ext4_error() with ext4_warnig().
Signed-off-by: Wang Shilong <wshilong@ddn.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org