In a bmapx call, bmv_count is the total size of the array, including the
zeroth element that userspace uses to supply the search key. The output
array starts at offset 1 so that we can set up the user for the next
invocation. Since we now can split an extent into multiple bmap records
due to shared/unshared status, we have to be careful that we don't
overflow the output array.
In the original patch f86f403794 ("xfs: teach get_bmapx about shared
extents and the CoW fork") I used cur_ext (the output index) to check
for overflows, albeit with an off-by-one error. Since nexleft no longer
describes the number of unfilled slots in the output, we can rip all
that out and use cur_ext for the overflow check directly.
Failure to do this causes heap corruption in bmapx callers such as
xfs_io and xfs_scrub. xfs/328 can reproduce this problem.
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
If we try to allocate memory pages to back an xfs_buf that we're trying
to read, it's possible that we'll be so short on memory that the page
allocation fails. For a blocking read we'll just wait, but for
readahead we simply dump all the pages we've collected so far.
Unfortunately, after dumping the pages we neglect to clear the
_XBF_PAGES state, which means that the subsequent call to xfs_buf_free
thinks that b_pages still points to pages we own. It then double-frees
the b_pages pages.
This results in screaming about negative page refcounts from the memory
manager, which xfs oughtn't be triggering. To reproduce this case,
mount a filesystem where the size of the inodes far outweighs the
availalble memory (a ~500M inode filesystem on a VM with 300MB memory
did the trick here) and run bulkstat in parallel with other memory
eating processes to put a huge load on the system. The "check summary"
phase of xfs_scrub also works for this purpose.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
With COW files they are the hotpath, just like for files with the
extent size hint attribute. We really shouldn't micro-manage anything
but failure cases with unlikely.
Additionally Arnd Bergmann recently reported that one of these two
unlikely annotations causes link failures together with an upcoming
kernel instrumentation patch, so let's get rid of it ASAP.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
xfs_attr_[get|remove]() have unlocked attribute fork checks to optimize
away a lock cycle in cases where the fork does not exist or is otherwise
empty. This check is not safe, however, because an attribute fork short
form to extent format conversion includes a transient state that causes
the xfs_inode_hasattr() check to fail. Specifically,
xfs_attr_shortform_to_leaf() creates an empty extent format attribute
fork and then adds the existing shortform attributes to it.
This means that lookup of an existing xattr can spuriously return
-ENOATTR when racing against a setxattr that causes the associated
format conversion. This was originally reproduced by an untar on a
particularly configured glusterfs volume, but can also be reproduced on
demand with properly crafted xattr requests.
The format conversion occurs under the exclusive ilock. xfs_attr_get()
and xfs_attr_remove() already have the proper locking and checks further
down in the functions to handle this situation correctly. Drop the
unlocked checks to avoid the spurious failure and rely on the existing
logic.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Currently we try to rely on the global reserved block pool for block
allocations for the free inode btree, but I have customer reports
(fairly complex workload, need to find an easier reproducer) where that
is not enough as the AG where we free an inode that requires a new
finobt block is entirely full. This causes us to cancel a dirty
transaction and thus a file system shutdown.
I think the right way to guard against this is to treat the finot the same
way as the refcount btree and have a per-AG reservations for the possible
worst case size of it, and the patch below implements that.
Note that this could increase mount times with large finobt trees. In
an ideal world we would have added a field for the number of finobt
fields to the AGI, similar to what we did for the refcount blocks.
We should do add it next time we rev the AGI or AGF format by adding
new fields.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Try to reserve the blocks first and only then update the fields in
or hanging off the mount structure. This way we can call __xfs_ag_resv_init
again after a previous failure.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Linux 4.9 added two ioctl() operations that can be used to discover:
* the parental relationships for hierarchical namespaces (user and PID)
[NS_GET_PARENT]
* the user namespaces that owns a specified non-user-namespace
[NS_GET_USERNS]
For no good reason that I can glean, NS_GET_USERNS was made synonymous
with NS_GET_PARENT for user namespaces. It might have been better if
NS_GET_USERNS had returned an error if the supplied file descriptor
referred to a user namespace, since it suggests that the caller may be
confused. More particularly, if it had generated an error, then I wouldn't
need the new ioctl() operation proposed here. (On the other hand, what
I propose here may be more generally useful.)
I would like to write code that discovers namespace relationships for
the purpose of understanding the namespace setup on a running system.
In particular, given a file descriptor (or pathname) for a namespace,
N, I'd like to obtain the corresponding user namespace. Namespace N
might be a user namespace (in which case my code would just use N) or
a non-user namespace (in which case my code will use NS_GET_USERNS to
get the user namespace associated with N). The problem is that there
is no way to tell the difference by looking at the file descriptor
(and if I try to use NS_GET_USERNS on an N that is a user namespace, I
get the parent user namespace of N, which is not what I want).
This patch therefore adds a new ioctl(), NS_GET_NSTYPE, which, given
a file descriptor that refers to a user namespace, returns the
namespace type (one of the CLONE_NEW* constants).
Signed-off-by: Michael Kerrisk <mtk-manpages@gmail.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Commit 8a59f5d252 ("fs/romfs: return f_fsid for statfs(2)") generates
a 64bit id from sb->s_bdev->bd_dev. This is only correct when romfs is
defined with CONFIG_ROMFS_ON_BLOCK. If romfs is only defined with
CONFIG_ROMFS_ON_MTD, sb->s_bdev is NULL, referencing sb->s_bdev->bd_dev
will triger an oops.
Richard Weinberger points out that when CONFIG_ROMFS_BACKED_BY_BOTH=y,
both CONFIG_ROMFS_ON_BLOCK and CONFIG_ROMFS_ON_MTD are defined.
Therefore when calling huge_encode_dev() to generate a 64bit id, I use
the follow order to choose parameter,
- CONFIG_ROMFS_ON_BLOCK defined
use sb->s_bdev->bd_dev
- CONFIG_ROMFS_ON_BLOCK undefined and CONFIG_ROMFS_ON_MTD defined
use sb->s_dev when,
- both CONFIG_ROMFS_ON_BLOCK and CONFIG_ROMFS_ON_MTD undefined
leave id as 0
When CONFIG_ROMFS_ON_MTD is defined and sb->s_mtd is not NULL, sb->s_dev
is set to a device ID generated by MTD_BLOCK_MAJOR and mtd index,
otherwise sb->s_dev is 0.
This is a try-best effort to generate a uniq file system ID, if all the
above conditions are not meet, f_fsid of this romfs instance will be 0.
Generally only one romfs can be built on single MTD block device, this
method is enough to identify multiple romfs instances in a computer.
Link: http://lkml.kernel.org/r/1482928596-115155-1-git-send-email-colyli@suse.de
Signed-off-by: Coly Li <colyli@suse.de>
Reported-by: Nong Li <nongli1031@gmail.com>
Tested-by: Nong Li <nongli1031@gmail.com>
Cc: Richard Weinberger <richard.weinberger@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We have seen proc_pid_readdir() invocations holding cpu for more than 50
ms. Add a cond_resched() to be gentle with other tasks.
[akpm@linux-foundation.org: coding style fix]
Link: http://lkml.kernel.org/r/1484238380.15816.42.camel@edumazet-glaptop3.roam.corp.google.com
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
With >=32 CPUs the userfaultfd selftest triggered a graceful but
unexpected SIGBUS because VM_FAULT_RETRY was returned by
handle_userfault() despite the UFFDIO_COPY wasn't completed.
This seems caused by rwsem waking the thread blocked in
handle_userfault() and we can't run up_read() before the wait_event
sequence is complete.
Keeping the wait_even sequence identical to the first one, would require
running userfaultfd_must_wait() again to know if the loop should be
repeated, and it would also require retaking the rwsem and revalidating
the whole vma status.
It seems simpler to wait the targeted wakeup so that if false wakeups
materialize we still wait for our specific wakeup event, unless of
course there are signals or the uffd was released.
Debug code collecting the stack trace of the wakeup showed this:
$ ./userfaultfd 100 99999
nr_pages: 25600, nr_pages_per_cpu: 800
bounces: 99998, mode: racing ver poll, userfaults: 32 35 90 232 30 138 69 82 34 30 139 40 40 31 20 19 43 13 15 28 27 38 21 43 56 22 1 17 31 8 4 2
bounces: 99997, mode: rnd ver poll, Bus error (core dumped)
save_stack_trace+0x2b/0x50
try_to_wake_up+0x2a6/0x580
wake_up_q+0x32/0x70
rwsem_wake+0xe0/0x120
call_rwsem_wake+0x1b/0x30
up_write+0x3b/0x40
vm_mmap_pgoff+0x9c/0xc0
SyS_mmap_pgoff+0x1a9/0x240
SyS_mmap+0x22/0x30
entry_SYSCALL_64_fastpath+0x1f/0xbd
0xffffffffffffffff
FAULT_FLAG_ALLOW_RETRY missing 70
CPU: 24 PID: 1054 Comm: userfaultfd Tainted: G W 4.8.0+ #30
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.3-0-ge2fc41e-prebuilt.qemu-project.org 04/01/2014
Call Trace:
dump_stack+0xb8/0x112
handle_userfault+0x572/0x650
handle_mm_fault+0x12cb/0x1520
__do_page_fault+0x175/0x500
trace_do_page_fault+0x61/0x270
do_async_page_fault+0x19/0x90
async_page_fault+0x25/0x30
This always happens when the main userfault selftest thread is running
clone() while glibc runs either mprotect or mmap (both taking mmap_sem
down_write()) to allocate the thread stack of the background threads,
while locking/userfault threads already run at full throttle and are
susceptible to false wakeups that may cause handle_userfault() to return
before than expected (which results in graceful SIGBUS at the next
attempt).
This was reproduced only with >=32 CPUs because the loop to start the
thread where clone() is too quick with fewer CPUs, while with 32 CPUs
there's already significant activity on ~32 locking and userfault
threads when the last background threads are started with clone().
This >=32 CPUs SMP race condition is likely reproducible only with the
selftest because of the much heavier userfault load it generates if
compared to real apps.
We'll have to allow "one more" VM_FAULT_RETRY for the WP support and a
patch floating around that provides it also hidden this problem but in
reality only is successfully at hiding the problem.
False wakeups could still happen again the second time
handle_userfault() is invoked, even if it's a so rare race condition
that getting false wakeups twice in a row is impossible to reproduce.
This full fix is needed for correctness, the only alternative would be
to allow VM_FAULT_RETRY to be returned infinitely. With this fix the WP
support can stick to a strict "one more" VM_FAULT_RETRY logic (no need
of returning it infinite times to avoid the SIGBUS).
Link: http://lkml.kernel.org/r/20170111005535.13832-2-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Shubham Kumar Sharma <shubham.kumar.sharma@oracle.com>
Tested-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Michael Rapoport <RAPOPORT@il.ibm.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
As reported by Arnd:
https://lkml.org/lkml/2017/1/10/756
Compiling with the following configuration:
# CONFIG_EXT2_FS is not set
# CONFIG_EXT4_FS is not set
# CONFIG_XFS_FS is not set
# CONFIG_FS_IOMAP depends on the above filesystems, as is not set
CONFIG_FS_DAX=y
generates build warnings about unused functions in fs/dax.c:
fs/dax.c:878:12: warning: `dax_insert_mapping' defined but not used [-Wunused-function]
static int dax_insert_mapping(struct address_space *mapping,
^~~~~~~~~~~~~~~~~~
fs/dax.c:572:12: warning: `copy_user_dax' defined but not used [-Wunused-function]
static int copy_user_dax(struct block_device *bdev, sector_t sector, size_t size,
^~~~~~~~~~~~~
fs/dax.c:542:12: warning: `dax_load_hole' defined but not used [-Wunused-function]
static int dax_load_hole(struct address_space *mapping, void **entry,
^~~~~~~~~~~~~
fs/dax.c:312:14: warning: `grab_mapping_entry' defined but not used [-Wunused-function]
static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index,
^~~~~~~~~~~~~~~~~~
Now that the struct buffer_head based DAX fault paths and I/O path have
been removed we really depend on iomap support being present for DAX.
Make this explicit by selecting FS_IOMAP if we compile in DAX support.
This allows us to remove conditional selections of FS_IOMAP when FS_DAX
was present for ext2 and ext4, and to remove an #ifdef in fs/dax.c.
Link: http://lkml.kernel.org/r/1484087383-29478-1-git-send-email-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reported-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
sb_dirblklog is added to sb_blocklog to compute the directory block size
in bytes. Therefore, we must compare the sum of both those values
against XFS_MAX_BLOCKSIZE_LOG, not just dirblklog.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Some nfsv4.0 servers may return a mode for the verifier following an open
with EXCLUSIVE4 createmode, but this does not mean the client should skip
setting the mode in the following SETATTR. It should only do that for
EXCLUSIVE4_1 or UNGAURDED createmode.
Fixes: 5334c5bdac ("NFS: Send attributes in OPEN request for NFS4_CREATE_EXCLUSIVE4_1")
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Cc: stable@vger.kernel.org # v4.3+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
We can't dereference the dio structure after submitting the last bio for
this request, as I/O completion might have happened before the code is
run. Introduce a local is_sync variable instead.
Fixes: 542ff7bf ("block: new direct I/O implementation")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Matias Bjørling <m@bjorling.me>
Tested-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
We cannot call nfs4_handle_exception() without first ensuring that the
slot has been freed. If not, we end up deadlocking with the process
waiting for recovery to complete, and recovery waiting for the slot
table to drain.
Fixes: 2e80dbe7ac ("NFSv4.1: Close callback races for OPEN, LAYOUTGET...")
Cc: stable@vger.kernel.org # v4.8+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Instead of making the files owned by the GLOBAL_ROOT_USER. Make
non-dumpable files whose mm has always lived in a user namespace owned
by the user namespace root. This allows the container root to have
things work as expected in a container.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
With previous changes every location that tests for
LSM_UNSAFE_PTRACE_CAP also tests for LSM_UNSAFE_PTRACE making the
LSM_UNSAFE_PTRACE_CAP redundant, so remove it.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
This patchset converts inotify to using the newly introduced
per-userns sysctl infrastructure.
Currently the inotify instances/watches are being accounted in the
user_struct structure. This means that in setups where multiple
users in unprivileged containers map to the same underlying
real user (i.e. pointing to the same user_struct) the inotify limits
are going to be shared as well, allowing one user(or application) to exhaust
all others limits.
Fix this by switching the inotify sysctls to using the
per-namespace/per-user limits. This will allow the server admin to
set sensible global limits, which can further be tuned inside every
individual user namespace. Additionally, in order to preserve the
sysctl ABI make the existing inotify instances/watches sysctls
modify the values of the initial user namespace.
Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Acked-by: Jan Kara <jack@suse.cz>
Acked-by: Serge Hallyn <serge@hallyn.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Due to the way how xfs_iomap_write_allocate tries to convert the whole
found extents from delalloc to real space we can run into a race
condition with multiple threads doing writes to this same extent.
For the non-COW case that is harmless as the only thing that can happen
is that we call xfs_bmapi_write on an extent that has already been
converted to a real allocation. For COW writes where we move the extent
from the COW to the data fork after I/O completion the race is, however,
not quite as harmless. In the worst case we are now calling
xfs_bmapi_write on a region that contains hole in the COW work, which
will trip up an assert in debug builds or lead to file system corruption
in non-debug builds. This seems to be reproducible with workloads of
small O_DSYNC write, although so far I've not managed to come up with
a with an isolated reproducer.
The fix for the issue is relatively simple: tell xfs_bmapi_write
that we are only asked to convert delayed allocations and skip holes
in that case.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The last BUG_ON in mb_find_extent() is apparently triggering in some
rare cases. Most of the time it indicates a bug in the buddy bitmap
algorithms, but there are some weird cases where it can trigger when
buddy bitmap is still in memory, but the block bitmap has to be read
from disk, and there is disk or memory corruption such that the block
bitmap and the buddy bitmap are out of sync.
Google-Bug-Id: #33702157
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
marked for stable) and two fixups for this merge window's patches.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQEcBAABCAAGBQJYghs8AAoJEEp/3jgCEfOLOz0IAI/xNUMO121S57GEhzkKDdWC
5PCHjg9itU+2eMCCZ2Nyuikj2NVwEFh9HLpMz5jtFa3oWCIhljh9wT8zlKDgpn5R
Q1GCT4LkHGhV+HA2sM04aynKBmC90ZVAHfDt/BTs5mLzW7neSpxFOQEPdS4FG6Zg
NxUGcI/GhqmfpcLnm5IqXxI1cc0bXf6BmEzlGrPAkvzJBhHXWKCVpr1Q/nBW96Q5
ko1EpP16wZoeRvsr1ztXmBTNURUrCi7S6PyK4M5MAro381U3a7zwQuFq9uuREahO
nJtCjWD3bd6U3ENDe/Gacz3czXQyjOjE2/w42jL1dA84UMQbz+wv1SyNCkQgiyI=
=1LTx
-----END PGP SIGNATURE-----
Merge tag 'ceph-for-4.10-rc5' of git://github.com/ceph/ceph-client
Pull ceph fixes from Ilya Dryomov:
"Three filesystem endianness fixes (one goes back to the 2.6 era, all
marked for stable) and two fixups for this merge window's patches"
* tag 'ceph-for-4.10-rc5' of git://github.com/ceph/ceph-client:
ceph: fix bad endianness handling in parse_reply_info_extra
ceph: fix endianness bug in frag_tree_split_cmp
ceph: fix endianness of getattr mask in ceph_d_revalidate
libceph: make sure ceph_aes_crypt() IV is aligned
ceph: fix ceph_get_caps() interruption
Pull overlayfs fix from Miklos Szeredi:
"This fixes a regression introduced in this cycle"
* 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
ovl: fix possible use after free on redirect dir lookup
Pull fuse fixes from Miklos Szeredi:
"Fix two regressions, one introduced in 4.9 and a less recent one in
4.2"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
fuse: fix time_to_jiffies nsec sanity check
fuse: clear FR_PENDING flag when moving requests out of pending queue
udf_fill_super() used udf_parse_options() to flag UDF_FLAG_BLOCKSIZE_SET
when blocksize was specified otherwise used 512 bytes
(bdev_logical_block_size) and 2048 bytes (UDF_DEFAULT_BLOCKSIZE)
IOW both 1024 and 4096 specifications were required or resulted in
"mount: wrong fs type, bad option, bad superblock on /dev/loop1"
This patch loops through different block values but also updates
udf_load_vrs() to return -EINVAL instead of 0 when udf_check_vsd()
fails (and uopt->novrs = 0).
The later being the reason for the RFC; we have that case when mounting
a 4kb blocksize against other values but maybe VRS is not mandatory
there ?
Tested with 512, 1024, 2048 and 4096 blocksize
Reported-by: Jan Kara <jack@suse.com>
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Jan Kara <jack@suse.cz>
- Inode i_mode sanitization
- Prevent overflows in getnextquota
- Minor build fixes
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABCgAGBQJYf9VQAAoJEPh/dxk0SrTr6CMQAJRUULHRS2SJe22pJSRzkjsl
H838pav3bmTDGafWx5SgaazDHsVd205DHLU9ovi9zyqQGviG1kgIl9a7TvXtEU6R
DXbjnxRTb9tgDbFKYstvl0lDXiWUFoB1nMOkNa8BDhBUe1sGadMXY5gh+CnpoKeg
qEwNiE3yUJpnPVog4d+SRQSMTUbD3VQBtkh4IcucR+zqWsOiT7jiepqJvee5ZXYj
Wz2Eo0QFhSGq7lk9IRl7C3ter77QMz6f4Ba3oZMjM2WD317JmKk5vDU1B7mHgRIZ
cnwOayICECKs9GQWmgD5ew+GEaCRHuSzeoiP72O2bjwiLgJbG/eHQJ/8T7t3sgV0
mAIYZJXf8jjmTjzjV3n7aEt4iTKYssaLvmMQfg6AYwDXapzm7xYkcSDQa1pDFnEd
DKDBsd17Xx399dtD/pjSw4X+/Z2ILEhcLvVKOO68jvM/wXIVQENKRt6/5QEUdBkw
HjuLccA5e5gpjZ00S7cZo44TRgbJPG+hw611Fy3reJPW1sHQJEbpspOMIRB3vpff
HxoTQm864ZNxR7pXv5sX9gxHp4JLAZ2wfxa+iLe/0rAelq01OHlz7fygc27jXilU
v6URno92PbARzvtyeVAZ+FJLRBpQy90Y4ZRM87GDebCGCDo8xqtxxhiG1xIqriON
PYLuQNeF2MELGUm3nlJO
=4Mxu
-----END PGP SIGNATURE-----
Merge tag 'xfs-for-linux-4.10-rc5-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs fixes from Darrick Wong:
"I have a few more patches this week -- one to make the behavior of a
quota id ioctl consistent with the other filesystems, and the rest
improve validation of i_mode & i_size values coming into xfs so that
we don't read off the ends of arrays or crash when handed garbage disk
data.
Summary:
- inode i_mode sanitization
- prevent overflows in getnextquota
- minor build fixes"
* tag 'xfs-for-linux-4.10-rc5-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: fix xfs_mode_to_ftype() prototype
xfs: don't wrap ID in xfs_dq_get_next_id
xfs: sanity check inode di_mode
xfs: sanity check inode mode when creating new dentry
xfs: replace xfs_mode_to_ftype table with switch statement
xfs: add missing include dependencies to xfs_dir2.h
xfs: sanity check directory inode di_size
xfs: make the ASSERT() condition likely
For such a file mapping,
[0-4k][hole][8k-12k]
In NO_HOLES mode, we don't have the [hole] extent any more.
Commit c1aa45759e ("Btrfs: fix shrinking truncate when the no_holes feature is enabled")
fixed disk isize not being updated in NO_HOLES mode when data is not flushed.
However, even if data has been flushed, we can still have trouble
in updating disk isize since we updated disk isize to 'start' of
the last evicted extent.
Reviewed-by: Chris Mason <clm@fb.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The following deadlock is seen when executing generic/113 test,
---------------------------------------------------------+----------------------------------------------------
Direct I/O task Fast fsync task
---------------------------------------------------------+----------------------------------------------------
btrfs_direct_IO
__blockdev_direct_IO
do_blockdev_direct_IO
do_direct_IO
btrfs_get_blocks_direct
while (blocks needs to written)
get_more_blocks (first iteration)
btrfs_get_blocks_direct
btrfs_create_dio_extent
down_read(&BTRFS_I(inode) >dio_sem)
Create and add extent map and ordered extent
up_read(&BTRFS_I(inode) >dio_sem)
btrfs_sync_file
btrfs_log_dentry_safe
btrfs_log_inode_parent
btrfs_log_inode
btrfs_log_changed_extents
down_write(&BTRFS_I(inode) >dio_sem)
Collect new extent maps and ordered extents
wait for ordered extent completion
get_more_blocks (second iteration)
btrfs_get_blocks_direct
btrfs_create_dio_extent
down_read(&BTRFS_I(inode) >dio_sem)
--------------------------------------------------------------------------------------------------------------
In the above description, Btrfs direct I/O code path has not yet started
submitting bios for file range covered by the initial ordered
extent. Meanwhile, The fast fsync task obtains the write semaphore and
waits for I/O on the ordered extent to get completed. However, the
Direct I/O task is now blocked on obtaining the read semaphore.
To resolve the deadlock, this commit modifies the Direct I/O code path
to obtain the read semaphore before invoking
__blockdev_direct_IO(). The semaphore is then given up after
__blockdev_direct_IO() returns. This allows the Direct I/O code to
complete I/O on all the ordered extents it creates.
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Below test script can reveal this bug:
dd if=/dev/zero of=fs.img bs=$((1024*1024)) count=100
dev=$(losetup --show -f fs.img)
mkdir -p /mnt/mntpoint
mkfs.btrfs -f $dev
mount $dev /mnt/mntpoint
cd /mnt/mntpoint
echo "workdir is: /mnt/mntpoint"
blocksize=$((128 * 1024))
dd if=/dev/zero of=testfile bs=$blocksize count=1
sync
count=$((17*1024*1024*1024/blocksize))
echo "file size is:" $((count*blocksize))
for ((i = 1; i <= $count; i++)); do
dst_offset=$((blocksize * i))
xfs_io -f -c "reflink testfile 0 $dst_offset $blocksize"\
testfile > /dev/null
done
sync
truncate --size 0 testfile
The last truncate operation will fail for ENOSPC reason, but indeed
it should not fail.
In btrfs_truncate(), we use a temporary block_rsv to do truncate
operation. With every btrfs_truncate_inode_items() call, we migrate space
to this block_rsv, but forget to cleanup previous reservation, which
will make this block_rsv's reserved bytes keep growing, and this reserved
space will only be released in the end of btrfs_truncate(), this metadata
leak will impact other's metadata reservation. In this case, it's
"btrfs_start_transaction(root, 2);" fails for enospc error, which make
this truncate operation fail.
Call btrfs_block_rsv_release() to fix this bug.
Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There are a number of usermode helper binaries that are "hard coded" in
the kernel today, so mark them as "const" to make it harder for someone
to change where the variables point to.
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Thomas Sailer <t.sailer@alumni.ethz.ch>
Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
Cc: Johan Hovold <johan@kernel.org>
Cc: Alex Elder <elder@kernel.org>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Jeff Layton <jlayton@poochiereds.net>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
A harmless warning just got introduced:
fs/xfs/libxfs/xfs_dir2.h:40:8: error: type qualifiers ignored on function return type [-Werror=ignored-qualifiers]
Removing the 'const' modifier avoids the warning and has no
other effect.
Fixes: 1fc4d33fed ("xfs: replace xfs_mode_to_ftype table with switch statement")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
sparse says:
fs/ceph/mds_client.c:291:23: warning: restricted __le32 degrades to integer
fs/ceph/mds_client.c:293:28: warning: restricted __le32 degrades to integer
fs/ceph/mds_client.c:294:28: warning: restricted __le32 degrades to integer
fs/ceph/mds_client.c:296:28: warning: restricted __le32 degrades to integer
The op value is __le32, so we need to convert it before comparing it.
Cc: stable@vger.kernel.org # needs backporting for < 3.14
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Sage Weil <sage@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
sparse says:
fs/ceph/inode.c:308:36: warning: incorrect type in argument 1 (different base types)
fs/ceph/inode.c:308:36: expected unsigned int [unsigned] [usertype] a
fs/ceph/inode.c:308:36: got restricted __le32 [usertype] frag
fs/ceph/inode.c:308:46: warning: incorrect type in argument 2 (different base types)
fs/ceph/inode.c:308:46: expected unsigned int [unsigned] [usertype] b
fs/ceph/inode.c:308:46: got restricted __le32 [usertype] frag
We need to convert these values to host-endian before calling the
comparator.
Fixes: a407846ef7 ("ceph: don't assume frag tree splits in mds reply are sorted")
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Sage Weil <sage@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Commit 5c341ee328 ("ceph: fix scheduler warning due to nested
blocking") causes infinite loop when process is interrupted. Fix it.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
ovl_lookup_layer() iterates on path elements of d->name.name
but also frees and allocates a new pointer for d->name.name.
For the case of lookup in upper layer, the initial d->name.name
pointer is stable (dentry->d_name), but for lower layers, the
initial d->name.name can be d->redirect, which can be freed during
iteration.
[SzM]
Keep the count of remaining characters in the redirect path and calculate
the current position from that. This works becuase only the prefix is
modified, the ending always stays the same.
Fixes: 02b69b284c ("ovl: lookup redirects")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
The GETNEXTQOTA ioctl takes whatever ID is sent in,
and looks for the next active quota for an user
equal or higher to that ID.
But if we are at the maximum ID and then ask for the "next"
one, we may wrap back to zero. In this case, userspace
may loop forever, because it will start querying again
at zero.
We'll fix this in userspace as well, but for the kernel,
return -ENOENT if we ask for the next quota ID
past UINT_MAX so the caller knows to stop.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Check for invalid file type in xfs_dinode_verify()
and fail to load the inode structure from disk.
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The helper xfs_dentry_to_name() is used by 2 different
classes of callers: Callers that pass zero mode and don't care
about the returned name.type field and Callers that pass
non zero mode and do care about the name.type field.
Change xfs_dentry_to_name() to not take the mode argument and
change the call sites of the first class to not pass the mode
argument.
Create a new helper xfs_dentry_mode_to_name() which does pass
the mode argument and returns -EFSCORRUPTED if mode is invalid.
Callers that translate non zero mode to on-disk file type now
check the return value and will export the error to user instead
of staging an invalid file type to be written to directory entry.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The size of the xfs_mode_to_ftype[] conversion table
was too small to handle an invalid value of mode=S_IFMT.
Instead of fixing the table size, replace the conversion table
with a conversion helper that uses a switch statement.
Suggested-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
xfs_dir2.h dereferences some data types in inline functions
and fails to include those type definitions, e.g.:
xfs_dir2_data_aoff_t, struct xfs_da_geometry.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
This changes fixes an assertion hit when fuzzing on-disk
i_mode values.
The easy case to fix is when changing an empty file
i_mode to S_IFDIR. In this case, xfs_dinode_verify()
detects an illegal zero size for directory and fails
to load the inode structure from disk.
For the case of non empty file whose i_mode is changed
to S_IFDIR, the ASSERT() statement in xfs_dir2_isblock()
is replaced with return -EFSCORRUPTED, to avoid interacting
with corrupted jusk also when XFS_DEBUG is disabled.
Suggested-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The ASSERT() condition is the normal case, not the exception,
so testing the condition should be likely(), not unlikely().
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
When replaying the journal it can happen that a journal entry points to
a garbage collected node.
This is the case when a power-cut occurred between a garbage collect run
and a commit. In such a case nodes have to be read using the failable
read functions to detect whether the found node matches what we expect.
One corner case was forgotten, when the journal contains an entry to
remove an inode all xattrs have to be removed too. UBIFS models xattr
like directory entries, so the TNC code iterates over
all xattrs of the inode and removes them too. This code re-uses the
functions for walking directories and calls ubifs_tnc_next_ent().
ubifs_tnc_next_ent() expects to be used only after the journal and
aborts when a node does not match the expected result. This behavior can
render an UBIFS volume unmountable after a power-cut when xattrs are
used.
Fix this issue by using failable read functions in ubifs_tnc_next_ent()
too when replaying the journal.
Cc: stable@vger.kernel.org
Fixes: 1e51764a3c ("UBIFS: add new flash file system")
Reported-by: Rock Lee <rockdotlee@gmail.com>
Reviewed-by: David Gstir <david@sigma-star.at>
Signed-off-by: Richard Weinberger <richard@nod.at>
In several places, ubifs checked for an encryption key before creating a
file in an encrypted directory. This was redundant with
fscrypt_setup_filename() or ubifs_new_inode(), and in the case of
ubifs_link() it broke linking to special files. So remove the extra
checks.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
The ubifs encryption ioctls did not work when called by a 32-bit program
on a 64-bit kernel. Since 'struct fscrypt_policy' is not affected by
the word size, ubifs just needs to allow these ioctls through, like what
ext4 and f2fs do.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
This came up during the v4.10 merge window:
warning: (UBIFS_FS_ENCRYPTION) selects FS_ENCRYPTION which has unmet direct dependencies (BLOCK)
fs/crypto/crypto.c: In function 'fscrypt_zeroout_range':
fs/crypto/crypto.c:355:9: error: implicit declaration of function 'bio_alloc';did you mean 'd_alloc'? [-Werror=implicit-function-declaration]
bio = bio_alloc(GFP_NOWAIT, 1);
The easiest way out is to limit UBIFS_FS_ENCRYPTION to configurations
that also enable BLOCK.
Fixes: d475a50745 ("ubifs: Add skeleton for fscrypto")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Richard Weinberger <richard@nod.at>
err is no longer being set on a successful return path, causing
a garbage value being returned. Fix this by setting err to zero
for the successful return path.
Found with static analysis by CoverityScan, CID 1389473
Fixes: 7799953b34 ("ubifs: Implement encrypt/decrypt for all IO")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
Bugfixes:
- Fix invalid fget()/fput() calls when doing file locking
- Fix multiple directory cache invalidation issues due to the client failing
to recognise that the directory wasn't changed.
- Fix client recovery when server reboots multiple times
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJYfSXZAAoJEGcL54qWCgDy3K8P/0YGi1J5VoWkcMD5+Ljh2H3O
AV1oZ3EWLXH2jlJZmX7A1A9ReOEmsYmCX+8QM2ZWbBZI+vIzEJEWd7fU+nhtgfiq
pcw29CSpeBhfcqSswlRE6ouxKpmufnzhwbuGcy9WUTyGYaz1k05/4YHTunyxiZr6
QRHL2Y0YkU0cWD1FHwoOkQr8ft7kdQ4iXHru9utttdYsb4vJnX9ShLpaLW5MDBqS
hPd1ivJHPlWiOVEzLSlTHBvTvw8j5PjhsRQ/q5UwiDvQQDoWL7/qnP4nxZee44mh
MQ+61/0WxS+ohMkIYrZ6hPezk2QX/i7JRQvBPNU07U/+TqtsWX/F5+OerCeQ4/6W
RhP7XDjJV8TpeeJcUU58r1aoEpaWRt9cmtwBWMKBu9tTHCFyRXHsaIh1ZR+Hy85e
HUiiaD4rPldr+ZbCcTwR5bSSkF96WaRF9B3L8Nvgys9v4wwO6LuTnHv5uY8H7ct5
I66nL14drrA+jC01D+SuMl0AnFhrCp7mJyJbrZDvXBJ9hxzzgEj/Jz3IC0mc0SxI
ZjFhQoG2NjV2oJ0PTu9RMUw+Fex0yz6PsXoHLKr45VXkkwQL3Uldq6SVWfkkzqUk
SPFTYD49i9TqCHhGZEzm7kGNM6ASUDKQtAloiUA2QvL6v178RLwG2sjYxy2kEApz
s208kzh8He79iMLEkphR
=K8F7
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-4.10-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client bugfixes from Trond Myklebust:
- fix invalid fget()/fput() calls when doing file locking
- fix multiple directory cache invalidation issues due to the client
failing to recognise that the directory wasn't changed
- fix client recovery when server reboots multiple times
* tag 'nfs-for-4.10-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
NFSv4: Fix client recovery when server reboots multiple times
NFSv4: update_changeattr should update the attribute timestamp
NFSv4: Don't call update_changeattr() unless the unlink is successful
NFSv4: Don't apply change_info4 twice on rename within a directory
NFSv4: Call update_changeattr() from _nfs4_proc_open only if a file was created
nfs: Don't take a reference on fl->fl_file for LOCK operation
The bulk readpages support introduced a harmless warning:
fs/afs/file.c: In function 'afs_readpages_page_done':
fs/afs/file.c:270:20: error: unused variable 'vnode' [-Werror=unused-variable]
This adds an #ifdef to match the user of that variable. The user of the
variable has to be conditional because it accesses a member of a struct
that is also conditional.
Fixes: 91b467e0a3 ("afs: Make afs_readpages() fetch data in bulk")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
bugs.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJYfOk6AAoJECebzXlCjuG+Lj4QALaLKRRbIdrz6nmg7gUmpTWc
CdW8NMbzwSCXmYoivsTHBlhXZKsi5vVjnFXMCM/P85ddmipXdcTFCDLmmNoKUQ0M
jODlLX90ctaZKCDBVSaH4htAz2gkFv7z5IllX0YDQqHyiuzh/9KoV+AFCgPZPTpL
O1XRmfWz+yJDydz4hb3i5f2JvMk9P/tCXLnheuxxTIMSl2/fIfgF81eWwDpFqcA2
27+PyWWjZehVnZ77ca/mWJj2n0+gBINiKafcfF39NK/Hv2q4aauB3k7c4blecc9Q
m/IT3mKifvHvdNCmvHD5s74h4OikEGYpqaSjonMptZnWgfM4/gtF7yTiQjsOMDx/
w6W/tfHlGrvegpzhjaIaoZZ50EZp7xwGNNZYgH4J44kytYpolrhsOR6NqCLTqpej
xG2Kd89ZtnAgc/7T7ET/1PqpZ8f9M9pyV3E8s36OvF4AYQUNrfzbWSTQcZy3WGBP
YuoUCzacIbNbGgu4m6Zx5l/vKW5yn45xbUMp7T9S4WoxYMx6a5vViU0NiF7KsQDu
pcDT92DZ57KJFtCw7Ig08ILKsSXmNApH5/4mIrkX3quZuH4j2XapEJ9u//fmfZBd
Q+Sgv8RXcGELUJIg9yfmoWgPDA/oYslc7ynBV0lXLNgBuod//dGSlZ+6KfFFJYr8
XVOxwPTiiBIlc9lvB9eA
=tb4L
-----END PGP SIGNATURE-----
Merge tag 'nfsd-4.10-1' of git://linux-nfs.org/~bfields/linux
Pull nfsd fixes from Bruce Fields:
"Miscellaneous nfsd bugfixes, one for a 4.10 regression, three for
older bugs"
* tag 'nfsd-4.10-1' of git://linux-nfs.org/~bfields/linux:
svcrdma: avoid duplicate dma unmapping during error recovery
sunrpc: don't call sleeping functions from the notifier block callbacks
svcrpc: don't leak contexts on PROC_DESTROY
nfsd: fix supported attributes for acl & labels
Pull namespace fixes from Eric Biederman:
"This tree contains 4 fixes.
The first is a fix for a race that can causes oopses under the right
circumstances, and that someone just recently encountered.
Past that are several small trivial correct fixes. A real issue that
was blocking development of an out of tree driver, but does not appear
to have caused any actual problems for in-tree code. A potential
deadlock that was reported by lockdep. And a deadlock people have
experienced and took the time to track down caused by a cleanup that
removed the code to drop a reference count"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
sysctl: Drop reference added by grab_header in proc_sys_readdir
pid: fix lockdep deadlock warning due to ucount_lock
libfs: Modify mount_pseudo_xattr to be clear it is not a userspace mount
mnt: Protect the mountpoint hashtable with mount_lock
Pull vfs fixes from Al Viro.
The most notable fix here is probably the fix for a splice regression
("fix a fencepost error in pipe_advance()") noticed by Alan Wylie.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fix a fencepost error in pipe_advance()
coredump: Ensure proper size of sparse core files
aio: fix lock dep warning
tmpfs: clear S_ISGID when setting posix ACLs
Pull block fixes from Jens Axboe:
- the virtio_blk stack DMA corruption fix from Christoph, fixing and
issue with VMAP stacks.
- O_DIRECT blkbits calculation fix from Chandan.
- discard regression fix from Christoph.
- queue init error handling fixes for nbd and virtio_blk, from Omar and
Jeff.
- two small nvme fixes, from Christoph and Guilherme.
- rename of blk_queue_zone_size and bdev_zone_size to _sectors instead,
to more closely follow what we do in other places in the block layer.
This interface is new for this series, so let's get the naming right
before releasing a kernel with this feature. From Damien.
* 'for-linus' of git://git.kernel.dk/linux-block:
block: don't try to discard from __blkdev_issue_zeroout
sd: remove __data_len hack for WRITE SAME
nvme: use blk_rq_payload_bytes
scsi: use blk_rq_payload_bytes
block: add blk_rq_payload_bytes
block: Rename blk_queue_zone_size and bdev_zone_size
nvme: apply DELAY_BEFORE_CHK_RDY quirk at probe time too
nvme-rdma: fix nvme_rdma_queue_is_ready
virtio_blk: fix panic in initialization error path
nbd: blk_mq_init_queue returns an error code on failure, not NULL
virtio_blk: avoid DMA to stack for the sense buffer
do_direct_IO: Use inode->i_blkbits to compute block count to be cleaned
If the last section of a core file ends with an unmapped or zero page,
the size of the file does not correspond with the last dump_skip() call.
gdb complains that the file is truncated and can be confusing to users.
After all of the vma sections are written, make sure that the file size
is no smaller than the current file position.
This problem can be demonstrated with gdb's bigcore testcase on the
sparc architecture.
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
CC: Stable <stable@vger.kernel.org>
file_info_lock is not initalized in initiate_cifs_search(), leading to the
following splat after a simple "mount.cifs ... dir && ls dir/":
BUG: spinlock bad magic on CPU#0, ls/486
lock: 0xffff880009301110, .magic: 00000000, .owner: <none>/-1, .owner_cpu: 0
CPU: 0 PID: 486 Comm: ls Not tainted 4.9.0 #27
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
ffffc900042f3db0 ffffffff81327533 0000000000000000 ffff880009301110
ffffc900042f3dd0 ffffffff810baf75 ffff880009301110 ffffffff817ae077
ffffc900042f3df0 ffffffff810baff6 ffff880009301110 ffff880008d69900
Call Trace:
[<ffffffff81327533>] dump_stack+0x65/0x92
[<ffffffff810baf75>] spin_dump+0x85/0xe0
[<ffffffff810baff6>] spin_bug+0x26/0x30
[<ffffffff810bb159>] do_raw_spin_lock+0xe9/0x130
[<ffffffff8159ad2f>] _raw_spin_lock+0x1f/0x30
[<ffffffff8127e50d>] cifs_closedir+0x4d/0x100
[<ffffffff81181cfd>] __fput+0x5d/0x160
[<ffffffff81181e3e>] ____fput+0xe/0x10
[<ffffffff8109410e>] task_work_run+0x7e/0xa0
[<ffffffff81002512>] exit_to_usermode_loop+0x92/0xa0
[<ffffffff810026f9>] syscall_return_slowpath+0x49/0x50
[<ffffffff8159b484>] entry_SYSCALL_64_fastpath+0xa7/0xa9
Fixes: 3afca265b5 ("Clarify locking of cifs file and tcon structures and make more granular")
Signed-off-by: Rabin Vincent <rabinv@axis.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Since we need to change the implementation, stop exposing internals.
Provide kref_read() to read the current reference count; typically
used for debug messages.
Kills two anti-patterns:
atomic_read(&kref->refcount)
kref->refcount.counter
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Since we need to change the implementation, stop exposing internals.
Provide KREF_INIT() to allow static initialization of struct kref.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When an ext4 fs is bogged down by a lot of metadata IOs (in the
reported case, it was deletion of millions of files, but any massive
amount of journal writes would do), after the journal is filled up,
tasks which try to access the filesystem and aren't currently
performing the journal writes end up waiting in
__jbd2_log_wait_for_space() for journal->j_checkpoint_mutex.
Because those mutex sleeps aren't marked as iowait, this condition can
lead to misleadingly low iowait and /proc/stat:procs_blocked. While
iowait propagation is far from strict, this condition can be triggered
fairly easily and annotating these sleeps correctly helps initial
diagnosis quite a bit.
Use the new mutex_lock_io() for journal->j_checkpoint_mutex so that
these sleeps are properly marked as iowait.
Reported-by: Mingbo Wan <mingbo@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jan Kara <jack@suse.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: kernel-team@fb.com
Link: http://lkml.kernel.org/r/1477673892-28940-5-git-send-email-tj@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull btrfs fixes from Chris Mason:
"These are all over the place.
The tracepoint part of the pull fixes a crash and adds a little more
information to two tracepoints, while the rest are good old fashioned
fixes"
* 'for-linus-4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
btrfs: make tracepoint format strings more compact
Btrfs: add truncated_len for ordered extent tracepoints
Btrfs: add 'inode' for extent map tracepoint
btrfs: fix crash when tracepoint arguments are freed by wq callbacks
Btrfs: adjust outstanding_extents counter properly when dio write is split
Btrfs: fix lockdep warning about log_mutex
Btrfs: use down_read_nested to make lockdep silent
btrfs: fix locking when we put back a delayed ref that's too new
btrfs: fix error handling when run_delayed_extent_op fails
btrfs: return the actual error value from from btrfs_uuid_tree_iterate
window.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQEcBAABCAAGBQJYeQymAAoJEEp/3jgCEfOLLVsH/28qRsjVPWr5JuL1SF86//kd
rAi7QUfbNgXHqbb10a9za9pNuLhHr3kImIfvQ04wYiYQY+IaAapiRXwQev8BsNAa
yENUc8XwNgydw4FU1ia5PkGOJLDtujtfgjWT2v+gf1HUzLaV6alBzqDwUZBt3xJz
mlYC82oFkXPa0BFmLUXtT/jJu/ZI8caO4KB34/UKi7LjBQk1ca7E2xVUoDtdQmEm
ciPE98akU4JiB99aOgGdwemBzkAMHEGQpImTzqHr/tbIUj0MqVAjH9FVOhRCbjMy
6MSR+U9yUzJkBzefS5enijAoExVc8cD/A0nIaKGVb6qWrIrk51/Opl6iILeVLUo=
=28cq
-----END PGP SIGNATURE-----
Merge tag 'ceph-for-4.10-rc4' of git://github.com/ceph/ceph-client
Pull ceph fixes from Ilya Dryomov:
"Two small fixups for the filesystem changes that went into this merge
window"
* tag 'ceph-for-4.10-rc4' of git://github.com/ceph/ceph-client:
ceph: fix get_oldest_context()
ceph: fix mds cluster availability check
If the server reboots multiple times, the client should rely on the
server to tell it that it cannot reclaim state as per section 9.6.3.4
in RFC7530 and section 8.4.2.1 in RFC5661.
Currently, the client is being to conservative, and is assuming that
if the server reboots while state recovery is in progress, then it must
ignore state that was not recovered before the reboot.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Commit bcb6f6d2b9 ("fuse: use timespec64") introduced clamped nsec values
in time_to_jiffies but used the max of nsec and NSEC_PER_SEC - 1 instead of
the min. Because of this, dentries would stay in the cache longer than
requested and go stale in scenarios that relied on their timely eviction.
Fixes: bcb6f6d2b9 ("fuse: use timespec64")
Signed-off-by: David Sheets <dsheets@docker.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: <stable@vger.kernel.org> # 4.9
fuse_abort_conn() moves requests from pending list to a temporary list
before canceling them. This operation races with request_wait_answer()
which also tries to remove the request after it gets a fatal signal. It
checks FR_PENDING flag to determine whether the request is still in the
pending list.
Make fuse_abort_conn() clear FR_PENDING flag so that request_wait_answer()
does not remove the request from temporary list.
This bug causes an Oops when trying to delete an already deleted list entry
in end_requests().
Fixes: ee314a870e ("fuse: abort: no fc->lock needed for request ending")
Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: <stable@vger.kernel.org> # 4.2+
Oops--in 916d2d844a I moved some constants into an array for
convenience, but here I'm accidentally writing to that array.
The effect is that if you ever encounter a filesystem lacking support
for ACLs or security labels, then all queries of supported attributes
will report that attribute as unsupported from then on.
Fixes: 916d2d844a "nfsd: clean up supported attribute handling"
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
If a file is renamed, but stays in the same directory, we will still receive
2 change_info4 structures describing the change to that directory, but we
only want to apply it once.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
We don't want to invalidate the directory attribute and data cache unless we
know that a file was created, or the change attribute differs from the one
in our cache.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
- Fix free space request handling when low on disk space
- Remove redundant log failure error messages
- Free truncate dirty pages instead of letting them build up forever
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABCgAGBQJYdnh5AAoJEPh/dxk0SrTrI5sP/2FYAxrj3Iw684e92PWeIe1c
pZir+0LEmlXgly4wprli7/BO6WZjESuMF3huBZZxtYLx3UUvGRavHup620z3Xa2e
YRdSiBnvchOe3F0UfFO9wT15vEOzwWS61FfX/g35t4mD6ItY0XISesTx4+XA3Cqi
8zf7OAjXI5WQietthwc9zmfhpyWgyw8CkeXVtqd5whNJN/6E80wh5IE6D6cJ1xkX
2qNicrrruTUACPyJxrTrR/0kkoxDoYgBapy3kqV0R1uK+ttNrGlGjfdapKs0tqMb
ezksj0sy/rx+HjfGEHx52mTiYRRTHEsGt/yMa1pT8o0p2wvR0nxnvzkvV8DeMoX3
0wYRxATsv/t8Oog+ug6khB/FtppSGJML+XxG3bV9itgCkAIFAbb7vZRrThnzVqOr
ChbQvBhchY4lYA1pef862QHLRraJF84HuY/ypyE6DX+nQWkGjDfEeHFPcm6eVSXG
xEplK8wjs09pSklH/OQD+GsO4eedBHc4hdzDuBTjCmt867/Lk8Uw4RIjGMYjbBx2
ovU3pTHfdlO6/qJmfSWnBAAZjKAjna/p47ZfDLzio23hb1hK68Qe/z8+tPnAzkHW
rowA6KfBi13tX6Njds/lceiYMrnMRPqNBA+Og1wCyljP2Q4shbADa6czrFDzPn6P
xZbfhKK2CTbYSvxi5S3G
=8bsr
-----END PGP SIGNATURE-----
Merge tag 'xfs-for-linus-4.10-rc4-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs fixes from Darrick Wong:
"As promised last week, here's some stability fixes from Christoph and
Jan Kara:
- fix free space request handling when low on disk space
- remove redundant log failure error messages
- free truncated dirty pages instead of letting them build up
forever"
* tag 'xfs-for-linus-4.10-rc4-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: Timely free truncated dirty pages
xfs: don't print warnings when xfs_log_force fails
xfs: don't rely on ->total in xfs_alloc_space_available
xfs: adjust allocation length in xfs_alloc_space_available
xfs: fix bogus minleft manipulations
xfs: bump up reserved blocks in xfs_alloc_set_aside
For no snapshot case, we should use ci->truncate_{seq,size}.
Fixes: 5f743e4566 ("ceph: record truncate size/seq for snap data writeback")
Signed-off-by: Geng, Jichao <geng.jichao@h3c.com>
Signed-off-by: Yan, Zheng <zyan@redhat.com>
We should apply the check after getting the initial mdsmap.
Fixes: e9e427f0a1 ("ceph: check availability of mds cluster on mount")
Link: http://tracker.ceph.com/issues/18161
Signed-off-by: Yan, Zheng <zyan@redhat.com>
I have reports of a crash that look like __fput() was called twice for
a NFSv4.0 file. It seems possible that the state manager could try to
reclaim a lock and take a reference on the fl->fl_file at the same time the
file is being released if, during the close(), a signal interrupts the wait
for outstanding IO while removing locks which then skips the removal
of that lock.
Since 83bfff23e9 ("nfs4: have do_vfs_lock take an inode pointer") has
removed the need to traverse fl->fl_file->f_inode in nfs4_lock_done(),
taking that reference is no longer necessary.
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
All block device data fields and functions returning a number of 512B
sectors are by convention named xxx_sectors while names in the form
xxx_size are generally used for a number of bytes. The blk_queue_zone_size
and bdev_zone_size functions were not following this convention so rename
them.
No functional change is introduced by this patch.
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Collapsed the two patches, they were nonsensically split and broke
bisection.
Signed-off-by: Jens Axboe <axboe@fb.com>
There is no need to call ext4_mark_inode_dirty while holding xattr_sem
or i_data_sem, so where it's easy to avoid it, move it out from the
critical region.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
In order to test the inode extra isize expansion code, it is useful to
be able to easily create file systems that have inodes with extra
isize values smaller than the current desired value.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Commit 99579ccec4 "xfs: skip dirty pages in ->releasepage()" started
to skip dirty pages in xfs_vm_releasepage() which also has the effect
that if a dirty page is truncated, it does not get freed by
block_invalidatepage() and is lingering in LRU list waiting for reclaim.
So a simple loop like:
while true; do
dd if=/dev/zero of=file bs=1M count=100
rm file
done
will keep using more and more memory until we hit low watermarks and
start pagecache reclaim which will eventually reclaim also the truncate
pages. Keeping these truncated (and thus never usable) pages in memory
is just a waste of memory, is unnecessarily stressing page cache
reclaim, and reportedly also leads to anonymous mmap(2) returning ENOMEM
prematurely.
So instead of just skipping dirty pages in xfs_vm_releasepage(), return
to old behavior of skipping them only if they have delalloc or unwritten
buffers and fix the spurious warnings by warning only if the page is
clean.
CC: stable@vger.kernel.org
CC: Brian Foster <bfoster@redhat.com>
CC: Vlastimil Babka <vbabka@suse.cz>
Reported-by: Petr Tůma <petr.tuma@d3s.mff.cuni.cz>
Fixes: 99579ccec4
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The crash happens rather often when we reset some cluster nodes while
nodes contend fiercely to do truncate and append.
The crash backtrace is below:
dlm: C21CBDA5E0774F4BA5A9D4F317717495: dlm_recover_grant 1 locks on 971 resources
dlm: C21CBDA5E0774F4BA5A9D4F317717495: dlm_recover 9 generation 5 done: 4 ms
ocfs2: Begin replay journal (node 318952601, slot 2) on device (253,18)
ocfs2: End replay journal (node 318952601, slot 2) on device (253,18)
ocfs2: Beginning quota recovery on device (253,18) for slot 2
ocfs2: Finishing quota recovery on device (253,18) for slot 2
(truncate,30154,1):ocfs2_truncate_file:470 ERROR: bug expression: le64_to_cpu(fe->i_size) != i_size_read(inode)
(truncate,30154,1):ocfs2_truncate_file:470 ERROR: Inode 290321, inode i_size = 732 != di i_size = 937, i_flags = 0x1
------------[ cut here ]------------
kernel BUG at /usr/src/linux/fs/ocfs2/file.c:470!
invalid opcode: 0000 [#1] SMP
Modules linked in: ocfs2_stack_user(OEN) ocfs2(OEN) ocfs2_nodemanager ocfs2_stackglue(OEN) quota_tree dlm(OEN) configfs fuse sd_mod iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi af_packet iscsi_ibft iscsi_boot_sysfs softdog xfs libcrc32c ppdev parport_pc pcspkr parport joydev virtio_balloon virtio_net i2c_piix4 acpi_cpufreq button processor ext4 crc16 jbd2 mbcache ata_generic cirrus virtio_blk ata_piix drm_kms_helper ahci syscopyarea libahci sysfillrect sysimgblt fb_sys_fops ttm floppy libata drm virtio_pci virtio_ring uhci_hcd virtio ehci_hcd usbcore serio_raw usb_common sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua scsi_mod autofs4
Supported: No, Unsupported modules are loaded
CPU: 1 PID: 30154 Comm: truncate Tainted: G OE N 4.4.21-69-default #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20151112_172657-sheep25 04/01/2014
task: ffff88004ff6d240 ti: ffff880074e68000 task.ti: ffff880074e68000
RIP: 0010:[<ffffffffa05c8c30>] [<ffffffffa05c8c30>] ocfs2_truncate_file+0x640/0x6c0 [ocfs2]
RSP: 0018:ffff880074e6bd50 EFLAGS: 00010282
RAX: 0000000000000074 RBX: 000000000000029e RCX: 0000000000000000
RDX: 0000000000000001 RSI: 0000000000000246 RDI: 0000000000000246
RBP: ffff880074e6bda8 R08: 000000003675dc7a R09: ffffffff82013414
R10: 0000000000034c50 R11: 0000000000000000 R12: ffff88003aab3448
R13: 00000000000002dc R14: 0000000000046e11 R15: 0000000000000020
FS: 00007f839f965700(0000) GS:ffff88007fc80000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007f839f97e000 CR3: 0000000036723000 CR4: 00000000000006e0
Call Trace:
ocfs2_setattr+0x698/0xa90 [ocfs2]
notify_change+0x1ae/0x380
do_truncate+0x5e/0x90
do_sys_ftruncate.constprop.11+0x108/0x160
entry_SYSCALL_64_fastpath+0x12/0x6d
Code: 24 28 ba d6 01 00 00 48 c7 c6 30 43 62 a0 8b 41 2c 89 44 24 08 48 8b 41 20 48 c7 c1 78 a3 62 a0 48 89 04 24 31 c0 e8 a0 97 f9 ff <0f> 0b 3d 00 fe ff ff 0f 84 ab fd ff ff 83 f8 fc 0f 84 a2 fd ff
RIP [<ffffffffa05c8c30>] ocfs2_truncate_file+0x640/0x6c0 [ocfs2]
It's because ocfs2_inode_lock() get us stale LVB in which the i_size is
not equal to the disk i_size. We mistakenly trust the LVB because the
underlaying fsdlm dlm_lock() doesn't set lkb_sbflags with
DLM_SBF_VALNOTVALID properly for us. But, why?
The current code tries to downconvert lock without DLM_LKF_VALBLK flag
to tell o2cb don't update RSB's LVB if it's a PR->NULL conversion, even
if the lock resource type needs LVB. This is not the right way for
fsdlm.
The fsdlm plugin behaves different on DLM_LKF_VALBLK, it depends on
DLM_LKF_VALBLK to decide if we care about the LVB in the LKB. If
DLM_LKF_VALBLK is not set, fsdlm will skip recovering RSB's LVB from
this lkb and set the right DLM_SBF_VALNOTVALID appropriately when node
failure happens.
The following diagram briefly illustrates how this crash happens:
RSB1 is inode metadata lock resource with LOCK_TYPE_USES_LVB;
The 1st round:
Node1 Node2
RSB1: PR
RSB1(master): NULL->EX
ocfs2_downconvert_lock(PR->NULL, set_lvb==0)
ocfs2_dlm_lock(no DLM_LKF_VALBLK)
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
dlm_lock(no DLM_LKF_VALBLK)
convert_lock(overwrite lkb->lkb_exflags
with no DLM_LKF_VALBLK)
RSB1: NULL RSB1: EX
reset Node2
dlm_recover_rsbs()
recover_lvb()
/* The LVB is not trustable if the node with EX fails and
* no lock >= PR is left. We should set RSB_VALNOTVALID for RSB1.
*/
if(!(kb_exflags & DLM_LKF_VALBLK)) /* This means we miss the chance to
return; * to invalid the LVB here.
*/
The 2nd round:
Node 1 Node2
RSB1(become master from recovery)
ocfs2_setattr()
ocfs2_inode_lock(NULL->EX)
/* dlm_lock() return the stale lvb without setting DLM_SBF_VALNOTVALID */
ocfs2_meta_lvb_is_trustable() return 1 /* so we don't refresh inode from disk */
ocfs2_truncate_file()
mlog_bug_on_msg(disk isize != i_size_read(inode)) /* crash! */
The fix is quite straightforward. We keep to set DLM_LKF_VALBLK flag
for dlm_lock() if the lock resource type needs LVB and the fsdlm plugin
is uesed.
Link: http://lkml.kernel.org/r/1481275846-6604-1-git-send-email-zren@suse.com
Signed-off-by: Eric Ren <zren@suse.com>
Reviewed-by: Joseph Qi <jiangqi903@gmail.com>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently dax_mapping_entry_mkclean() fails to clean and write protect
the pmd_t of a DAX PMD entry during an *sync operation. This can result
in data loss in the following sequence:
1) mmap write to DAX PMD, dirtying PMD radix tree entry and making the
pmd_t dirty and writeable
2) fsync, flushing out PMD data and cleaning the radix tree entry. We
currently fail to mark the pmd_t as clean and write protected.
3) more mmap writes to the PMD. These don't cause any page faults since
the pmd_t is dirty and writeable. The radix tree entry remains clean.
4) fsync, which fails to flush the dirty PMD data because the radix tree
entry was clean.
5) crash - dirty data that should have been fsync'd as part of 4) could
still have been in the processor cache, and is lost.
Fix this by marking the pmd_t clean and write protected in
dax_mapping_entry_mkclean(), which is called as part of the fsync
operation 2). This will cause the writes in step 3) above to generate
page faults where we'll re-dirty the PMD radix tree entry, resulting in
flushes in the fsync that happens in step 4).
Fixes: 4b4bb46d00 ("dax: clear dirty entry tags on cache flush")
Link: http://lkml.kernel.org/r/1482272586-21177-3-git-send-email-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The code currently uses sdio->blkbits to compute the number of blocks to
be cleaned. However sdio->blkbits is derived from the logical block size
of the underlying block device (Refer to the definition of
do_blockdev_direct_IO()). Due to this, generic/299 test would rarely
fail when executed on an ext4 filesystem with 64k as the block size and
when using a virtio based disk (having 512 byte as the logical block
size) inside a kvm guest.
This commit fixes the bug by using inode->i_blkbits to compute the
number of blocks to be cleaned.
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Fixed up by Jeff Moyer to only use/evaluate inode->i_blkbits once,
to avoid issues with block size changes with IO in flight.
Signed-off-by: Jens Axboe <axboe@fb.com>
We were checking block number without checking partition.
sbi->s_partmaps[iloc->partitionReferenceNum] could lead to
bad memory access. See udf_nfs_get_inode() path for instance.
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Jan Kara <jack@suse.cz>
Move all module attributes at the end of one file like other FS.
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Jan Kara <jack@suse.cz>
udf_update_extent_cache() is only called from inode_bmap()
with 1 for next_epos
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Jan Kara <jack@suse.cz>
loc & 0x02 is empty since first git version in 2005 in
udf_add_extendedattr()
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Jan Kara <jack@suse.cz>
Having struct kernel_long_ad laarr[EXTENT_MERGE_SIZE]
in all function arguments could be understood as by-value parameter.
Use kernel_long_ad pointer for functions depending on
inode_getblk()
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Jan Kara <jack@suse.cz>
This change was missed the tmpfs modification in In CVE-2016-7097
commit 073931017b ("posix_acl: Clear SGID bit when setting
file permissions")
It can test by xfstest generic/375, which failed to clear
setgid bit in the following test case on tmpfs:
touch $testfile
chown 100:100 $testfile
chmod 2755 $testfile
_runas -u 100 -g 101 -- setfacl -m u::rwx,g::rwx,o::rwx $testfile
Signed-off-by: Gu Zheng <guzheng1@huawei.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Add MS_KERNMOUNT to the flags that are passed.
Use sget_userns and force &init_user_ns instead of calling sget so that
even if called from a weird context the internal filesystem will be
considered to be in the intial user namespace.
Luis Ressel reported that the the failure to pass MS_KERNMOUNT into
mount_pseudo broke his in development graphics driver that uses the
generic drm infrastructure. I am not certain the deriver was bug
free in it's usage of that infrastructure but since
mount_pseudo_xattr can never be triggered by userspace it is clearer
and less error prone, and less problematic for the code to be explicit.
Reported-by: Luis Ressel <aranea@aixah.de>
Tested-by: Luis Ressel <aranea@aixah.de>
Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Protecting the mountpoint hashtable with namespace_sem was sufficient
until a call to umount_mnt was added to mntput_no_expire. At which
point it became possible for multiple calls of put_mountpoint on
the same hash chain to happen on the same time.
Kristen Johansen <kjlx@templeofstupid.com> reported:
> This can cause a panic when simultaneous callers of put_mountpoint
> attempt to free the same mountpoint. This occurs because some callers
> hold the mount_hash_lock, while others hold the namespace lock. Some
> even hold both.
>
> In this submitter's case, the panic manifested itself as a GP fault in
> put_mountpoint() when it called hlist_del() and attempted to dereference
> a m_hash.pprev that had been poisioned by another thread.
Al Viro observed that the simple fix is to switch from using the namespace_sem
to the mount_lock to protect the mountpoint hash table.
I have taken Al's suggested patch moved put_mountpoint in pivot_root
(instead of taking mount_lock an additional time), and have replaced
new_mountpoint with get_mountpoint a function that does the hash table
lookup and addition under the mount_lock. The introduction of get_mounptoint
ensures that only the mount_lock is needed to manipulate the mountpoint
hashtable.
d_set_mounted is modified to only set DCACHE_MOUNTED if it is not
already set. This allows get_mountpoint to use the setting of
DCACHE_MOUNTED to ensure adding a struct mountpoint for a dentry
happens exactly once.
Cc: stable@vger.kernel.org
Fixes: ce07d891a0 ("mnt: Honor MNT_LOCKED when detaching mounts")
Reported-by: Krister Johansen <kjlx@templeofstupid.com>
Suggested-by: Al Viro <viro@ZenIV.linux.org.uk>
Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
There are only two reasons for xfs_log_force / xfs_log_force_lsn to fail:
one is an I/O error, for which xlog_bdstrat already logs a warning, and
the second is an already shutdown log due to a previous I/O errors. In
the latter case we'll already have a previous indication for the actual
error, but the large stream of misleading warnings from xfs_log_force
will probably scroll it out of the message buffer.
Simply removing the warnings thus makes the XFS log reporting significantly
better.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
->total is a bit of an odd parameter passed down to the low-level
allocator all the way from the high-level callers. It's supposed to
contain the maximum number of blocks to be allocated for the whole
transaction [1].
But in xfs_iomap_write_allocate we only convert existing delayed
allocations and thus only have a minimal block reservation for the
current transaction, so xfs_alloc_space_available can't use it for
the allocation decisions. Use the maximum of args->total and the
calculated block requirement to make a decision. We probably should
get rid of args->total eventually and instead apply ->minleft more
broadly, but that will require some extensive changes all over.
[1] which creates lots of confusion as most callers don't decrement it
once doing a first allocation. But that's for a separate series.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
We must decide in xfs_alloc_fix_freelist if we can perform an
allocation from a given AG is possible or not based on the available
space, and should not fail the allocation past that point on a
healthy file system.
But currently we have two additional places that second-guess
xfs_alloc_fix_freelist: xfs_alloc_ag_vextent tries to adjust the
maxlen parameter to remove the reservation before doing the
allocation (but ignores the various minium freespace requirements),
and xfs_alloc_fix_minleft tries to fix up the allocated length
after we've found an extent, but ignores the reservations and also
doesn't take the AGFL into account (and thus fails allocations
for not matching minlen in some cases).
Remove all these later fixups and just correct the maxlen argument
inside xfs_alloc_fix_freelist once we have the AGF buffer locked.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
We can't just set minleft to 0 when we're low on space - that's exactly
what we need minleft for: to protect space in the AG for btree block
allocations when we are low on free space.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Setting aside 4 blocks globally for bmbt splits isn't all that useful,
as different threads can allocate space in parallel. Bump it to 4
blocks per AG to allow each thread that is currently doing an
allocation to dip into it separately. Without that we may no have
enough reserved blocks if there are enough parallel transactions
in an almost out space file system that all run into bmap btree
splits.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
-----BEGIN PGP SIGNATURE-----
iQIVAwUAWHNwyPSw1s6N8H32AQKoqw//Wi8fpY/7SlQ8UT0RcF4KlBtfKux4dhMh
c4P2ARqEi3hVHz0MAJSYwhJDiXmPT8FboXq7yQmXj7DpkwDUgEHJlOZyoZFrStWC
hE72lbwD/m57jYgTG694wJZnGvTtqBEEkoMMIiUTSpEkSxB8aGsL+8dP9E6Q5hBS
ixLUHINdjaubsu+uzlI3MZdDk7TWBwp5fNekf4Jbjlb9anoICEkJsjZJHTR9n3nM
d9QpEbh42+YHAn2EFL8gXN+Cb7o75QppT3K+b68Pz43yvPgMLd78Q4tSN0aCo190
9ynR1szpniiw3T/xW0dGanpRjKLs7HZubTujc1oQ+TD1Q1Uh+2/nZWb9PxWAAe3S
CW+ssn6slv9IS+KXyoIMbDtyPaJOu1pMxYcFVXlZOAPXnYGl8P0A610f8u9833jT
OEqVKQ/bHAPiiTl2X/ATzCePhATtoYUq7jIc71pP01WK+o054bzm0r9Wyjxgs7g6
iPi4cfueZFOJMilkE9ZWuIws43YDv5wIEOWtpTkRCIHKCmkeVXkDfdRnnXhJCUeF
6y3iW0staR/pnTqI6g8LEnGku2gbteBQNCueYoJA5jsxLyl6oJw1Bur7yGTzzPnJ
SP+9+RBlyGI5EzIcqQWsReOhGY4U/hOWDtltYR/gmlhlQ2o/iO4U1aiN0qa1AiaH
3ixixVygYOA=
=H/FD
-----END PGP SIGNATURE-----
Merge tag 'rxrpc-rewrite-20170109' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
David Howells says:
====================
afs: Refcount afs_call struct
These patches provide some tracepoints for AFS and fix a potential leak by
adding refcounting to the afs_call struct.
The patches are:
(1) Add some tracepoints for logging incoming calls and monitoring
notifications from AF_RXRPC and data reception.
(2) Get rid of afs_wait_mode as it didn't turn out to be as useful as
initially expected. It can be brought back later if needed. This
clears some stuff out that I don't then need to fix up in (4).
(3) Allow listen(..., 0) to be used to disable listening. This makes
shutting down the AFS cache manager server in the kernel much easier
and the accounting simpler as we can then be sure that (a) all
preallocated afs_call structs are relesed and (b) no new incoming
calls are going to be started.
For the moment, listening cannot be reenabled.
(4) Add refcounting to the afs_call struct to fix a potential multiple
release detected by static checking and add a tracepoint to follow the
lifecycle of afs_call objects.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Processes can only alter their own security attributes via
/proc/pid/attr nodes. This is presently enforced by each individual
security module and is also imposed by the Linux credentials
implementation, which only allows a task to alter its own credentials.
Move the check enforcing this restriction from the individual
security modules to proc_pid_attr_write() before calling the security hook,
and drop the unnecessary task argument to the security hook since it can
only ever be the current task.
Signed-off-by: Stephen Smalley <sds@tycho.nsa.gov>
Acked-by: Casey Schaufler <casey@schaufler-ca.com>
Acked-by: John Johansen <john.johansen@canonical.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
A static checker warning occurs in the AFS filesystem:
fs/afs/cmservice.c:155 SRXAFSCB_CallBack()
error: dereferencing freed memory 'call'
due to the reply being sent before we access the server it points to. The
act of sending the reply causes the call to be freed if an error occurs
(but not if it doesn't).
On top of this, the lifetime handling of afs_call structs is fragile
because they get passed around through workqueues without any sort of
refcounting.
Deal with the issues by:
(1) Fix the maybe/maybe not nature of the reply sending functions with
regards to whether they release the call struct.
(2) Refcount the afs_call struct and sort out places that need to get/put
references.
(3) Pass a ref through the work queue and release (or pass on) that ref in
the work function. Care has to be taken because a work queue may
already own a ref to the call.
(4) Do the cleaning up in the put function only.
(5) Simplify module cleanup by always incrementing afs_outstanding_calls
whenever a call is allocated.
(6) Set the backlog to 0 with kernel_listen() at the beginning of the
process of closing the socket to prevent new incoming calls from
occurring and to remove the contribution of preallocated calls from
afs_outstanding_calls before we wait on it.
A tracepoint is also added to monitor the afs_call refcount and lifetime.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Fixes: 08e0e7c82e: "[AF_RXRPC]: Make the in-kernel AFS filesystem use AF_RXRPC."
The afs_wait_mode struct isn't really necessary. Client calls only use one
of a choice of two (synchronous or the asynchronous) and incoming calls
don't use the wait at all. Replace with a boolean parameter.
Signed-off-by: David Howells <dhowells@redhat.com>
'inode' is an important field for btrfs_get_extent, lets trace it.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Enabling btrfs tracepoints leads to instant crash, as reported. The wq
callbacks could free the memory and the tracepoints started to
dereference the members to get to fs_info.
The proposed fix https://marc.info/?l=linux-btrfs&m=148172436722606&w=2
removed the tracepoints but we could preserve them by passing only the
required data in a safe way.
Fixes: bc074524e1 ("btrfs: prefix fsid to all trace events")
CC: stable@vger.kernel.org # 4.8+
Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add three tracepoints to the AFS filesystem:
(1) The afs_recv_data tracepoint logs data segments that are extracted
from the data received from the peer through afs_extract_data().
(2) The afs_notify_call tracepoint logs notification from AF_RXRPC of data
coming in to an asynchronous call.
(3) The afs_cb_call tracepoint logs incoming calls that have had their
operation ID extracted and mapped into a supported cache manager
service call.
To make (3) work, the name strings in the afs_call_type struct objects have
to be annotated with __tracepoint_string. This is done with the CM_NAME()
macro.
Further, the AFS call state enum needs a name so that it can be used to
declare parameter types.
Signed-off-by: David Howells <dhowells@redhat.com>
Inside ext4_ext_shift_extents() function ext4_find_extent() is called
without EXT4_EX_NOCACHE flag, which should prevent cache population.
This leads to oudated offsets in the extents tree and wrong blocks
afterwards.
Patch fixes the problem providing EXT4_EX_NOCACHE flag for each
ext4_find_extents() call inside ext4_ext_shift_extents function.
Fixes: 331573febb
Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: Namjae Jeon <namjae.jeon@samsung.com>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: stable@vger.kernel.org
While doing 'insert range' start block should be also shifted right.
The bug can be easily reproduced by the following test:
ptr = malloc(4096);
assert(ptr);
fd = open("./ext4.file", O_CREAT | O_TRUNC | O_RDWR, 0600);
assert(fd >= 0);
rc = fallocate(fd, 0, 0, 8192);
assert(rc == 0);
for (i = 0; i < 2048; i++)
*((unsigned short *)ptr + i) = 0xbeef;
rc = pwrite(fd, ptr, 4096, 0);
assert(rc == 4096);
rc = pwrite(fd, ptr, 4096, 4096);
assert(rc == 4096);
for (block = 2; block < 1000; block++) {
rc = fallocate(fd, FALLOC_FL_INSERT_RANGE, 4096, 4096);
assert(rc == 0);
for (i = 0; i < 2048; i++)
*((unsigned short *)ptr + i) = block;
rc = pwrite(fd, ptr, 4096, 4096);
assert(rc == 4096);
}
Because start block is not included in the range the hole appears at
the wrong offset (just after the desired offset) and the following
pwrite() overwrites already existent block, keeping hole untouched.
Simple way to verify wrong behaviour is to check zeroed blocks after
the test:
$ hexdump ./ext4.file | grep '0000 0000'
The root cause of the bug is a wrong range (start, stop], where start
should be inclusive, i.e. [start, stop].
This patch fixes the problem by including start into the range. But
not to break left shift (range collapse) stop points to the beginning
of the a block, not to the end.
The other not obvious change is an iterator check on validness in a
main loop. Because iterator is unsigned the following corner case
should be considered with care: insert a block at 0 offset, when stop
variables overflows and never becomes less than start, which is 0.
To handle this special case iterator is set to NULL to indicate that
end of the loop is reached.
Fixes: 331573febb
Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: Namjae Jeon <namjae.jeon@samsung.com>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: stable@vger.kernel.org
There was an unnecessary amount of complexity around requesting the
filesystem-specific key prefix. It was unclear why; perhaps it was
envisioned that different instances of the same filesystem type could
use different key prefixes, or that key prefixes could be binary.
However, neither of those things were implemented or really make sense
at all. So simplify the code by making key_prefix a const char *.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Richard Weinberger <richard@nod.at>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
While we allow deletes without the key, the following should not be
permitted:
# cd /vdc/encrypted-dir-without-key
# ls -l
total 4
-rw-r--r-- 1 root root 0 Dec 27 22:35 6,LKNRJsp209FbXoSvJWzB
-rw-r--r-- 1 root root 286 Dec 27 22:35 uRJ5vJh9gE7vcomYMqTAyD
# mv uRJ5vJh9gE7vcomYMqTAyD 6,LKNRJsp209FbXoSvJWzB
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Before this patch, if a process called function gfs2_log_reserve to
reserve some journal blocks, but the journal not enough blocks were
free, it would call io_schedule. However, in the log flush daemon,
it woke up the waiters only if an gfs2_ail_flush was no longer
required. This resulted in situations where processes would wait
forever because the number of blocks required was so high that it
pushed the journal into a perpetual state of flush being required.
This patch changes the logd daemon so that it wakes up io waiters
every time the log is actually flushed.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Make afs_readpages() use afs_vnode_fetch_data()'s new ability to take a
list of pages and do a bulk fetch.
Signed-off-by: David Howells <dhowells@redhat.com>
Make afs_fs_fetch_data() take a list of pages for bulk data transfer. This
will allow afs_readpages() to be made more efficient.
Signed-off-by: David Howells <dhowells@redhat.com>
Pull audit fixes from Paul Moore:
"Two small fixes relating to audit's use of fsnotify.
The first patch plugs a leak and the second fixes some lock
shenanigans. The patches are small and I banged on this for an
afternoon with our testsuite and didn't see anything odd"
* 'stable-4.10' of git://git.infradead.org/users/pcmoore/audit:
audit: Fix sleep in atomic
fsnotify: Remove fsnotify_duplicate_mark()
Before this patch, the logd daemon only tried to flush things when
the log blocks pinned exceeded a certain threshold. But when we're
deleting very large files, it may require a huge number of journal
blocks, and that, in turn, may exceed the threshold. This patch
factors that into account.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
This patch limits the number of transaction blocks requested during
file truncates. If we have very large multi-terabyte files, and want
to delete or truncate them, they might span so many resource groups
that we overflow the journal blocks, and cause an assert failure.
By limiting the number of blocks in the transaction, we prevent this
overflow and give other running processes time to do transactions.
The limiting factor I chose is sd_log_thresh2 which is currently
set to 4/5ths of the journal. This same ratio is used in function
gfs2_ail_flush_reqd to determine when a log flush is required.
If we make the maximum value less than this, we can get into a
infinite hang whereby the log stops moving because the number of
used blocks is less than the threshold and the iterative loop
needs more, but since we're under the threshold, the log daemon
never starts any IO on the log.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
UDF encodes symlinks in a more complex fashion and thus i_size of a
symlink does not match the lenght of a string returned by readlink(2).
This confuses some applications (see bug 191241) and may be considered a
violation of POSIX. Fix the problem by reading the link into page cache
in response to stat(2) call and report the length of the decoded path.
Signed-off-by: Jan Kara <jack@suse.cz>
- Fixes for crashes and double-cleanup errors
- XFS maintainership handover
- Fix to prevent absurdly large block reservations
- Fix broken sysfs getter/setters
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABCgAGBQJYbH10AAoJEPh/dxk0SrTrozkQAIo1ikGMKI0x52izCJyKA+HP
gUfqqFnFDOsz0pWBDhRMBBGXbQlgU6uPRMicTdkNSRq3BnKPyfC7EK8WkXlOGT8A
JGQhz96sfr4gShuo4lul2nhgfThOyL7M3Vu+xHL7wgrrOV1Y4Haz2m2FKYzNentj
2ca2WeNCcuEQLdWwtwkeOnsjnC2gV9cA5pRsx59rktr/t6fU1Q+AcBSjpwDX43op
cQ0uTqiBB51pWe2tbw+VHSsYzyakkjsNsiYCZNOghN+p/5g4QpgT7LgiO9r5CVvu
UhAXv6285K4pVsD7cIy2yHvSFbYNz1khM1Tvv26npg72NoQxMXhxAaPkke03u2lT
4nakgqLpxybrrnviVsJy2VbnmVYN5mcXSV2XTmRdAtxZmrW9C6H2PCs9pQYSV1Ji
H6UT7kqusqp4ceQUBb5dtFCaUndNGnLxKDrmNOz/It7PPpI0zSRMiv+IC+qrQ/ci
oEExMUtRLLah5zXOXQej2IchZmP2SBXm/a2JhmQywTIyLiSPoDwKsNET9zT4CiXw
CHQj3spGh8uE7rh0QbujWjD7odE1IWDvOZ9Zh5vXVrwqTIeVOlxjfBqO/9J0z2Ak
z7WpRrQ2IhWMwD6pHg0oDUD9oB02LxnW24telQs35wgqgdZm5TfG6BaiKutzA7KH
nVlR22SQ4m+eTD8+IFMO
=TgC5
-----END PGP SIGNATURE-----
Merge tag 'xfs-for-linus-4.10-rc3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs fixes from Darrick Wong:
- fixes for crashes and double-cleanup errors
- XFS maintainership handover
- fix to prevent absurdly large block reservations
- fix broken sysfs getter/setters
* tag 'xfs-for-linus-4.10-rc3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: fix max_retries _show and _store functions
xfs: update MAINTAINERS
xfs: fix crash and data corruption due to removal of busy COW extents
xfs: use the actual AG length when reserving blocks
xfs: fix double-cleanup when CUI recovery fails
Pull block layer fixes from Jens Axboe:
"A set of fixes for the current series, one fixing a regression with
block size < page cache size in the alias series from Jan. Outside of
that, two small cleanups for wbt from Bart, a nvme pull request from
Christoph, and a few small fixes of documentation updates"
* 'for-linus' of git://git.kernel.dk/linux-block:
block: fix up io_poll documentation
block: Avoid that sparse complains about context imbalance in __wbt_wait()
block: Make wbt_wait() definition consistent with declaration
clean_bdev_aliases: Prevent cleaning blocks that are not in block range
genhd: remove dead and duplicated scsi code
block: add back plugging in __blkdev_direct_IO
nvmet/fcloop: remove some logically dead code performing redundant ret checks
nvmet: fix KATO offset in Set Features
nvme/fc: simplify error handling of nvme_fc_create_hw_io_queues
nvme/fc: correct some printk information
nvme/scsi: Remove START STOP emulation
nvme/pci: Delete misleading queue-wrap comment
nvme/pci: Fix whitespace problem
nvme: simplify stripe quirk
nvme: update maintainers information
max_retries _show and _store functions should test against cfg->max_retries,
not cfg->retry_timeout
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
There is a race window between write_cache_pages calling
clear_page_dirty_for_io and XFS calling set_page_writeback, in which
the mapping for an inode is tagged neither as dirty, nor as writeback.
If the COW shrinker hits in exactly that window we'll remove the delayed
COW extents and writepages trying to write it back, which in release
kernels will manifest as corruption of the bmap btree, and in debug
kernels will trip the ASSERT about now calling xfs_bmapi_write with the
COWFORK flag for holes. A complex customer load manages to hit this
window fairly reliably, probably by always having COW writeback in flight
while the cow shrinker runs.
This patch adds another check for having the I_DIRTY_PAGES flag set,
which is still set during this race window. While this fixes the problem
I'm still not overly happy about the way the COW shrinker works as it
still seems a bit fragile.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
We need to use the actual AG length when making per-AG reservations,
since we could otherwise end up reserving more blocks out of the last
AG than there are actual blocks.
Complained-about-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Dan Carpenter reported a double-free of rcur if _defer_finish fails
while we're recovering CUI items. Fix the error recovery to prevent
this.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Currently how btrfs dio deals with split dio write is not good
enough if dio write is split into several segments due to the
lack of contiguous space, a large dio write like 'dd bs=1G count=1'
can end up with incorrect outstanding_extents counter and endio
would complain loudly with an assertion.
This fixes the problem by compensating the outstanding_extents
counter in inode if a large dio write gets split.
Reported-by: Anand Jain <anand.jain@oracle.com>
Tested-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
While checking INODE_REF/INODE_EXTREF for a corner case, we may acquire a
different inode's log_mutex with holding the current inode's log_mutex, and
lockdep has complained this with a possilble deadlock warning.
Fix this by using mutex_lock_nested() when processing the other inode's
log_mutex.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If @block_group is not @used_bg, it'll try to get @used_bg's lock without
droping @block_group 's lock and lockdep has throwed a scary deadlock warning
about it.
Fix it by using down_read_nested.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In __btrfs_run_delayed_refs, when we put back a delayed ref that's too
new, we have already dropped the lock on locked_ref when we set
->processing = 0.
This patch keeps the lock to cover that assignment.
Fixes: d7df2c796d (Btrfs: attach delayed ref updates to delayed ref heads)
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In __btrfs_run_delayed_refs, the error path when run_delayed_extent_op
fails sets locked_ref->processing = 0 but doesn't re-increment
delayed_refs->num_heads_ready. As a result, we end up triggering
the WARN_ON in btrfs_select_ref_head.
Fixes: d7df2c796d (Btrfs: attach delayed ref updates to delayed ref heads)
Reported-by: Jon Nelson <jnelson-suse@jamponi.net>
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
CURRENT_TIME is not y2038 safe.
CURRENT_TIME macro is also not appropriate for filesystems
as it doesn't use the right granularity for filesystem
timestamps.
Logical Volume Integrity format is described to have the
same timestamp format for "Recording Date and time" as
the other [a,c,m]timestamps.
The function udf_time_to_disk_format() does this conversion.
Hence the timestamp is passed directly to the function and
not truncated. This is as per Arnd's suggestion on the
thread.
This is also in preparation for the patch that transitions
vfs timestamps to use 64 bit time and hence make them
y2038 safe. As part of the effort current_time() will be
extended to do range checks.
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Jan Kara <jack@suse.cz>
crypto tree during the merge window.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAlhrCIIACgkQ8vlZVpUN
gaP0rAf8DehnxAXTdGwCDKJ76Xgkd4C0vYwNYsWrwbEsD6dMXPmfhDVA40ZefFWY
4UQaPeoDSXQnIxw+6gi6LFCJeYs+dc9ZWHk++w5kEMclIUONomODDAQLMJbpG+5t
pkEwOzjTaKbIQ5n4r3rMJtlBlrZX+ZVJmMt3sYAMWhIq7Bf7dRy6AC7+vyM5VTce
AYvFpureLd7pJT0AcNvg5oPnXIFiPlKi6knlmAdJ32I4FQQO07aDA37mLPKdff4/
uKs4PGKTa9MCGw+blMDJ/208kBQPPn8JZ7yGQCdGw16CUaoSXLregqu6SNs2MKaQ
WjmBFyEUssScTeAq8rYJVlU7FYxwdQ==
=VVfb
-----END PGP SIGNATURE-----
Merge tag 'fscrypt-for-stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/fscrypt
Pull fscrypt fixes from Ted Ts'o:
"Two fscrypt bug fixes, one of which was unmasked by an update to the
crypto tree during the merge window"
* tag 'fscrypt-for-stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/fscrypt:
fscrypt: fix renaming and linking special files
fscrypt: fix the test_dummy_encryption mount option
Currently, the test_dummy_encryption ext4 mount option, which exists
only to test encrypted I/O paths with xfstests, overrides all
per-inode encryption keys with a fixed key.
This change minimizes test_dummy_encryption-specific code path changes
by supplying a fake context for directories which are not encrypted
for use when creating new directories, files, or symlinks. This
allows us to properly exercise the keyring lookup, derivation, and
context inheritance code paths.
Before mounting a file system using test_dummy_encryption, userspace
must execute the following shell commands:
mode='\x00\x00\x00\x00'
raw="$(printf ""\\\\x%02x"" $(seq 0 63))"
if lscpu | grep "Byte Order" | grep -q Little ; then
size='\x40\x00\x00\x00'
else
size='\x00\x00\x00\x40'
fi
key="${mode}${raw}${size}"
keyctl new_session
echo -n -e "${key}" | keyctl padd logon fscrypt:4242424242424242 @s
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
The first block to be cleaned may start at a non-zero page offset. In
such a scenario clean_bdev_aliases() will end up cleaning blocks that
do not fall in the range of blocks to be cleaned. This commit fixes the
issue by skipping blocks that do not fall in valid block range.
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Eryu Guan <eguan@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
That way we can get rid of the direct dependency on CONFIG_BLOCK.
Fixes: d475a50745 ("ubifs: Add skeleton for fscrypto")
Reported-by: Arnd Bergmann <arnd@arndb.de>
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Gstir <david@sigma-star.at>
Signed-off-by: Richard Weinberger <richard@nod.at>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
It was possible for the ->get_context() operation to fail with a
specific error code, which was then not returned to the caller of
FS_IOC_SET_ENCRYPTION_POLICY or FS_IOC_GET_ENCRYPTION_POLICY. Make sure
to pass through these error codes. Also reorganize the code so that
->get_context() only needs to be called one time when setting an
encryption policy, and handle contexts of unrecognized sizes more
appropriately.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Several warning messages were not rate limited and were user-triggerable
from FS_IOC_SET_ENCRYPTION_POLICY. These shouldn't really have been
there in the first place, but either way they aren't as useful now that
the error codes have been improved. So just remove them.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
As part of an effort to clean up fscrypt-related error codes, make
FS_IOC_SET_ENCRYPTION_POLICY fail with EEXIST when the file already uses
a different encryption policy. This is more descriptive than EINVAL,
which was ambiguous with some of the other error cases.
I am not aware of any users who might be relying on the previous error
code of EINVAL, which was never documented anywhere.
This failure case will be exercised by an xfstest.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
As part of an effort to clean up fscrypt-related error codes, make
FS_IOC_SET_ENCRYPTION_POLICY fail with ENOTDIR when the file descriptor
does not refer to a directory. This is more descriptive than EINVAL,
which was ambiguous with some of the other error cases.
I am not aware of any users who might be relying on the previous error
code of EINVAL, which was never documented anywhere, and in some buggy
kernels did not exist at all as the S_ISDIR() check was missing.
This failure case will be exercised by an xfstest.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
As part of an effort to clean up fscrypt-related error codes, make
attempting to create a file in an encrypted directory that hasn't been
"unlocked" fail with ENOKEY. Previously, several error codes were used
for this case, including ENOENT, EACCES, and EPERM, and they were not
consistent between and within filesystems. ENOKEY is a better choice
because it expresses that the failure is due to lacking the encryption
key. It also matches the error code returned when trying to open an
encrypted regular file without the key.
I am not aware of any users who might be relying on the previous
inconsistent error codes, which were never documented anywhere.
This failure case will be exercised by an xfstest.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Attempting to link a device node, named pipe, or socket file into an
encrypted directory through rename(2) or link(2) always failed with
EPERM. This happened because fscrypt_has_permitted_context() saw that
the file was unencrypted and forbid creating the link. This behavior
was unexpected because such files are never encrypted; only regular
files, directories, and symlinks can be encrypted.
To fix this, make fscrypt_has_permitted_context() always return true on
special files.
This will be covered by a test in my encryption xfstests patchset.
Fixes: 9bd8212f98 ("ext4 crypto: add encryption policy and password salt support")
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Richard Weinberger <richard@nod.at>
Cc: stable@vger.kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Commit f1c131b454: "crypto: xts - Convert to skcipher" now fails
the setkey operation if the AES key is the same as the tweak key.
Previously this check was only done if FIPS mode is enabled. Now this
check is also done if weak key checking was requested. This is
reasonable, but since we were using the dummy key which was a constant
series of 0x42 bytes, it now caused dummy encrpyption test mode to
fail.
Fix this by using 0x42... and 0x24... for the two keys, so they are
different.
Fixes: f1c131b454
Cc: stable@vger.kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Now that dax_iomap_fault() calls ->iomap_begin() without entry lock, we
can use transaction starting in ext4_iomap_begin() and thus simplify
ext4_dax_fault(). It also provides us proper retries in case of ENOSPC.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Currently ->iomap_begin() handler is called with entry lock held. If the
filesystem held any locks between ->iomap_begin() and ->iomap_end()
(such as ext4 which will want to hold transaction open), this would cause
lock inversion with the iomap_apply() from standard IO path which first
calls ->iomap_begin() and only then calls ->actor() callback which grabs
entry locks for DAX (if it faults when copying from/to user provided
buffers).
Fix the problem by nesting grabbing of entry lock inside ->iomap_begin()
- ->iomap_end() pair.
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
The only case when we do not finish the page fault completely is when we
are loading hole pages into a radix tree. Avoid this special case and
finish the fault in that case as well inside the DAX fault handler. It
will allow us for easier iomap handling.
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Currently dax_iomap_rw() takes care of invalidating page tables and
evicting hole pages from the radix tree when write(2) to the file
happens. This invalidation is only necessary when there is some block
allocation resulting from write(2). Furthermore in current place the
invalidation is racy wrt page fault instantiating a hole page just after
we have invalidated it.
So perform the page invalidation inside dax_iomap_actor() where we can
do it only when really necessary and after blocks have been allocated so
nobody will be instantiating new hole pages anymore.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Currently invalidate_inode_pages2_range() and invalidate_mapping_pages()
just delete all exceptional radix tree entries they find. For DAX this
is not desirable as we track cache dirtiness in these entries and when
they are evicted, we may not flush caches although it is necessary. This
can for example manifest when we write to the same block both via mmap
and via write(2) (to different offsets) and fsync(2) then does not
properly flush CPU caches when modification via write(2) was the last
one.
Create appropriate DAX functions to handle invalidation of DAX entries
for invalidate_inode_pages2_range() and invalidate_mapping_pages() and
wire them up into the corresponding mm functions.
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
So far we did not return BH_New buffers from ext2_get_blocks() when we
allocated and zeroed-out a block for DAX inode to avoid racy zeroing in
DAX code. This zeroing is gone these days so we can remove the
workaround.
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
No point in going through loops and hoops instead of just comparing the
values.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
ktime_set(S,N) was required for the timespec storage type and is still
useful for situations where a Seconds and Nanoseconds part of a time value
needs to be converted. For anything where the Seconds argument is 0, this
is pointless and can be replaced with a simple assignment.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
ktime is a union because the initial implementation stored the time in
scalar nanoseconds on 64 bit machine and in a endianess optimized timespec
variant for 32bit machines. The Y2038 cleanup removed the timespec variant
and switched everything to scalar nanoseconds. The union remained, but
become completely pointless.
Get rid of the union and just keep ktime_t as simple typedef of type s64.
The conversion was done with coccinelle and some manual mopping up.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
This was entirely automated, using the script by Al:
PATT='^[[:blank:]]*#[[:blank:]]*include[[:blank:]]*<asm/uaccess.h>'
sed -i -e "s!$PATT!#include <linux/uaccess.h>!" \
$(git grep -l "$PATT"|grep -v ^include/linux/uaccess.h)
to do the replacement at the end of the merge window.
Requested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull cifs fixes from Steve French:
"This ncludes various cifs/smb3 bug fixes, mostly for stable as well.
In the next week I expect that Germano will have some reconnection
fixes, and also I expect to have the remaining pieces of the snapshot
enablement and SMB3 ACLs, but wanted to get this set of bug fixes in"
* 'for-next' of git://git.samba.org/sfrench/cifs-2.6:
cifs_get_root shouldn't use path with tree name
Fix default behaviour for empty domains and add domainauto option
cifs: use %16phN for formatting md5 sum
cifs: Fix smbencrypt() to stop pointing a scatterlist at the stack
CIFS: Fix a possible double locking of mutex during reconnect
CIFS: Fix a possible memory corruption during reconnect
CIFS: Fix a possible memory corruption in push locks
CIFS: Fix missing nls unload in smb2_reconnect()
CIFS: Decrease verbosity of ioctl call
SMB3: parsing for new snapshot timestamp mount parm
There are only two calls sites of fsnotify_duplicate_mark(). Those are
in kernel/audit_tree.c and both are bogus. Vfsmount pointer is unused
for audit tree, inode pointer and group gets set in
fsnotify_add_mark_locked() later anyway, mask and free_mark are already
set in alloc_chunk(). In fact, calling fsnotify_duplicate_mark() is
actively harmful because following fsnotify_add_mark_locked() will leak
group reference by overwriting the group pointer. So just remove the two
calls to fsnotify_duplicate_mark() and the function.
Signed-off-by: Jan Kara <jack@suse.cz>
[PM: line wrapping to fit in 80 chars]
Signed-off-by: Paul Moore <paul@paul-moore.com>
Pull final vfs updates from Al Viro:
"Assorted cleanups and fixes all over the place"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
sg_write()/bsg_write() is not fit to be called under KERNEL_DS
ufs: fix function declaration for ufs_truncate_blocks
fs: exec: apply CLOEXEC before changing dumpable task flags
seq_file: reset iterator to first record for zero offset
vfs: fix isize/pos/len checks for reflink & dedupe
[iov_iter] fix iterate_all_kinds() on empty iterators
move aio compat to fs/aio.c
reorganize do_make_slave()
clone_private_mount() doesn't need to touch namespace_sem
remove a bogus claim about namespace_sem being held by callers of mnt_alloc_id()
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJYW7vZAAoJEGu/nxmHO1GNeGUIAJil3Q4ZaeOaaj5uNs4h64kc
0BAfGSwzGNgreX5PWm+jQVeh6xbAqXnYtsWIDSibpxnXOhAZcXHbpzKLTwlMl4rh
qpXAAWhHcBsOKiNcg++RRmouubYtpgMoOKCgo/DzGp51mSV7/8K2mugzDRohPUsR
jUDqUa9qvt65uqI5xCuK1n3aLtCQ9m3RUzDfQbH4fK/yBXpNIE83xegU1SBJKZHj
uGPJpjHhc1vaba6Y8vDDBHuJR9IJxfeSnoJE0xMmGlIub40exw7P4Dek1Tc/3G+R
qiqT9aGAbegkFDerps5sqOLbU4Lm4Js8Ov78l3IN1FSVdYWsptzRibjIbUidPdc=
=zypk
-----END PGP SIGNATURE-----
Merge tag 'befs-v4.10-rc1' of git://github.com/luisbg/linux-befs
Pull befs updates from Luis de Bethencourt:
"A series of small fixes and adding NFS export support"
* tag 'befs-v4.10-rc1' of git://github.com/luisbg/linux-befs:
befs: add NFS export support
befs: remove trailing whitespaces
befs: remove signatures from comments
befs: fix style issues in header files
befs: fix style issues in linuxvfs.c
befs: fix typos in linuxvfs.c
befs: fix style issues in io.c
befs: fix style issues in inode.c
befs: fix style issues in debug.c
sparse says:
fs/ufs/inode.c:1195:6: warning: symbol 'ufs_truncate_blocks' was not declared. Should it be static?
Note that the forward declaration in the file is already marked static.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
If you have a process that has set itself to be non-dumpable, and it
then undergoes exec(2), any CLOEXEC file descriptors it has open are
"exposed" during a race window between the dumpable flags of the process
being reset for exec(2) and CLOEXEC being applied to the file
descriptors. This can be exploited by a process by attempting to access
/proc/<pid>/fd/... during this window, without requiring CAP_SYS_PTRACE.
The race in question is after set_dumpable has been (for get_link,
though the trace is basically the same for readlink):
[vfs]
-> proc_pid_link_inode_operations.get_link
-> proc_pid_get_link
-> proc_fd_access_allowed
-> ptrace_may_access(task, PTRACE_MODE_READ_FSCREDS);
Which will return 0, during the race window and CLOEXEC file descriptors
will still be open during this window because do_close_on_exec has not
been called yet. As a result, the ordering of these calls should be
reversed to avoid this race window.
This is of particular concern to container runtimes, where joining a
PID namespace with file descriptors referring to the host filesystem
can result in security issues (since PRCTL_SET_DUMPABLE doesn't protect
against access of CLOEXEC file descriptors -- file descriptors which may
reference filesystem objects the container shouldn't have access to).
Cc: dev@opencontainers.org
Cc: <stable@vger.kernel.org> # v3.2+
Reported-by: Michael Crosby <crosbymichael@gmail.com>
Signed-off-by: Aleksa Sarai <asarai@suse.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
If kernfs file is empty on a first read, successive read operations
using the same file descriptor will return no data, even when data is
available. Default kernfs 'seq_next' implementation advances iterator
position even when next object is not there. Kernfs 'seq_start' for
following requests will not return iterator as position is already on
the second object.
This defect doesn't allow to monitor badblocks sysfs files from MD raid.
They are initially empty but if data appears at some stage, userspace is
not able to read it.
Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Strengthen the checking of pos/len vs. i_size, clarify the return values
for the clone prep function, and remove pointless code.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
... and fix the minor buglet in compat io_submit() - native one
kills ioctx as cleanup when put_user() fails. Get rid of
bogus compat_... in !CONFIG_AIO case, while we are at it - they
should simply fail with ENOSYS, same as for native counterparts.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This allows sending larger than 1 MB requests to devices that support
large I/O sizes.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Laurence Oberman <loberman@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Removing all trailing whitespaces in befs.
I was skeptic about tainting the history with this, but whitespace changes
can be ignored by using 'git blame -w' and 'git log -w'.
Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com>
No idea why some comments have signatures. These predate git. Removing them
since they add noise and no information.
Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com>
Fixing checkpatch.pl issues in befs header files:
WARNING: Missing a blank line after declarations
+ befs_inode_addr iaddr;
+ iaddr.allocation_group = blockno >> BEFS_SB(sb)->ag_shift;
WARNING: space prohibited between function name and open parenthesis '('
+ return BEFS_SB(sb)->block_size / sizeof (befs_disk_inode_addr);
ERROR: "foo * bar" should be "foo *bar"
+ const char *key, befs_off_t * value);
ERROR: Macros with complex values should be enclosed in parentheses
+#define PACKED __attribute__ ((__packed__))
Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com>
Fix the following type of checkpatch.pl issues:
WARNING: line over 80 characters
+static struct dentry *befs_lookup(struct inode *, struct dentry *, unsigned int);
ERROR: code indent should use tabs where possible
+ if (!bi)$
WARNING: please, no spaces at the start of a line
+ if (!bi)$
WARNING: labels should not be indented
+ unacquire_bh:
WARNING: space prohibited between function name and open parenthesis '('
+ sizeof (struct befs_inode_info),
WARNING: braces {} are not necessary for single statement blocks
+ if (!*out) {
+ return -ENOMEM;
+ }
WARNING: Block comments use a trailing */ on a separate line
+ * in special cases */
WARNING: Missing a blank line after declarations
+ int token;
+ if (!*p)
ERROR: do not use assignment in if condition
+ if (!(bh = sb_bread(sb, sb_block))) {
ERROR: space prohibited after that open parenthesis '('
+ if( befs_sb->num_blocks > ~((sector_t)0) ) {
ERROR: space prohibited before that close parenthesis ')'
+ if( befs_sb->num_blocks > ~((sector_t)0) ) {
ERROR: space required before the open parenthesis '('
+ if( befs_sb->num_blocks > ~((sector_t)0) ) {
Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com>
Fixing the two following checkpatch.pl issues:
ERROR: trailing whitespace
+ * Based on portions of file.c and inode.c $
WARNING: labels should not be indented
+ error:
Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com>
Fixing the following checkpatch.pl errors and warning:
ERROR: trailing whitespace
+ * $
WARNING: Block comments use * on subsequent lines
+/*
+ Validates the correctness of the befs inode
ERROR: "foo * bar" should be "foo *bar"
+befs_check_inode(struct super_block *sb, befs_inode * raw_inode,
Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com>
Commit 8924feff66 ("splice: lift pipe_lock out of splice_to_pipe()")
caused a regression when there were no more readers left on a pipe that
was being spliced into: rather than the expected SIGPIPE and -EPIPE
return value, the writer would end up waiting forever for space to free
up (which obviously was not going to happen with no readers around).
Fixes: 8924feff66 ("splice: lift pipe_lock out of splice_to_pipe()")
Reported-and-tested-by: Andreas Schwab <schwab@linux-m68k.org>
Debugged-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: stable@kernel.org # v4.9
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Highlights include:
- Further attribute cache improvements to make revalidation more fine grained
- NFSv4 locking improvements
Bugfixes:
- nfs4_fl_prepare_ds must be careful about reporting success in files layout
- pNFS/flexfiles: Instead of marking a device inactive, remove it from the cache
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJYWoWvAAoJEGcL54qWCgDydigQAMFTUZTjHGx34l3yqie3uV0D
P0Ws6qfQJwN6xwgbhLMt6pbECadN7d55KszIM71gu/Iqa1UbtHZht2Slf/Tw5wY8
vjptzLkvOie5g9sGIhhQqkgrXnG3A96aEuaknIj9JXWALJb01M94ZfC0STgUjmil
sp100cDmmOonqflk7Tgh/jmfiUucBw9HnyNRY59sUaR0lAmHRrpJuqftmarYKRrq
1rTCeGx6MyM2kVyqGsY7FP1prQcNcIJXwti/EXrOywXbHZxmjl+l0WLu6iUOEj/I
zvfWCeMFyMXeVsTrRLYeM1vQ/maUbSev6gms2RshbmDITE6Ir9nHfBTCfOmh95wV
paKnAzuEKsUR+LhEbsORk03efQYkeDQuJW3QqhdVuqGnEwhaboTiP4GRJLjEo287
dHy6sEGEbAcieUsolXhDIaxggoF39gGihOT3+m+WgIlXRd38W2mvlQC9oaq89ehw
Ipetm3swnkY8bTavsNYFW1x+V5zrGH9Dlu2/mpnqVAIwGT+ByolD8tYDAHf4zj3b
cy9XhFzdLDRy4V95l4hzZFDCdjy4dZZu7cJEO5zqUB3236qhiZbGLjL232ZEHFi3
wNHn13nADV+DzoOuB5HJwfhKyOOBQsQpBIvzAgt1n7muZwxa35nX/2waX1xt2HX5
J+s+e02xJnMStcr/2O5v
=8gLp
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-4.10-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull more NFS client updates from Trond Myklebust:
"Highlights include:
- further attribute cache improvements to make revalidation more fine
grained
- NFSv4 locking improvements
Bugfixes:
- nfs4_fl_prepare_ds must be careful about reporting success in files
layout
- pNFS/flexfiles: Instead of marking a device inactive, remove it
from the cache"
* tag 'nfs-for-4.10-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
NFSv4: Retry the DELEGRETURN if the embedded GETATTR is rejected with EACCES
NFS: Retry the CLOSE if the embedded GETATTR is rejected with EACCES
NFSv4: Place the GETATTR operation before the CLOSE
NFSv4: Also ask for attributes when downgrading to a READ-only state
NFS: Don't abuse NFS_INO_REVAL_FORCED in nfs_post_op_update_inode_locked()
pNFS: Return RW layouts on OPEN_DOWNGRADE
NFSv4: Add encode/decode of the layoutreturn op in OPEN_DOWNGRADE
NFS: Don't disconnect open-owner on NFS4ERR_BAD_SEQID
NFSv4: ensure __nfs4_find_lock_state returns consistent result.
NFSv4.1: nfs4_fl_prepare_ds must be careful about reporting success.
pNFS/flexfiles: delete deviceid, don't mark inactive
NFS: Clean up nfs_attribute_timeout()
NFS: Remove unused function nfs_revalidate_inode_rcu()
NFS: Fix and clean up the access cache validity checking
NFS: Only look at the change attribute cache state in nfs_weak_revalidate()
NFS: Clean up cache validity checking
NFS: Don't revalidate the file on close if we hold a delegation
NFSv4: Don't discard the attributes returned by asynchronous DELEGRETURN
NFSv4: Update the attribute cache info in update_changeattr
If our DELEGRETURN RPC call is rejected with an EACCES call, then we should
remove the GETATTR call from the compound RPC and retry.
This could potentially happen when there is a conflict between an
ACL denying attribute reads and our use of SP4_MACH_CRED.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
If our CLOSE RPC call is rejected with an EACCES call, then we should
remove the GETATTR call from the compound RPC and retry.
This could potentially happen when there is a conflict between an
ACL denying attribute reads and our use of SP4_MACH_CRED.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
In order to benefit from the DENY share lock protection, we should
put the GETATTR operation before the CLOSE. Otherwise, we might race
with a Windows machine that thinks it is now safe to modify the file.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
If we're downgrading from a READ+WRITE mode to a READ-only mode, then
ask for cache consistency attributes so that we avoid the revalidation
in nfs_close_context()
Fixes: 3947b74d0f ("NFSv4: Don't request a GETATTR on open_downgrade.")
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
The NFS_INO_REVAL_FORCED flag now really only has meaning for the
case when we've just been handed a delegation for a file that was already
cached, and we're unsure about that cache.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
If the client holds no more writeable open state, and does not hold a
write delegation, then send a layoutreturn as part of the OPEN_DOWNGRADE.
We do this only for writes, since some layout drivers may require you to
also hold a read layout if you are doing a R/W workload.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
While we do not need to return the RW layout when downgrading from a
read/write open state to read-only, we might want to do so in order
to reduce the burden on the metadataserver so that it does not need
to check for changed data when responding to GETATTR requests.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
When an NFS4ERR_BAD_SEQID is received the open-owner is removed from
the ->state_owners rbtree so that it will no longer be used.
If any stateids attached to this open-owner are still in use, and if a
request using one gets an NFS4ERR_BAD_STATEID reply, this can for bad.
The state is marked as needing recovery and the nfs4_state_manager()
is scheduled to clean up. nfs4_state_manager() finds states to be
recovered by walking the state_owners rbtree. As the open-owner is
not in the rbtree, the bad state is not found so nfs4_state_manager()
completes having done nothing. The request is then retried, with a
predicatable result (indefinite retries).
If the stateid is for a delegation, this open_owner will be used
to open files when the delegation is returned. For that to work,
a new open-owner needs to be presented to the server.
This patch changes NFS4ERR_BAD_SEQID handling to leave the open-owner
in the rbtree but updates the 'create_time' so it looks like a new
open-owner. With this the indefinite retries no longer happen.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
If a file has both flock locks and OFD locks, then it is possible that
two different nfs4 lock states could apply to file accesses from a
single process.
It is not possible to know, efficiently, which one is "correct".
Presumably the state which represents a lock that covers the region
undergoing IO would be the "correct" one to use, but finding that has
a non-trivial cost and would provide miniscule value.
Currently we just return whichever is first in the list, which could
result in inconsistent behaviour if an application ever put it self in
this position. As consistent behaviour is preferable (when perfectly
correct behaviour is not available), change the search to return a
consistent result in this circumstance.
Specifically: if there is both a flock and OFD lock state, always return
the flock one.
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Various places assume that if nfs4_fl_prepare_ds() turns a non-NULL 'ds',
then ds->ds_clp will also be non-NULL.
This is not necessasrily true in the case when the process received a fatal signal
while nfs4_pnfs_ds_connect is waiting in nfs4_wait_ds_connect().
In that case ->ds_clp may not be set, and the devid may not recently have been marked
unavailable.
So add a test for ds_clp == NULL and return NULL in that case.
Fixes: c23266d532 ("NFS4.1 Fix data server connection race")
Signed-off-by: NeilBrown <neilb@suse.com>
Acked-by: Olga Kornievskaia <aglo@umich.edu>
Acked-by: Adamson, Andy <William.Adamson@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Instead of marking a device inactive, remove it from the cache entirely.
Flexfiles has a way to report errors back to the server, so we don't want
to stop devices from being tried again for 120 seconds.
Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
The access cache needs to check whether or not the mode bits, ownership,
or ACL has changed or the cache has timed out.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Just like in nfs_check_verifier(), we want to use
nfs_mapping_need_revalidate_inode() to check our knowledge of the
change attribute is up to date.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Consolidate the open-coded checking of NFS_I(inode)->cache_validity
into a couple of helper functions.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
If we're holding a delegation, we can skip sending the close-to-open
GETATTR until we're returning that delegation.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
DELEGRETURN will always carry a reference to the inode except when
the latter is being freed, so let's ensure that we always use that
inode information to ensure close-to-open cache consistency, even
when the DELEGRETURN call is asynchronous.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
If we successfully updated the change attribute, we should timestamp the
cache. While we do know that the other attributes are not completely up
to date, we have the NFS_INO_INVALID_ATTR flag that let us know that,
so it is valid to say that the cache has not timed out.
We can also clear NFS_INO_REVAL_PAGECACHE, since our change attribute
is now known to be valid.
Conversely, if the change attribute did not match, we should make sure to
also revalidate the access and ACL caches.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
In function btrfs_uuid_tree_iterate(), errno is assigned to variable ret
on errors. However, it directly returns 0. It may be better to return
ret. This patch also removes the warning, because the caller already
prints a warning.
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=188731
Signed-off-by: Pan Bian <bianpan2016@163.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
[ edited subject ]
Signed-off-by: David Sterba <dsterba@suse.com>
Pull quota, fsnotify and ext2 updates from Jan Kara:
"Changes to locking of some quota operations from dedicated quota mutex
to s_umount semaphore, a fsnotify fix and a simple ext2 fix"
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
quota: Fix bogus warning in dquot_disable()
fsnotify: Fix possible use-after-free in inode iteration on umount
ext2: reject inodes with negative size
quota: Remove dqonoff_mutex
ocfs2: Use s_umount for quota recovery protection
quota: Remove dqonoff_mutex from dquot_scan_active()
ocfs2: Protect periodic quota syncing with s_umount semaphore
quota: Use s_umount protection for quota operations
quota: Hold s_umount in exclusive mode when enabling / disabling quotas
fs: Provide function to get superblock with exclusive s_umount
dquot_disable() was warning when sb_has_quota_loaded() was true when
invalidating page cache for quota files. The thinking behind this
warning was that we must have raced with somebody else turning quotas on
and this should not happen because all places modifying quota state must
hold s_umount exclusively now. However sb_has_quota_loaded() can be also
true at this point when we are just suspending quotas on remount
read-only. Just restore the behavior to situation before commit
c3b004460d ("quota: Remove dqonoff_mutex") which introduced the
warning.
The code in dquot_disable() can be further simplified with the new
locking of quota state changes however let's leave that to a separate
commit that can get more testing exposure.
Fixes: c3b004460d
Signed-off-by: Jan Kara <jack@suse.cz>
Pull partial readlink cleanups from Miklos Szeredi.
This is the uncontroversial part of the readlink cleanup patch-set that
simplifies the default readlink handling.
Miklos and Al are still discussing the rest of the series.
* git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
vfs: make generic_readlink() static
vfs: remove ".readlink = generic_readlink" assignments
vfs: default to generic_readlink()
vfs: replace calling i_op->readlink with vfs_readlink()
proc/self: use generic_readlink
ecryptfs: use vfs_get_link()
bad_inode: add missing i_op initializers
Pull more vfs updates from Al Viro:
"In this pile:
- autofs-namespace series
- dedupe stuff
- more struct path constification"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (40 commits)
ocfs2: implement the VFS clone_range, copy_range, and dedupe_range features
ocfs2: charge quota for reflinked blocks
ocfs2: fix bad pointer cast
ocfs2: always unlock when completing dio writes
ocfs2: don't eat io errors during _dio_end_io_write
ocfs2: budget for extent tree splits when adding refcount flag
ocfs2: prohibit refcounted swapfiles
ocfs2: add newlines to some error messages
ocfs2: convert inode refcount test to a helper
simple_write_end(): don't zero in short copy into uptodate
exofs: don't mess with simple_write_{begin,end}
9p: saner ->write_end() on failing copy into non-uptodate page
fix gfs2_stuffed_write_end() on short copies
fix ceph_write_end()
nfs_write_end(): fix handling of short copies
vfs: refactor clone/dedupe_file_range common functions
fs: try to clone files first in vfs_copy_file_range
vfs: misc struct path constification
namespace.c: constify struct path passed to a bunch of primitives
quota: constify struct path in quota_on
...
Make sure that clone_mnt() never returns a mount with MNT_SHARED in
flags, but without a valid ->mnt_group_id. That allows to demystify
do_make_slave() quite a bit, among other things.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
- a large rework of cephx auth code to cope with CONFIG_VMAP_STACK
(myself). Also fixed a deadlock caused by a bogus allocation on the
writeback path and authorize reply verification.
- a fix for long stalls during fsync (Jeff Layton). The client now
has a way to force the MDS log flush, leading to ~100x speedups in
some synthetic tests.
- a new [no]require_active_mds mount option (Zheng Yan). On mount, we
will now check whether any of the MDSes are available and bail rather
than block if none are. This check can be avoided by specifying the
"no" option.
- a couple of MDS cap handling fixes and a few assorted patches
throughout.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQEcBAABCAAGBQJYVByGAAoJEEp/3jgCEfOLBqkH/A7nVf7ObSDYmLuYgg1gJ8zq
4zDDE42S4yZwayAVpn3UjbfPuez5J44lsdXitExdfiHOdIQZDa/WqAbSqQ48HCSg
7sG6ecRWg3G5zG0psPZnB+S5wGMvsLXmj2hvzV1lt2t0lI5bDLSlNRSnElbhilD/
8Z7+Ni2go8DMC9o49SJU32lBW7IByKl4p4flveItgwUvGkIFNd8OT3CyPBUqonQs
lRCeImRYU8Jghb+ifnRxWSbuDf7pZAPc9kL0vibpUUT/1bH6iHsedKp37WQKqc/w
KDSNnKiZcz0gY/hJeLqE3ymCIKO6SU+JkMQSaYNTouLO5fQsRr8/uWQXSe6S5oc=
=ypWx
-----END PGP SIGNATURE-----
Merge tag 'ceph-for-4.10-rc1' of git://github.com/ceph/ceph-client
Pull ceph updates from Ilya Dryomov:
"A varied set of changes:
- a large rework of cephx auth code to cope with CONFIG_VMAP_STACK
(myself). Also fixed a deadlock caused by a bogus allocation on the
writeback path and authorize reply verification.
- a fix for long stalls during fsync (Jeff Layton). The client now
has a way to force the MDS log flush, leading to ~100x speedups in
some synthetic tests.
- a new [no]require_active_mds mount option (Zheng Yan).
On mount, we will now check whether any of the MDSes are available
and bail rather than block if none are. This check can be avoided
by specifying the "no" option.
- a couple of MDS cap handling fixes and a few assorted patches
throughout"
* tag 'ceph-for-4.10-rc1' of git://github.com/ceph/ceph-client: (32 commits)
libceph: remove now unused finish_request() wrapper
libceph: always signal completion when done
ceph: avoid creating orphan object when checking pool permission
ceph: properly set issue_seq for cap release
ceph: add flags parameter to send_cap_msg
ceph: update cap message struct version to 10
ceph: define new argument structure for send_cap_msg
ceph: move xattr initialzation before the encoding past the ceph_mds_caps
ceph: fix minor typo in unsafe_request_wait
ceph: record truncate size/seq for snap data writeback
ceph: check availability of mds cluster on mount
ceph: fix splice read for no Fc capability case
ceph: try getting buffer capability for readahead/fadvise
ceph: fix scheduler warning due to nested blocking
ceph: fix printing wrong return variable in ceph_direct_read_write()
crush: include mapper.h in mapper.c
rbd: silence bogus -Wmaybe-uninitialized warning
libceph: no need to drop con->mutex for ->get_authorizer()
libceph: drop len argument of *verify_authorizer_reply()
libceph: verify authorize reply on connect
...
Pull overlayfs updates from Miklos Szeredi:
"This update contains:
- try to clone on copy-up
- allow renaming a directory
- split source into managable chunks
- misc cleanups and fixes
It does not contain the read-only fd data inconsistency fix, which Al
didn't like. I'll leave that to the next year..."
* 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs: (36 commits)
ovl: fix reStructuredText syntax errors in documentation
ovl: fix return value of ovl_fill_super
ovl: clean up kstat usage
ovl: fold ovl_copy_up_truncate() into ovl_copy_up()
ovl: create directories inside merged parent opaque
ovl: opaque cleanup
ovl: show redirect_dir mount option
ovl: allow setting max size of redirect
ovl: allow redirect_dir to default to "on"
ovl: check for emptiness of redirect dir
ovl: redirect on rename-dir
ovl: lookup redirects
ovl: consolidate lookup for underlying layers
ovl: fix nested overlayfs mount
ovl: check namelen
ovl: split super.c
ovl: use d_is_dir()
ovl: simplify lookup
ovl: check lower existence of rename target
ovl: rename: simplify handling of lower/merged directory
...
Pull btrfs updates from Chris Mason:
"Jeff Mahoney and Dave Sterba have a really nice set of cleanups in
here, and Christoph pitched in corrections/improvements to make btrfs
use proper helpers for bio walking instead of doing it by hand.
There are some key fixes as well, including some long standing bugs
that took forever to track down in btrfs_drop_extents and during
balance"
* 'for-linus-4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (77 commits)
btrfs: limit async_work allocation and worker func duration
Revert "Btrfs: adjust len of writes if following a preallocated extent"
Btrfs: don't WARN() in btrfs_transaction_abort() for IO errors
btrfs: opencode chunk locking, remove helpers
btrfs: remove root parameter from transaction commit/end routines
btrfs: split btrfs_wait_marked_extents into normal and tree log functions
btrfs: take an fs_info directly when the root is not used otherwise
btrfs: simplify btrfs_wait_cache_io prototype
btrfs: convert extent-tree tracepoints to use fs_info
btrfs: root->fs_info cleanup, access fs_info->delayed_root directly
btrfs: root->fs_info cleanup, add fs_info convenience variables
btrfs: root->fs_info cleanup, update_block_group{,flags}
btrfs: root->fs_info cleanup, lock/unlock_chunks
btrfs: root->fs_info cleanup, btrfs_calc_{trans,trunc}_metadata_size
btrfs: pull node/sector/stripe sizes out of root and into fs_info
btrfs: root->fs_info cleanup, io_ctl_init
btrfs: root->fs_info cleanup, use fs_info->dev_root everywhere
btrfs: struct reada_control.root -> reada_control.fs_info
btrfs: struct btrfsic_state->root should be an fs_info
btrfs: alloc_reserved_file_extent trace point should use extent_root
...
that makes ACL inheritance a little more useful in environments that
default to restrictive umasks. Requires client-side support, also on
its way for 4.10.
Other than that, miscellaneous smaller fixes and cleanup, especially to
the server rdma code.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJYVAEqAAoJECebzXlCjuG+VM0QAKaR+ibSM31Ahpnrgit5/wrb
n630KDFztO7iqEeuHfPQ4/n05T2QR0JWsLpjLMFvx88Gy4gyXYk9cuDPIrNKX1IS
3/nnhBo0+EVnjODjufommCrtbPZlqOSsS3N03vWkB7rTi8QYsWBOThh+XLRJYOXo
LZzJE1WmXNeCXV1kXPBsauryywql1fmwTXBzmIf1HbzoGAVROMEA2qqh4Z3nb7BP
sJuGchWx0STBOuAa278ighXQPUW2lUft9uzw2bssOtMwfNyOs/Pd6nx4F1Lg6WwD
1UQXoiR8K3PqelZfoeFJ05v0css/sbNKep+huWRdOXZj3Kjpa20lKBX8xHfat7sN
1OQ4FHx8ToigX3c+wwtlCqRMCcIxqUYkRjqzPHyeBiSSSp0rLrId44rI5x/K0yay
3bkGw7hFDSzc0Nq2uZgmtlbyTC71hLNhkWe7ThofcVG/pS0JtAqBiKIVwXJPh/e0
PLmVHYGU6Xowjag5edJlXY1tlIlxtWfqsWUarCXS5bfKUa3UjMVSjyuljsDqqJsn
96fEWu7DiUo4HeGYmf8MJoeZYV2y0DKSQGeguVkUKWp2DoTzinQHTfdKvrZVwNuu
hVE9/QeWzUvPY13HOUaKD2skozhbUChqv0NHESKUv8gxE3svTEpYZkXrE74WNqMk
l/WXAhw+RdKZof4+qdjU
=JANY
-----END PGP SIGNATURE-----
Merge tag 'nfsd-4.10' of git://linux-nfs.org/~bfields/linux
Pull nfsd updates from Bruce Fields:
"The one new feature is support for a new NFSv4.2 mode_umask attribute
that makes ACL inheritance a little more useful in environments that
default to restrictive umasks. Requires client-side support, also on
its way for 4.10.
Other than that, miscellaneous smaller fixes and cleanup, especially
to the server rdma code"
[ The client side of the umask attribute was merged yesterday ]
* tag 'nfsd-4.10' of git://linux-nfs.org/~bfields/linux:
nfsd: add support for the umask attribute
sunrpc: use DEFINE_SPINLOCK()
svcrdma: Further clean-up of svc_rdma_get_inv_rkey()
svcrdma: Break up dprintk format in svc_rdma_accept()
svcrdma: Remove unused variable in rdma_copy_tail()
svcrdma: Remove unused variables in xprt_rdma_bc_allocate()
svcrdma: Remove svc_rdma_op_ctxt::wc_status
svcrdma: Remove DMA map accounting
svcrdma: Remove BH-disabled spin locking in svc_rdma_send()
svcrdma: Renovate sendto chunk list parsing
svcauth_gss: Close connection when dropping an incoming message
svcrdma: Clear xpt_bc_xps in xprt_setup_rdma_bc() error exit arm
nfsd: constify reply_cache_stats_operations structure
nfsd: update workqueue creation
sunrpc: GFP_KERNEL should be GFP_NOFS in crypto code
nfsd: catch errors in decode_fattr earlier
nfsd: clean up supported attribute handling
nfsd: fix error handling for clients that fail to return the layout
nfsd: more robust allocation failure handling in nfsd_reply_cache_init
Pull vfs updates from Al Viro:
- more ->d_init() stuff (work.dcache)
- pathname resolution cleanups (work.namei)
- a few missing iov_iter primitives - copy_from_iter_full() and
friends. Either copy the full requested amount, advance the iterator
and return true, or fail, return false and do _not_ advance the
iterator. Quite a few open-coded callers converted (and became more
readable and harder to fuck up that way) (work.iov_iter)
- several assorted patches, the big one being logfs removal
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
logfs: remove from tree
vfs: fix put_compat_statfs64() does not handle errors
namei: fold should_follow_link() with the step into not-followed link
namei: pass both WALK_GET and WALK_MORE to should_follow_link()
namei: invert WALK_PUT logics
namei: shift interpretation of LOOKUP_FOLLOW inside should_follow_link()
namei: saner calling conventions for mountpoint_last()
namei.c: get rid of user_path_parent()
switch getfrag callbacks to ..._full() primitives
make skb_add_data,{_nocache}() and skb_copy_to_page_nocache() advance only on success
[iov_iter] new primitives - copy_from_iter_full() and friends
don't open-code file_inode()
ceph: switch to use of ->d_init()
ceph: unify dentry_operations instances
lustre: switch to use of ->d_init()
If kcalloc() failed, the return value of ovl_fill_super() is -EINVAL,
not -ENOMEM. So this patch sets this value to -ENOMEM before calling
kcalloc(), and sets it back to -EINVAL after calling kcalloc().
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
FWIW, there's a bit of abuse of struct kstat in overlayfs object
creation paths - for one thing, it ends up with a very small subset
of struct kstat (mode + rdev), for another it also needs link in
case of symlinks and ends up passing it separately.
IMO it would be better to introduce a separate object for that.
In principle, we might even lift that thing into general API and switch
->mkdir()/->mknod()/->symlink() to identical calling conventions. Hell
knows, perhaps ->create() as well...
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
The benefit of making directories opaque on creation is that lookups can
stop short when they reach the original created directory, instead of
continue lookup the entire depth of parent directory stack.
The best case is overlay with N layers, performing lookup for first level
directory, which exists only in upper. In that case, there will be only
one lookup instead of N.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
oe->opaque is set for
a) whiteouts
b) directories having the "trusted.overlay.opaque" xattr
Case b can be simplified, since setting the xattr always implies setting
oe->opaque. Also once set, the opaque flag is never cleared.
Don't need to set opaque flag for non-directories.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Add a module option to allow tuning the max size of absolute redirects.
Default is 256.
Size of relative redirects is naturally limited by the the underlying
filesystem's max filename length (usually 255).
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
This patch introduces a kernel config option and a module param. Both can
be used independently to turn the default value of redirect_dir on or off.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Before introducing redirect_dir feature, the condition
!ovl_lower_positive(dentry) for a directory, implied that it is a pure
upper directory, which may be removed if empty.
Now that directory can be redirect, it is possible that upper does not
cover any lower (i.e. !ovl_lower_positive(dentry)), but the directory is a
merge (with redirected path) and maybe non empty.
Check for this case in ovl_remove_upper().
This change fixes the following test case from rename-pop-dir.py
of unionmount-testsuite:
"""Remove dir and rename old name"""
d = ctx.non_empty_dir()
d2 = ctx.no_dir()
ctx.rmdir(d, err=ENOTEMPTY)
ctx.rename(d, d2)
ctx.rmdir(d, err=ENOENT)
ctx.rmdir(d2, err=ENOTEMPTY)
./run --ov rename-pop-dir
/mnt/a/no_dir103: Expected error (Directory not empty) was not produced
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Current code returns EXDEV when a directory would need to be copied up to
move. We could copy up the directory tree in this case, but there's
another, simpler solution: point to old lower directory from moved upper
directory.
This is achieved with a "trusted.overlay.redirect" xattr storing the path
relative to the root of the overlay. After such attribute has been set,
the directory can be moved without further actions required.
This is a backward incompatible feature, old kernels won't be able to
correctly mount an overlay containing redirected directories.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
If a directory has the "trusted.overlay.redirect" xattr, it means that the
value of the xattr should be used to find the underlying directory on the
next lower layer.
The redirect may be relative or absolute. Absolute redirects begin with a
slash.
A relative redirect means: instead of the current dentry's name use the
value of the redirect to find the directory in the next lower
layer. Relative redirects must not contain a slash.
An absolute redirect means: look up the directory relative to the root of
the overlay using the value of the redirect in the next lower layer.
Redirects work on lower layers as well.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Use a common helper for lookup of upper and lower layers. This paves the
way for looking up directory redirects.
No functional change.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
When the upper overlayfs checks "trusted.overlay.*" xattr on the underlying
overlayfs mount, it gets -EPERM, which confuses the upper overlayfs.
Fix this by returning -EOPNOTSUPP instead of -EPERM from
ovl_own_xattr_get() and ovl_own_xattr_set(). This behavior is consistent
with the behavior of ovl_listxattr(), which filters out the private
overlayfs xattrs.
Note: nested overlays are deprecated. But this change makes sense
regardless: these xattrs are private to the overlay and should always be
hidden. Hence getting and setting them should indicate this.
[SzMi: Use EOPNOTSUPP instead of ENODATA and use it for both getting and
setting "trusted.overlay." xattrs. This is a perfectly valid error code
for "we don't support this prefix", which is the case here.]
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
We already calculate f_namelen in statfs as the maximum of the name lengths
provided by the filesystems taking part in the overlay.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
fs/overlayfs/super.c is the biggest of the overlayfs source files and it
contains various utility functions as well as the rather complicated lookup
code. Split these parts out to separate files.
Before:
1446 fs/overlayfs/super.c
After:
919 fs/overlayfs/super.c
267 fs/overlayfs/namei.c
235 fs/overlayfs/util.c
51 fs/overlayfs/ovl_entry.h
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
If encountering a non-directory, then stop looking at lower layers.
In this case the oe->opaque flag is not set anymore, which doesn't matter
since existence of lower file is now checked at remove/rename time.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Check if something exists on the lower layer(s) under the target or rename
to decide if directory needs to be marked "opaque".
Marking opaque is done before the rename, and on failure the marking was
undone. Also the opaque xattr was removed if the target didn't cover
anything.
This patch changes behavior so that removal of "opaque" is not done in
either of the above cases. This means that directory may have the opaque
flag even if it doesn't cover anything. However this shouldn't affect the
performance or semantics of the overalay, while simplifying the code.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
d_is_dir() is safe to call on a negative dentry. Use this fact to simplify
handling of the lower or merged directories.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Currently ovl_lookup() checks existence of lower file even if there's a
non-directory on upper (which is always opaque). This is done so that
remove can decide whether a whiteout is needed or not.
It would be better to defer this check to unlink, since most of the time
the gathered information about opaqueness will be unused.
This adds a helper ovl_lower_positive() that checks if there's anything on
the lower layer(s).
The following patches also introduce changes to how the "opaque" attribute
is updated on directories: this attribute is added when the directory is
creted or moved over a whiteout or object covering something on the lower
layer. However following changes will allow the attribute to remain on the
directory after being moved, even if the new location doesn't cover
anything. Because of this, we need to check lower layers even for opaque
directories, so that whiteout is only created when necessary.
This function will later be also used to decide about marking a directory
opaque, so deal with negative dentries as well. When dealing with
negative, it's enough to check for being a whiteout
If the dentry is positive but not upper then it also obviously needs
whiteout/opaque.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Since commit 07a2daab49 ("ovl: Copy up underlying inode's ->i_mode to
overlay inode") sticky checking on overlay inode is performed by the vfs,
so checking against sticky on underlying inode is not needed.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
This is redundant, the vfs already performed this check (and was broken,
see commit 9409e22acd ("vfs: rename: check backing inode being equal")).
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
No sense in opening special files on the underlying layers, they work just
as well if opened on the overlay.
Side effect is that it's no longer possible to connect one side of a pipe
opened on overlayfs with the other side opened on the underlying layer.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
When copying up within the same fs, try to use vfs_clone_file_range().
This is very efficient when lower and upper are on the same fs
with file reflink support. If vfs_clone_file_range() fails for any
reason, copy up falls back to the regular data copy code.
Tested correct behavior when lower and upper are on:
1. same ext4 (copy)
2. same xfs + reflink patches + mkfs.xfs (copy)
3. same xfs + reflink patches + mkfs.xfs -m reflink=1 (reflink)
4. different xfs + reflink patches + mkfs.xfs -m reflink=1 (copy)
For comparison, on my laptop, xfstest overlay/001 (copy up of large
sparse files) takes less than 1 second in the xfs reflink setup vs.
25 seconds on the rest of the setups.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
This reverts commit 03bea60409.
Commit 4d0c5ba2ff ("vfs: do get_write_access() on upper layer of
overlayfs") makes the writecount checks inside overlayfs superfluous, the
file is already copied up and write access acquired on the upper inode when
ovl_setattr is called with ATTR_SIZE.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
With overlayfs, it is wrong to compare file_inode(inode)->i_sb
of regular files with those of non-regular files, because the
former reference the real (upper/lower) sb and the latter reference
the overlayfs sb.
Move the test for same super block after the sanity tests for
clone range of directory and non-regular file.
This change fixes xfstest generic/157, which returned EXDEV instead
of EISDIR/EINVAL in the following test cases over overlayfs:
echo "Try to reflink a dir"
_reflink_range $testdir1/dir1 0 $testdir1/file2 0 $blksz
echo "Try to reflink a device"
_reflink_range $testdir1/dev1 0 $testdir1/file2 0 $blksz
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Move sb_start_write()/sb_end_write() out of the vfs helper and up into the
ioctl handler.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
FICLONE/FICLONERANGE ioctls return -EXDEV if src and dest
files are not on the same mount point.
Practically, clone only requires that src and dest files
are on the same file system.
Move the check for same mount point to ioctl handler and keep
only the check for same super block in the vfs helper.
A following patch is going to use the vfs_clone_file_range()
helper in overlayfs to copy up between lower and upper
mount points on the same file system.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
We've checked for file_out being opened for write. This ensures that we
already have mnt_want_write() on target.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
This reverts commit 9409e22acd.
Since commit 51f7e52dc9 ("ovl: share inode for hard link") there's no
need to call d_real_inode() to check two overlay inodes for equality.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Highlights include:
Stable bugfixes:
- Fix a pnfs deadlock between read resends and layoutreturn
- Don't invalidate the layout stateid while a layout return is outstanding
- Don't schedule a layoutreturn if the layout stateid is marked as invalid
- On a pNFS error, do not send LAYOUTGET until the LAYOUTRETURN is complete
- SUNRPC: fix refcounting problems with auth_gss messages.
Features:
- Add client support for the NFSv4 umask attribute.
- NFSv4: Correct support for flock() stateids.
- Add a LAYOUTRETURN operation to CLOSE and DELEGRETURN when return-on-close
is specified
- Allow the pNFS/flexfiles layoutstat information to piggyback on LAYOUTRETURN
- Optimise away redundant GETATTR calls when doing state recovery and/or
when not required by cache revalidation rules or close-to-open cache
consistency.
- Attribute cache improvements
- RPC/RDMA support for SG_GAP devices
Bugfixes:
- NFS: Fix performance regressions in readdir
- pNFS/flexfiles: Fix a deadlock on LAYOUTGET
- NFSv4: Add missing nfs_put_lock_context()
- NFSv4.1: Fix regression in callback retry handling
- Fix false positive NFSv4.0 trunking detection.
- pNFS/flexfiles: Only send layoutstats updates for mirrors that were updated
- Various layout stateid related bugfixes
- RPC/RDMA bugfixes
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJYUyemAAoJEGcL54qWCgDy96wP/Ry86cknfLUqLKJCbFVV4nV8
HdovCY8if8JQO0HUPDJ25ITvoRJNVRRwJMWnVq5XHRrPUHletDks6/UYfa63UDMv
umHvGST1cQPU1G+vBIQ3sdkVi1X1GeyBY4rU8aDWxLyKWwyeNptCK12i80ifyaGV
GZIIxuKVDOFS15M7NwMPRkrBacF8TyVK6S7275z6ZNmhFtvYwMAbvMxLabTwWAe8
4A03m4RDBTYhQIc2xLJbHfOTYoHi34l90wrn3C7Wv0I2zp8EJlzCY2tSbYKhfPg7
0HVKNdruRL+cHwLwJEcjFbxOg9MArgRxyup3dwAYQq7Ivsf9oR8/D61CDhanXAzy
cAWyrCyxaAoPWCOb8k4OFRh6jOF9LBGb5WTNpXRi1LoGrbvi6/WLlJccV60325wd
gmSAiwIE7aLG8pFk54J0Et86VaQ6qQNBUtJY/4m87uf1FSv3yzQvh7qDr7s+t8ZQ
kmSTZJzMWZLEEeyvEPZCfjygFu7n4PuTePJu31217styvat39TpY2p0HaaMhgC0V
/Y0ygGH7VlGp0oaVQ70CtBzGsCWTKU2DU8di7nvsCKg6iLv89QBILIJhVeP42tKd
juNCWVw4bpW1Zex7HXKecKfMXkDJ4qSDLFzGWj6Ue85f/rCOSKQH01jfvwBlvtBc
3E6fk85ExTw2+siHWiGy
=MapM
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-4.10-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client updates from Trond Myklebust:
"Highlights include:
Stable bugfixes:
- Fix a pnfs deadlock between read resends and layoutreturn
- Don't invalidate the layout stateid while a layout return is
outstanding
- Don't schedule a layoutreturn if the layout stateid is marked as
invalid
- On a pNFS error, do not send LAYOUTGET until the LAYOUTRETURN is
complete
- SUNRPC: fix refcounting problems with auth_gss messages.
Features:
- Add client support for the NFSv4 umask attribute.
- NFSv4: Correct support for flock() stateids.
- Add a LAYOUTRETURN operation to CLOSE and DELEGRETURN when
return-on-close is specified
- Allow the pNFS/flexfiles layoutstat information to piggyback on
LAYOUTRETURN
- Optimise away redundant GETATTR calls when doing state recovery
and/or when not required by cache revalidation rules or
close-to-open cache consistency.
- Attribute cache improvements
- RPC/RDMA support for SG_GAP devices
Bugfixes:
- NFS: Fix performance regressions in readdir
- pNFS/flexfiles: Fix a deadlock on LAYOUTGET
- NFSv4: Add missing nfs_put_lock_context()
- NFSv4.1: Fix regression in callback retry handling
- Fix false positive NFSv4.0 trunking detection.
- pNFS/flexfiles: Only send layoutstats updates for mirrors that were
updated
- Various layout stateid related bugfixes
- RPC/RDMA bugfixes"
* tag 'nfs-for-4.10-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (82 commits)
SUNRPC: fix refcounting problems with auth_gss messages.
nfs: add support for the umask attribute
pNFS/flexfiles: Ensure we have enough buffer for layoutreturn
pNFS/flexfiles: Remove a redundant parameter in ff_layout_encode_ioerr()
pNFS/flexfiles: Fix a deadlock on LAYOUTGET
pNFS: Layoutreturn must free the layout after the layout-private data
pNFS/flexfiles: Fix ff_layout_add_ds_error_locked()
NFSv4: Add missing nfs_put_lock_context()
pNFS: Release NFS_LAYOUT_RETURN when invalidating the layout stateid
NFSv4.1: Don't schedule lease recovery in nfs4_schedule_session_recovery()
NFSv4.1: Handle NFS4ERR_BADSESSION/NFS4ERR_DEADSESSION replies to OP_SEQUENCE
NFS: Only look at the change attribute cache state in nfs_check_verifier
NFS: Fix incorrect size revalidation when holding a delegation
NFS: Fix incorrect mapping revalidation when holding a delegation
pNFS/flexfiles: Support sending layoutstats in layoutreturn
pNFS/flexfiles: Minor refactoring before adding iostats to layoutreturn
NFS: Fix up read of mirror stats
pNFS/flexfiles: Clean up layoutstats
pNFS/flexfiles: Refactor encoding of the layoutreturn payload
pNFS: Add a layoutreturn callback to performa layout-private setup
...