Once we move the block layer to its own status code we'll still want to
propagate the bio_iov_iter_get_pages, so restructure __blkdev_direct_IO
to take ret into account when returning the errno.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Only read bio->bi_error once in the common path.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
In order to avoid future diversions between fs/compat_ioctl.c and
drivers/tty/pty.c, define .compat_ioctl callbacks for the relevant
tty_operations structs. Since both pty_unix98_ioctl() and
pty_bsd_ioctl() are compatible between 32-bit and 64-bit userspace no
special translation is required.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The 0-day kernel test robot reports assertion failures on
!CONFIG_SMP kernels due to failed spin_is_locked() checks. As it
turns out, spin_is_locked() is hardcoded to return zero on
!CONFIG_SMP kernels and so this function cannot be relied on to
verify spinlock state in this configuration.
To avoid this problem, replace the associated asserts with lockdep
variants that do the right thing regardless of kernel configuration.
Drop the one assert that checks for an unlocked lock as there is no
suitable lockdep variant for that case. This moves the spinlock
checks from XFS debug code to lockdep, but generally provides the
same level of protection.
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
On sparc, if we have an alloca() like situation, as is the case with
SHASH_DESC_ON_STACK(), we can end up referencing deallocated stack
memory. The result can be that the value is clobbered if a trap
or interrupt arrives at just the right instruction.
It only occurs if the function ends returning a value from that
alloca() area and that value can be placed into the return value
register using a single instruction.
For example, in lib/libcrc32c.c:crc32c() we end up with a return
sequence like:
return %i7+8
lduw [%o5+16], %o0 ! MEM[(u32 *)__shash_desc.1_10 + 16B],
%o5 holds the base of the on-stack area allocated for the shash
descriptor. But the return released the stack frame and the
register window.
So if an intererupt arrives between 'return' and 'lduw', then
the value read at %o5+16 can be corrupted.
Add a data compiler barrier to work around this problem. This is
exactly what the gcc fix will end up doing as well, and it absolutely
should not change the code generated for other cpus (unless gcc
on them has the same bug :-)
With crucial insight from Eric Sandeen.
Cc: <stable@vger.kernel.org>
Reported-by: Anatoly Pugachev <matorola@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Provide a control message that can be specified on the first sendmsg() of a
client call or the first sendmsg() of a service response to indicate the
total length of the data to be transmitted for that call.
Currently, because the length of the payload of an encrypted DATA packet is
encrypted in front of the data, the packet cannot be encrypted until we
know how much data it will hold.
By specifying the length at the beginning of the transmit phase, each DATA
packet length can be set before we start loading data from userspace (where
several sendmsg() calls may contribute to a particular packet).
An error will be returned if too little or too much data is presented in
the Tx phase.
Signed-off-by: David Howells <dhowells@redhat.com>
Now that we have umask support, we shouldn't re-send the mode in a SETATTR
following an exclusive CREATE, or we risk having the same problem fixed in
commit 5334c5bdac ("NFS: Send attributes in OPEN request for
NFS4_CREATE_EXCLUSIVE4_1"), which is that files with S_ISGID will have that
bit stripped away.
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Fixes: dff25ddb48 ("nfs: add support for the umask attribute")
Cc: stable@vger.kernel.org # v4.10+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
For some file systems we still memcpy into it, but in various places this
already allows us to use the proper uuid helpers. More to come..
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Acked-by: Mimi Zohar <zohar@linux.vnet.ibm.com> (Changes to IMA/EVM)
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Use the common helper uuid_is_null() and remove the xfs specific
helper uuid_is_nil().
The common helper does not check for the NULL pointer value as
xfs helper did, but xfs code never calls the helper with a pointer
that can be NULL.
Conform comments and warning strings to use the term 'null uuid'
instead of 'nil uuid', because this is the terminology used by
lib/uuid.c and its users. It is also the terminology used in
userspace by libuuid and xfsprogs.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
[hch: remove now unused uuid.[ch]]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Opencode uuid_getnodeuniq in the only caller, and directly decode
the uuid_t representation instead of using a structure cast for it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
These helper are used to compare and copy two uuid_t type objects.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
[hch: also provide the respective guid_ versions]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Our "little endian" UUID really is a Wintel GUID, so rename it and its
helpers such (guid_t). The big endian UUID is the only true one, so
give it the name uuid_t. The uuid_le and uuid_be names are retained for
now, but will hopefully go away soon. The exception to that are the _cmp
helpers that will be replaced by better primitives ASAP and thus don't
get the new names.
Also the _to_bin helpers are named to match the better named uuid_parse
routine in userspace.
Also remove the existing typedef in XFS that's now been superceeded by
the generic type name.
Signed-off-by: Christoph Hellwig <hch@lst.de>
[andy: also update the UUID_LE/UUID_BE macros including fallout]
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Use the generic Linux definition to implement our UUID type, this will
allow using more generic infrastructure in the future.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
uuid_t definition is about to change.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
This essentially is a partial revert of commit ff548773
("afs: Move UUID struct to linux/uuid.h") and moves struct uuid_v1 back into
fs/afs as struct afs_uuid. It however keeps it as big endian structure
so that we can use the normal uuid generation helpers when casting to/from
struct afs_uuid.
The V1 uuid intrepretation in struct form isn't really useful to the
rest of the kernel, and not really compatible to it either, so move it
back to AFS instead of polluting the global uuid.h.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: David Howells <dhowells@redhat.com>
This fixes a problem with reading files larger than 2GB from a UFS-2
file system:
https://bugzilla.kernel.org/show_bug.cgi?id=195721
The incorrect UFS s_maxsize limit became a problem as of commit
c2a9737f45 ("vfs,mm: fix a dead loop in truncate_inode_pages_range()")
which started using s_maxbytes to avoid a page index overflow in
do_generic_file_read().
That caused files to be truncated on UFS-2 file systems because the
default maximum file size is 2GB (MAX_NON_LFS) and UFS didn't update it.
Here I simply increase the default to a common value used by other file
systems.
Signed-off-by: Richard Narron <comet.berkeley@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Will B <will.brokenbourgh2877@gmail.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: <stable@vger.kernel.org> # v4.9 and backports of c2a9737f45
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Bugfixes include:
- Fix a typo in commit e092693443 that breaks copy offload
- Fix the connect error propagation in xs_tcp_setup_socket()
- Fix a lock leak in nfs40_walk_client_list
- Verify that pNFS requests lie within the offset range of the layout segment.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJZM0YBAAoJEGcL54qWCgDyLUwQALaPEVp00UMdDR0in7MIFKsO
2mgi7pOyn6po3EjxKbGtjAbL4nSlVxdaFpCIGg47YXrl9/95Zjjmyke+iwRdnMsa
ZPyXwfhVRa80fxbOogAverNCnCptoHoG7EzdWuCTcOOxMxR3Ixs7wVJXrs+7ig+r
IdvIAyTsiDYuP5yVp5KkmJCtLGc0Ze20rb7VgdQJfdiLibWvfYCLZ9CgfAQkdAMU
RIlbT0/BG13XDqwh/C2V1vLge0VfpT5p8qbIb/kFyQ0ZJUUiicGGGjp3u/yj0aG9
ljldI34WmQpsy+nCNN4dEgsF461ECvWLwRZnnpN9nv7VurUBpJNUqHLnubvDbzhh
w8QX54ceEWuQAjg96keNuYOhoG53Omle2/Cm+nmiJOmShJbJ0yh4OcB9DYe0gdYa
5YXbKRjPvf/HfdE7PPpvbPG2E211zfvkLdHnFxswggWyGrh23kqlWrpcHpZomGNW
GbJLfIfhyEfBjCPdNJT3Tzvewo2LkcTNLb+3mJhkxOegkdops8vGYA9G2mba3Daj
1HWl1yFAdzlEf2H1Cb8Y2ZrJKHAmaYBKBkKZYUeAcr6EtoxNqnNMP+PEDcVIzPKg
6Jq7DiYwYksK+XDWK9G4QBguKKGLvYtv0MIA3QDX+bBGLo+eFYxc2iaaJefYNdkK
+vdLHclg/YpepLg+Ui21
=P2bm
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-4.12-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client bugfixes from Trond Myklebust:
"Bugfixes include:
- Fix a typo in commit e092693443 ("NFS append COMMIT after
synchronous COPY") that breaks copy offload
- Fix the connect error propagation in xs_tcp_setup_socket()
- Fix a lock leak in nfs40_walk_client_list
- Verify that pNFS requests lie within the offset range of the layout
segment"
* tag 'nfs-for-4.12-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
nfs: Mark unnecessarily extern functions as static
SUNRPC: ensure correct error is reported by xs_tcp_setup_socket()
NFSv4.0: Fix a lock leak in nfs40_walk_client_list
pnfs: Fix the check for requests in range of layout segment
xprtrdma: Delete an error message for a failed memory allocation in xprt_rdma_bc_setup()
pNFS/flexfiles: missing error code in ff_layout_alloc_lseg()
NFS fix COMMIT after COPY
nfs_initialise_sb() and nfs_clone_super() are declared as extern even
though they are used only in fs/nfs/super.c. Mark them as static.
Also remove explicit 'inline' directive from nfs_initialise_sb() and
leave it upto compiler to decide whether inlining is worth it.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Merge misc fixes from Andrew Morton:
"15 fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
scripts/gdb: make lx-dmesg command work (reliably)
mm: consider memblock reservations for deferred memory initialization sizing
mm/hugetlb: report -EHWPOISON not -EFAULT when FOLL_HWPOISON is specified
mlock: fix mlock count can not decrease in race condition
mm/migrate: fix refcount handling when !hugepage_migration_supported()
dax: fix race between colliding PMD & PTE entries
mm: avoid spurious 'bad pmd' warning messages
mm/page_alloc.c: make sure OOM victim can try allocations with no watermarks once
pcmcia: remove left-over %Z format
slub/memcg: cure the brainless abuse of sysfs attributes
initramfs: fix disabling of initramfs (and its compression)
mm: clarify why we want kmalloc before falling backto vmallock
frv: declare jiffies to be located in the .data section
include/linux/gfp.h: fix ___GFP_NOLOCKDEP value
ksm: prevent crash after write_protect_page fails
We currently have two related PMD vs PTE races in the DAX code. These
can both be easily triggered by having two threads reading and writing
simultaneously to the same private mapping, with the key being that
private mapping reads can be handled with PMDs but private mapping
writes are always handled with PTEs so that we can COW.
Here is the first race:
CPU 0 CPU 1
(private mapping write)
__handle_mm_fault()
create_huge_pmd() - FALLBACK
handle_pte_fault()
passes check for pmd_devmap()
(private mapping read)
__handle_mm_fault()
create_huge_pmd()
dax_iomap_pmd_fault() inserts PMD
dax_iomap_pte_fault() does a PTE fault, but we already have a DAX PMD
installed in our page tables at this spot.
Here's the second race:
CPU 0 CPU 1
(private mapping read)
__handle_mm_fault()
passes check for pmd_none()
create_huge_pmd()
dax_iomap_pmd_fault() inserts PMD
(private mapping write)
__handle_mm_fault()
create_huge_pmd() - FALLBACK
(private mapping read)
__handle_mm_fault()
passes check for pmd_none()
create_huge_pmd()
handle_pte_fault()
dax_iomap_pte_fault() inserts PTE
dax_iomap_pmd_fault() inserts PMD,
but we already have a PTE at
this spot.
The core of the issue is that while there is isolation between faults to
the same range in the DAX fault handlers via our DAX entry locking,
there is no isolation between faults in the code in mm/memory.c. This
means for instance that this code in __handle_mm_fault() can run:
if (pmd_none(*vmf.pmd) && transparent_hugepage_enabled(vma)) {
ret = create_huge_pmd(&vmf);
But by the time we actually get to run the fault handler called by
create_huge_pmd(), the PMD is no longer pmd_none() because a racing PTE
fault has installed a normal PMD here as a parent. This is the cause of
the 2nd race. The first race is similar - there is the following check
in handle_pte_fault():
} else {
/* See comment in pte_alloc_one_map() */
if (pmd_devmap(*vmf->pmd) || pmd_trans_unstable(vmf->pmd))
return 0;
So if a pmd_devmap() PMD (a DAX PMD) has been installed at vmf->pmd, we
will bail and retry the fault. This is correct, but there is nothing
preventing the PMD from being installed after this check but before we
actually get to the DAX PTE fault handlers.
In my testing these races result in the following types of errors:
BUG: Bad rss-counter state mm:ffff8800a817d280 idx:1 val:1
BUG: non-zero nr_ptes on freeing mm: 15
Fix this issue by having the DAX fault handlers verify that it is safe
to continue their fault after they have taken an entry lock to block
other racing faults.
[ross.zwisler@linux.intel.com: improve fix for colliding PMD & PTE entries]
Link: http://lkml.kernel.org/r/20170526195932.32178-1-ross.zwisler@linux.intel.com
Link: http://lkml.kernel.org/r/20170522215749.23516-2-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reported-by: Pawel Lebioda <pawel.lebioda@intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Pawel Lebioda <pawel.lebioda@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Xiong Zhou <xzhou@redhat.com>
Cc: Eryu Guan <eguan@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- Fix an unmount hang due to a race in io buffer accounting.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABCgAGBQJZMKVEAAoJEPh/dxk0SrTrBYcQAKSpzE8C9wDBw6cyxP3kwrTr
FSQiSr7flnGBHwy2U0UC/SIFIwYxvW4BTnXJWADyqtnvLWP1+TC7UY1oNpkTsbkK
KLsWgz3aOcT/8sb346PzFDAuxof2lkv3xFPRBFaoeSkybxWqLz6BWsbmaJNH/wqy
W3k3H241mAftEiv1i9IUlAZMXE31qywIKzzUJvkOglXS8OdVFfMPQvUz6epU2LWA
I2tBip936Sl45vLu6ubqoRpk8dWNuPPX+f4YXl8dVeqRKTYhviMwgYD4rlljb6Ti
kIRG9HYg1GVZo5z/5unAjyEaKzYoRrXnO5Lg+i09NIhezlDhB2HJ+k71NljoeHoe
YCwqumQIGgnxdFu+FP10tKh2EWvDp80SQxgzIvr+FCCKJdsdNYyftRh4CtsCPJSG
xWHT1jgovygHsBEEmG2LS9mCXKkyWgMkHNMBu3Yy/F/4HGzrPjcU3F+x90OmOo7J
S26kEwsAoo+Q5Is8QkmqrnD+CQ7jwXEv9Mw3UqRwQ7UagRdR2nI8CIGEC7W+42Mm
Gd3TtAyJCbhZWXNq7pLeTnGu7JY3/dhR/8VSW+mIKtvFg7v9O1wZBYId8vTwZN1+
8jgnW0h6myE10YKU5bc1TZeYYAkWA+JLRKxoexL3QD8jWeffyZgMNWPM2rb+4Jjp
2wwCHMPvHE8X7a2urTW3
=wRbJ
-----END PGP SIGNATURE-----
Merge tag 'xfs-4.12-fixes-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull XFS fix from Darrick Wong:
"I've one more bugfix for you for 4.12-rc4: Fix an unmount hang due to
a race in io buffer accounting"
* tag 'xfs-4.12-fixes-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: use ->b_state to fix buffer I/O accounting release race
fix a crash that was likely a result of buggy client behavior.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJZMHDGAAoJECebzXlCjuG+wVAP/RC2THsrHEfWQSrc+/wkKron
7PUZo6VRhoasjBInSJB/tdy+Yb82NbfLoXfJ71ddAwRUZlte74aI762HHuMdWtHY
8mCum5ea1AfRX5N/L/isO6lh4utO0vQEJ8r+P095d3EDwl0DnLYC3JVlKd/1r2VS
ELy8DZkyaVHZO9xiT+mnRgsq4aMjxG3F7DTHpcKDDFzG5Ts00zBQIXDu/rKmw3fD
WEuQjjrit1gFrUIUzJbSqwSokDCcf7v9HtGTI5+t+pIZ4Q2SyuKuTZvjtg+hb7Qa
K+F2SNNQsqfTW65zllhVR3gYpCykoqYPAJDw9MlqLN5tCmXFZLYhFHEUFx5kuobx
7+Dc3z1o5BgOiXcnKBVe+uONxXxcMYXLbU0e5Gac39GYW5xWzrU1+O6mMi0Q01YS
QsGRZEqHE2/3j1TAl0Q2SqT8gtG+A7piU4s5VavIHKIzI3/WubZ1GjLQ+RfXjuNa
DvkcAvSYfHyxzdWlyxjkzM09edt6SN3yEYdIRv9hiJEbUO3itVm9ycXTHLJUQUL0
sfVeXkm49e8gZZxHn+XuJubkT8HYlDGLQVSzK1zWFgt+zxd9LiP9iY+zs+vL9ryJ
DM9VmlJxZvNx9T7zSradW7gbIwOgxmBfRHFD05oODS1Tymb029akuU0YACb0sVnQ
LzDaZejUmURp7vlUffFp
=wznG
-----END PGP SIGNATURE-----
Merge tag 'nfsd-4.12-1' of git://linux-nfs.org/~bfields/linux
Pull nfsd fixes from Bruce Fields:
"Revert patch accidentally included in the merge window pull request,
and fix a crash that was likely a result of buggy client behavior"
* tag 'nfsd-4.12-1' of git://linux-nfs.org/~bfields/linux:
nfsd4: fix null dereference on replay
nfsd: Revert "nfsd: check for oversized NFSv2/v3 arguments"
Use ERR_CAST() to avoid cross-structure cast in ocf2, ntfs, and NFS.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
Comment: Kees Cook <kees@outflux.net>
iQIcBAABCgAGBQJZMHWdAAoJEIly9N/cbcAmWOYP/i45fa6JG7Aw9N59Uz4sqeUQ
ZUlvAUek6GkaGijCPtDYjy0cVj2Cc3QZLSRq9dDw/rU66Mc0ybYWHtIIwJy4ZjVe
D4w2Cs7K1oSOnhJnPTjQSKuMD81PF75NLChf3XSfLvtOWVIqW33EzLIu5lJ1rc1x
wh1fEAsJXGA9xklmW+m8Vn1FoS1a1j+9zuCEmGpveOkk6UKhhp73Ke8PP4uK9ld+
saApe/iH0JdTP6I7030A8hXwz7ZCYbMicw1kVpnsn4rM24p+k3Y2/OrFT2tY6/Y6
fzkTuVL7omQmUWph9zX6SYPg2GACEBTLb5V1YJ6zDUUzucu7vjfsvsTHXZb1gq2j
i8hZ6XsNOMWYJiOkOOSKM0rpjG6WSvF/sGc78ap7NJ4QPZ2/h3BTOXfk/ye/xQmL
WidEESJ4srInpi5ju8JTWHe27aydwiUUF91Y+gFv4G6CGU6/5vjUzOsgeiMxt0JN
lPaTjjL4lBHI2yohx2Wqy88yYWulK3LB0Hzt9XcSGMBA58H9d0CV0ZTkH3dJJkpC
QCM+Kt1DPy5A2RPC2APrPPCJsQycX9PSDeRaWkTxHnNLftpq65h1pAKjMcqsUPgb
HEEMLIBGqm871dr3+aPJPfG3Qil9ANBscDRbHXugCFTseFQO6M26KAxWGN+6LIQp
6Z0GUaPgJEua9ejodq4m
=R3qn
-----END PGP SIGNATURE-----
Merge tag 'gcc-plugins-v4.12-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull gcc-plugin prepwork from Kees Cook:
"Use designated initializers for mtk-vcodec, powerplay, amdgpu, and
sgi-xp. Use ERR_CAST() to avoid cross-structure cast in ocf2, ntfs,
and NFS.
Christoph Hellwig recommended that I send these fixes now, rather than
waiting for the v4.13 merge window. These are all initializer and cast
fixes needed for the future randstruct plugin that haven't been picked
up by the respective maintainers"
* tag 'gcc-plugins-v4.12-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
mtk-vcodec: Use designated initializers
drm/amd/powerplay: Use designated initializers
drm/amdgpu: Use designated initializers
sgi-xp: Use designated initializers
ocfs2: Use ERR_CAST() to avoid cross-structure cast
ntfs: Use ERR_CAST() to avoid cross-structure cast
NFS: Use ERR_CAST() to avoid cross-structure cast
Since using scsi_req() is only allowed against request queues for
which struct scsi_request is the first member of their private
request data, refuse to submit SCSI commands against a queue for
which this is not the case.
References: commit 82ed4db499 ("block: split scsi_request out of struct request")
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: J. Bruce Fields <bfields@redhat.com>
Cc: Jeff Layton <jlayton@poochiereds.net>
Cc: Omar Sandoval <osandov@fb.com>
Cc: linux-nfs@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
Pull Reiserfs and GFS2 fixes from Jan Kara:
"Fixes to GFS2 & Reiserfs for the fallout of the recent WRITE_FUA
cleanup from Christoph.
Fixes for other filesystems were already merged by respective
maintainers."
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
reiserfs: Make flush bios explicitely sync
gfs2: Make flush bios explicitely sync
Instead write a proper compat syscall that calls common helpers.
[ jlayton: fix pointer dereferencing in fixup_compat_flock ]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
If we have to recover relocation during mount, we'll ultimately have to
evict the orphan inode. That goes through the reservation dance, where
priority_reclaim_metadata_space and flush_space expect fs_info->fs_root
to be valid. That's the next thing to be set up during mount, so we
crash, almost always in flush_space trying to join the transaction
but priority_reclaim_metadata_space is possible as well. This call
path has been problematic in the past WRT whether ->fs_root is valid
yet. Commit 957780eb27 (Btrfs: introduce ticketed enospc
infrastructure) added new users that are called in the direct path
instead of the async path that had already been worked around.
The thing is that we don't actually need the fs_root, specifically, for
anything. We either use it to determine whether the root is the
chunk_root for use in choosing an allocation profile or as a root to pass
btrfs_join_transaction before immediately committing it. Anything that
isn't the chunk root works in the former case and any root works in
the latter.
A simple fix is to use a root we know will always be there: the
extent_root.
Cc: <stable@vger.kernel.org> # v4.8+
Fixes: 957780eb27 (Btrfs: introduce ticketed enospc infrastructure)
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we fail to add the space_info kobject, we'll leak the memory
for the percpu counter.
Fixes: 6ab0a2029c (btrfs: publish allocation data in sysfs)
Cc: <stable@vger.kernel.org> # v3.14+
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Variables start_idx and end_idx are supposed to hold a page index
derived from the file offsets. The int type is not the right one though,
offsets larger than 1 << 44 will get silently trimmed off the high bits.
(1 << 44 is 16TiB)
What can go wrong, if start is below the boundary and end gets trimmed:
- if there's a page after start, we'll find it (radix_tree_gang_lookup_slot)
- the final check "if (page->index <= end_idx)" will unexpectedly fail
The function will return false, ie. "there's no page in the range",
although there is at least one.
btrfs_page_exists_in_range is used to prevent races in:
* in hole punching, where we make sure there are not pages in the
truncated range, otherwise we'll wait for them to finish and redo
truncation, but we're going to replace the pages with holes anyway so
the only problem is the intermediate state
* lock_extent_direct: we want to make sure there are no pages before we
lock and start DIO, to prevent stale data reads
For practical occurence of the bug, there are several constaints. The
file must be quite large, the affected range must cross the 16TiB
boundary and the internal state of the file pages and pending operations
must match. Also, we must not have started any ordered data in the
range, otherwise we don't even reach the buggy function check.
DIO locking tries hard in several places to avoid deadlocks with
buffered IO and avoids waiting for ranges. The worst consequence seems
to be stale data read.
CC: Liu Bo <bo.li.liu@oracle.com>
CC: stable@vger.kernel.org # 3.16+
Fixes: fc4adbff82 ("btrfs: Drop EXTENT_UPTODATE check in hole punching and direct locking")
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The current time will be initially available in the record->time field
for all pstore_read() and pstore_write() calls. Backends can either
update the field during read(), or use the field during write() instead
of fetching time themselves.
Signed-off-by: Kees Cook <keescook@chromium.org>
In preparation for setting timestamps in the pstore core, create a common
initializer routine, instead of using static initializers.
Signed-off-by: Kees Cook <keescook@chromium.org>
If a backend does not correctly iterate through its records, pstore will
get stuck loading entries. Detect this with a large record count, and
announce if we ever hit the limit. This will let future backend reading
bugs less annoying to debug. Additionally adjust the error about
pstore_mkfile() failing.
Signed-off-by: Kees Cook <keescook@chromium.org>
When the "if (record->size <= 0)" test is true in
pstore_get_backend_records() it's pretty clear that nobody holds a
reference to the allocated pstore_record, yet we don't free it.
Let's free it.
Fixes: 2a2b0acf76 ("pstore: Allocate records on heap instead of stack")
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: stable@vger.kernel.org
commit 9abdcccc3d ("pstore: Extract common arguments into structure")
moved record decompression to function. decompress_record() gets
called without checking type and compressed flag. Warning will be
reported if data is uncompressed. Pstore type PSTORE_TYPE_PPC_OPAL,
PSTORE_TYPE_PPC_COMMON doesn't contain compressed data and warning get
printed part of dmesg.
Partial dmesg log:
[ 35.848914] pstore: ignored compressed record type 6
[ 35.848927] pstore: ignored compressed record type 8
Above warning should not get printed as it is known that data won't be
compressed for above type and it is valid condition.
This patch returns if data is not compressed and print warning only if
data is compressed and type is not PSTORE_TYPE_DMESG.
Reported-by: Anton Blanchard <anton@au1.ibm.com>
Signed-off-by: Ankit Kumar <ankit@linux.vnet.ibm.com>
Reviewed-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Fixes: 9abdcccc3d ("pstore: Extract common arguments into structure")
Cc: stable@vger.kernel.org
Pull overlayfs fixes from Miklos Szeredi:
"Fix regressions:
- missing CONFIG_EXPORTFS dependency
- failure if upper fs doesn't support xattr
- bad error cleanup
This also adds the concept of "impure" directories complementing the
"origin" marking introduced in -rc1. Together they enable getting
consistent st_ino and d_ino for directory listings.
And there's a bug fix and a cleanup as well"
* 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
ovl: filter trusted xattr for non-admin
ovl: mark upper merge dir with type origin entries "impure"
ovl: mark upper dir with type origin entries "impure"
ovl: remove unused arg from ovl_lookup_temp()
ovl: handle rename when upper doesn't support xattr
ovl: don't fail copy-up if upper doesn't support xattr
ovl: check on mount time if upper fs supports setting xattr
ovl: fix creds leak in copy up error path
ovl: select EXPORTFS
We've had user reports of unmount hangs in xfs_wait_buftarg() that
analysis shows is due to btp->bt_io_count == -1. bt_io_count
represents the count of in-flight asynchronous buffers and thus
should always be >= 0. xfs_wait_buftarg() waits for this value to
stabilize to zero in order to ensure that all untracked (with
respect to the lru) buffers have completed I/O processing before
unmount proceeds to tear down in-core data structures.
The value of -1 implies an I/O accounting decrement race. Indeed,
the fact that xfs_buf_ioacct_dec() is called from xfs_buf_rele()
(where the buffer lock is no longer held) means that bp->b_flags can
be updated from an unsafe context. While a user-level reproducer is
currently not available, some intrusive hacks to run racing buffer
lookups/ioacct/releases from multiple threads was used to
successfully manufacture this problem.
Existing callers do not expect to acquire the buffer lock from
xfs_buf_rele(). Therefore, we can not safely update ->b_flags from
this context. It turns out that we already have separate buffer
state bits and associated serialization for dealing with buffer LRU
state in the form of ->b_state and ->b_lock. Therefore, replace the
_XBF_IN_FLIGHT flag with a ->b_state variant, update the I/O
accounting wrappers appropriately and make sure they are used with
the correct locking. This ensures that buffer in-flight state can be
modified at buffer release time without racing with modifications
from a buffer lock holder.
Fixes: 9c7504aa72 ("xfs: track and serialize in-flight async buffers against unmount")
Cc: <stable@vger.kernel.org> # v4.8+
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Libor Pechacek <lpechacek@suse.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
This effectively reverts commit 8ee74a91ac ("proc: try to remove use
of FOLL_FORCE entirely")
It turns out that people do depend on FOLL_FORCE for the /proc/<pid>/mem
case, and we're talking not just debuggers. Talking to the affected people, the use-cases are:
Keno Fischer:
"We used these semantics as a hardening mechanism in the julia JIT. By
opening /proc/self/mem and using these semantics, we could avoid
needing RWX pages, or a dual mapping approach. We do have fallbacks to
these other methods (though getting EIO here actually causes an assert
in released versions - we'll updated that to make sure to take the
fall back in that case).
Nevertheless the /proc/self/mem approach was our favored approach
because it a) Required an attacker to be able to execute syscalls
which is a taller order than getting memory write and b) didn't double
the virtual address space requirements (as a dual mapping approach
would).
I think in general this feature is very useful for anybody who needs
to precisely control the execution of some other process. Various
debuggers (gdb/lldb/rr) certainly fall into that category, but there's
another class of such processes (wine, various emulators) which may
want to do that kind of thing.
Now, I suspect most of these will have the other process under ptrace
control, so maybe allowing (same_mm || ptraced) would be ok, but at
least for the sandbox/remote-jit use case, it would be perfectly
reasonable to not have the jit server be a ptracer"
Robert O'Callahan:
"We write to readonly code and data mappings via /proc/.../mem in lots
of different situations, particularly when we're adjusting program
state during replay to match the recorded execution.
Like Julia, we can add workarounds, but they could be expensive."
so not only do people use FOLL_FORCE for both reads and writes, but they
use it for both the local mm and remote mm.
With these comments in mind, we likely also cannot add the "are we
actively ptracing" check either, so this keeps the new code organization
and does not do a real revert that would add back the original comment
about "Maybe we should limit FOLL_FORCE to actual ptrace users?"
Reported-by: Keno Fischer <keno@juliacomputing.com>
Reported-by: Robert O'Callahan <robert@ocallahan.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Eric Biederman <ebiederm@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently, extent manipulation operations such as hole punch, range
zeroing, or extent shifting do not record the fact that file data has
changed and thus fdatasync(2) has a work to do. As a result if we crash
e.g. after a punch hole and fdatasync, user can still possibly see the
punched out data after journal replay. Test generic/392 fails due to
these problems.
Fix the problem by properly marking that file data has changed in these
operations.
CC: stable@vger.kernel.org
Fixes: a4bb6b64e3
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Filesystems filter out extended attributes in the "trusted." domain for
unprivlieged callers.
Overlay calls underlying filesystem's method with elevated privs, so need
to do the filtering in overlayfs too.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
An upper dir is marked "impure" to let ovl_iterate() know that this
directory may contain non pure upper entries whose d_ino may need to be
read from the origin inode.
We already mark a non-merge dir "impure" when moving a non-pure child
entry inside it, to let ovl_iterate() know not to iterate the non-merge
dir directly.
Mark also a merge dir "impure" when moving a non-pure child entry inside
it and when copying up a child entry inside it.
This can be used to optimize ovl_iterate() to perform a "pure merge" of
upper and lower directories, merging the content of the directories,
without having to read d_ino from origin inodes.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
When trying to propagate an error result, the error return path attempts
to retain the error, but does this with an open cast across very different
types, which the upcoming structure layout randomization plugin flags as
being potentially dangerous in the face of randomization. This is a false
positive, but what this code actually wants to do is use ERR_CAST() to
retain the error value.
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
When trying to propagate an error result, the error return path attempts
to retain the error, but does this with an open cast across very different
types, which the upcoming structure layout randomization plugin flags as
being potentially dangerous in the face of randomization. This is a false
positive, but what this code actually wants to do is use ERR_CAST() to
retain the error value.
Cc: Anton Altaparmakov <anton@tuxera.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
When the call to nfs_devname() fails, the error path attempts to retain
the error via the mnt variable, but this requires a cast across very
different types (char * to struct vfsmount *), which the upcoming
structure layout randomization plugin flags as being potentially
dangerous in the face of randomization. This is a false positive, but
what this code actually wants to do is retain the error value, so this
patch explicitly sets it, instead of using what seems to be an
unexpected cast.
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Trond Myklebust <trond.myklebust@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
This will make it easier to implement a sane compat fcntl syscall.
[ jlayton: fix undeclared identifiers in 32-bit fcntl64 syscall handler ]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
There are a few syntax violations that cause outputs of
a few comments to not be properly parsed in ReST format.
No functional changes.
Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
mpage_submit_page() can race with another process growing i_size and
writing data via mmap to the written-back page. As mpage_submit_page()
samples i_size too early, it may happen that ext4_bio_write_page()
zeroes out too large tail of the page and thus corrupts user data.
Fix the problem by sampling i_size only after the page has been
write-protected in page tables by clear_page_dirty_for_io() call.
Reported-by: Michael Zimmer <michael@swarm64.com>
CC: stable@vger.kernel.org
Fixes: cb20d51883
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
When ext4_map_blocks() is called with EXT4_GET_BLOCKS_ZERO to zero-out
allocated blocks and these blocks are actually converted from unwritten
extent the following race can happen:
CPU0 CPU1
page fault page fault
... ...
ext4_map_blocks()
ext4_ext_map_blocks()
ext4_ext_handle_unwritten_extents()
ext4_ext_convert_to_initialized()
- zero out converted extent
ext4_zeroout_es()
- inserts extent as initialized in status tree
ext4_map_blocks()
ext4_es_lookup_extent()
- finds initialized extent
write data
ext4_issue_zeroout()
- zeroes out new extent overwriting data
This problem can be reproduced by generic/340 for the fallocated case
for the last block in the file.
Fix the problem by avoiding zeroing out the area we are mapping with
ext4_map_blocks() in ext4_ext_convert_to_initialized(). It is pointless
to zero out this area in the first place as the caller asked us to
convert the area to initialized because he is just going to write data
there before the transaction finishes. To achieve this we delete the
special case of zeroing out full extent as that will be handled by the
cases below zeroing only the part of the extent that needs it. We also
instruct ext4_split_extent() that the middle of extent being split
contains data so that ext4_split_extent_at() cannot zero out full extent
in case of ENOSPC.
CC: stable@vger.kernel.org
Fixes: 12735f8819
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
- Fix indlen block reservation accounting bug when splitting delalloc extent
- Fix warnings about unused variables that appeared in -rc1.
- Don't spew errors when bmapping a local format directory
- Fix an off-by-one error in a delalloc eof assertion
- Make fsmap only return inode information for CAP_SYS_ADMIN
- Fix a potential mount time deadlock recovering cow extents
- Fix unaligned memory access in _btree_visit_blocks
- Fix various SEEK_HOLE/SEEK_DATA bugs
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABCgAGBQJZJwnxAAoJEPh/dxk0SrTr/TMQAKP6OMsjYxpro+1Uif+oPTQ6
vvUfXJMWLKc07QI/czwLDY4A36h2TZjNxpBJypSfVumlD82ZPa8gp6XFWngwIUb4
3G+A9zq4Fviq8Vzz3G75C8Q49h8IpmU3SimTlhS1BIcxe+upu2qplzM3yc6/T4MB
WTTqtjL3SaW5D2v0ZdPL9ulQKKAlL1WfbZV9dDJ4UiRw5Jlwj2Udg6HnbRvfrcZF
IziYlidrTIt64ecA9GqR32soXqFBGPKo6Wp9Pk+iWLlsfM6qcCt1m+yfM1JonRGA
wycygcrrjfR/lFHMQCGonLs1ajC6isLeMZ804P6OP2q6kfdtersedvY7XSoYsEJ4
ok4J3fiyqYgMGhPz7x0Y8IH9+gdudn7+fHiC5/RNkolEy8AbPPe21XhFDVxeTkCs
4GAHNGQfOEK2PT69Ya81taVzT/TpuIGIkUAaDH8vsfxwcVunM08/OffsCiinLMJx
bt3G7fH3wJ+VuYJS92amj3k6n6EAeHYc0dAVGd5e8dtN25079nBm+EP0Wp+j8uVl
PwaJjde68wxWUvuYXVK1a8vietRS7xChyta34cYcStd4wWu1knccpN/mjQnK/ucB
4etZspB1rQQx08KBqHVq8t508PA7nWtFxjE91JYkpvbyYym1WEH8Mz7rbVBI6NjS
Y/8+uPhFq2BU1b9skj0U
=pDjl
-----END PGP SIGNATURE-----
Merge tag 'xfs-4.12-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull XFS fixes from Darrick Wong:
"A few miscellaneous bug fixes & cleanups:
- Fix indlen block reservation accounting bug when splitting delalloc
extent
- Fix warnings about unused variables that appeared in -rc1.
- Don't spew errors when bmapping a local format directory
- Fix an off-by-one error in a delalloc eof assertion
- Make fsmap only return inode information for CAP_SYS_ADMIN
- Fix a potential mount time deadlock recovering cow extents
- Fix unaligned memory access in _btree_visit_blocks
- Fix various SEEK_HOLE/SEEK_DATA bugs"
* tag 'xfs-4.12-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: Move handling of missing page into one place in xfs_find_get_desired_pgoff()
xfs: Fix off-by-in in loop termination in xfs_find_get_desired_pgoff()
xfs: Fix missed holes in SEEK_HOLE implementation
xfs: fix off-by-one on max nr_pages in xfs_find_get_desired_pgoff()
xfs: fix unaligned access in xfs_btree_visit_blocks
xfs: avoid mount-time deadlock in CoW extent recovery
xfs: only return detailed fsmap info if the caller has CAP_SYS_ADMIN
xfs: bad assertion for delalloc an extent that start at i_size
xfs: fix warnings about unused stack variables
xfs: BMAPX shouldn't barf on inline-format directories
xfs: fix indlen accounting error on partial delalloc conversion
Currently several places in xfs_find_get_desired_pgoff() handle the case
of a missing page. Make them all handled in one place after the loop has
terminated.
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
There is an off-by-one error in loop termination conditions in
xfs_find_get_desired_pgoff() since 'end' may index a page beyond end of
desired range if 'endoff' is page aligned. It doesn't have any visible
effects but still it is good to fix it.
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
XFS SEEK_HOLE implementation could miss a hole in an unwritten extent as
can be seen by the following command:
xfs_io -c "falloc 0 256k" -c "pwrite 0 56k" -c "pwrite 128k 8k"
-c "seek -h 0" file
wrote 57344/57344 bytes at offset 0
56 KiB, 14 ops; 0.0000 sec (49.312 MiB/sec and 12623.9856 ops/sec)
wrote 8192/8192 bytes at offset 131072
8 KiB, 2 ops; 0.0000 sec (70.383 MiB/sec and 18018.0180 ops/sec)
Whence Result
HOLE 139264
Where we can see that hole at offset 56k was just ignored by SEEK_HOLE
implementation. The bug is in xfs_find_get_desired_pgoff() which does
not properly detect the case when pages are not contiguous.
Fix the problem by properly detecting when found page has larger offset
than expected.
CC: stable@vger.kernel.org
Fixes: d126d43f63
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
xfs_find_get_desired_pgoff() is used to search for offset of hole or
data in page range [index, end] (both inclusive), and the max number
of pages to search should be at least one, if end == index.
Otherwise the only page is missed and no hole or data is found,
which is not correct.
When block size is smaller than page size, this can be demonstrated
by preallocating a file with size smaller than page size and writing
data to the last block. E.g. run this xfs_io command on a 1k block
size XFS on x86_64 host.
# xfs_io -fc "falloc 0 3k" -c "pwrite 2k 1k" \
-c "seek -d 0" /mnt/xfs/testfile
wrote 1024/1024 bytes at offset 2048
1 KiB, 1 ops; 0.0000 sec (33.675 MiB/sec and 34482.7586 ops/sec)
Whence Result
DATA EOF
Data at offset 2k was missed, and lseek(2) returned ENXIO.
This is uncovered by generic/285 subtest 07 and 08 on ppc64 host,
where pagesize is 64k. Because a recent change to generic/285
reduced the preallocated file size to smaller than 64k.
Cc: stable@vger.kernel.org # v3.7+
Signed-off-by: Eryu Guan <eguan@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
This structure copy was throwing unaligned access warnings on sparc64:
Kernel unaligned access at TPC[1043c088] xfs_btree_visit_blocks+0x88/0xe0 [xfs]
xfs_btree_copy_ptrs does a memcpy, which avoids it.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
ext4_xattr_block_set() calls dquot_alloc_block() to charge for an xattr
block when new references are made. However if dquot_initialize() hasn't
been called on an inode, request for charging is effectively ignored
because ext4_inode_info->i_dquot is not initialized yet.
Add dquot_initialize() to call paths that lead to ext4_xattr_block_set().
Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Currently we don't allow direct I/O on encrypted regular files, so in
such cases we return 0 early in ext4_direct_IO(). There was also an
additional BUG_ON() check in ext4_direct_IO_write(), but it can never be
hit because of the earlier check for the exact same condition in
ext4_direct_IO(). There was also no matching check on the read path,
which made the write path specific check seem very ad-hoc.
Just remove the unnecessary BUG_ON().
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: David Gstir <david@sigma-star.at>
Reviewed-by: Jan Kara <jack@suse.cz>
Now that we are passing a struct ext4_filename, we do not need to pass
around the original struct qstr too.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
The 'lend' argument of filemap_write_and_wait_range() is inclusive, so
we need to subtract 1 from pos + count.
Note that 'count' is guaranteed to be nonzero since
ext4_file_read_iter() returns early when given a 0 count.
Fixes: 16c5468859 ("ext4: Allow parallel DIO reads")
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
ext4_find_unwritten_pgoff() is used to search for offset of hole or
data in page range [index, end] (both inclusive), and the max number
of pages to search should be at least one, if end == index.
Otherwise the only page is missed and no hole or data is found,
which is not correct.
When block size is smaller than page size, this can be demonstrated
by preallocating a file with size smaller than page size and writing
data to the last block. E.g. run this xfs_io command on a 1k block
size ext4 on x86_64 host.
# xfs_io -fc "falloc 0 3k" -c "pwrite 2k 1k" \
-c "seek -d 0" /mnt/ext4/testfile
wrote 1024/1024 bytes at offset 2048
1 KiB, 1 ops; 0.0000 sec (42.459 MiB/sec and 43478.2609 ops/sec)
Whence Result
DATA EOF
Data at offset 2k was missed, and lseek(2) returned ENXIO.
This is unconvered by generic/285 subtest 07 and 08 on ppc64 host,
where pagesize is 64k. Because a recent change to generic/285
reduced the preallocated file size to smaller than 64k.
Signed-off-by: Eryu Guan <eguan@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Currently the ceph client doesn't respect the rlimit in fallocate. This
means that a user can allocate a file with size > RLIMIT_FSIZE. This
patch adds the call to inode_newsize_ok() to verify filesystem limits and
ulimits. This should make ceph successfully run xfstest generic/228.
Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
It's possible and acceptable for NFS to attempt to add requests beyond the
range of the current pgio->pg_lseg, a case which should be caught and
limited by the pg_test operation. However, the current handling of this
case replaces pgio->pg_lseg with a new layout segment (after a WARN) within
that pg_test operation. That will cause all the previously added requests
to be submitted with this new layout segment, which may not be valid for
those requests.
Fix this problem by only returning zero for the number of bytes to coalesce
from pg_test for this case which allows any previously added requests to
complete on the current layout segment. The check for requests starting
out of range of the layout segment moves to pg_init, so that the
replacement of pgio->pg_lseg will be done when the next request is added.
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
If xdr_inline_decode() fails then we end up returning ERR_PTR(0). The
caller treats NULL returns as -ENOMEM so it doesn't really hurt runtime,
but obviously we intended to set an error code here.
Fixes: d67ae825a5 ("pnfs/flexfiles: Add the FlexFile Layout Driver")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Commit b685d3d65a "block: treat REQ_FUA and REQ_PREFLUSH as
synchronous" removed REQ_SYNC flag from WRITE_{FUA|PREFLUSH|...}
definitions. generic_make_request_checks() however strips REQ_FUA and
REQ_PREFLUSH flags from a bio when the storage doesn't report volatile
write cache and thus write effectively becomes asynchronous which can
lead to performance regressions
Fix the problem by making sure all bios which are synchronous are
properly marked with REQ_SYNC.
Fixes: b685d3d65a
CC: reiserfs-devel@vger.kernel.org
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Commit b685d3d65a "block: treat REQ_FUA and REQ_PREFLUSH as
synchronous" removed REQ_SYNC flag from WRITE_{FUA|PREFLUSH|...}
definitions. generic_make_request_checks() however strips REQ_FUA and
REQ_PREFLUSH flags from a bio when the storage doesn't report volatile
write cache and thus write effectively becomes asynchronous which can
lead to performance regressions
Fix the problem by making sure all bios which are synchronous are
properly marked with REQ_SYNC.
Fixes: b685d3d65a
CC: Steven Whitehouse <swhiteho@redhat.com>
CC: cluster-devel@redhat.com
CC: stable@vger.kernel.org
Acked-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Since only an open file can be written to, and we only allow open()ing
an encrypted file when its key is available, there is no need to check
for the key again before permitting each ->write_iter().
This code was also broken in that it wouldn't actually have failed if
the key was in fact unavailable.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: David Gstir <david@sigma-star.at>
Acked-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Since only an open file can be mmap'ed, and we only allow open()ing an
encrypted file when its key is available, there is no need to check for
the key again before permitting each mmap().
This f2fs copy of this code was also broken in that it wouldn't actually
have failed if the key was in fact unavailable.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: David Gstir <david@sigma-star.at>
Acked-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Wait discard IO completion outside cmd_lock to avoid long latency
of holding cmd_lock in IO busy scenario.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
There could be more than one waiter waiting discard IO completion, so we
need use complete_all() instead of complete() in f2fs_submit_discard_endio
to avoid hungtask.
Fixes: ec9895add2 ("f2fs: don't hold cmd_lock during waiting discard
command")
Cc: <stable@vger.kernel.org>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Serialize data/node IOs by using fifo list instead of mutex lock,
it will help to enhance concurrency of f2fs, meanwhile keeping LFS
IO semantics.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Came across a hungtask caused by huge number of rb tree traversing
during adding discard addrs in cp. This patch combine these consistence
checks and move it to discard thread.
Signed-off-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch add a new method get_ssr_cost to select
SSR segment more accurately.
Signed-off-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Currently, if we do get_node_of_data before f2fs_lock_op, there may be dead lock
as follows, where process A would be in infinite loop, and B will NOT be awaked.
Process A(cp): Process B:
f2fs_lock_all(sbi)
get_dnode_of_data <---- lock dn.node_page
flush_nodes f2fs_lock_op
So, this patch adds f2fs_trylock_op to avoid f2fs_lock_op done by IPU.
Signed-off-by: Hou Pengyang <houpengyang@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Split DATA/NODE type bio cache according to different temperature,
so write IOs with the same temperature can be merged in corresponding
bio cache as much as possible, otherwise, different temperature write
IOs submitting into one bio cache will always cause split of bio.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Merged IO flow doesn't need to care about read IOs.
f2fs_submit_merged_bio -> f2fs_submit_merged_write
f2fs_submit_merged_bios -> f2fs_submit_merged_writes
f2fs_submit_merged_bio_cond -> f2fs_submit_merged_write_cond
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
By default, f2fs_gc returns -EINVAL in general error cases, e.g., no victim
was selected. However, the default errno may be overwritten in two cases:
gc_more and BG_GC -> FG_GC. We should return consistent errno in such cases.
Signed-off-by: Weichao Guo <guoweichao@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Last kernel has already support new syscall statx() in commit a528d35e8b
("statx: Add a system call to make enhanced file info available"), with
this interface we can show more file info including file creation and some
attribute flags to user.
This patch tries to support this functionality.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
if we receive a compound such that:
- the sessionid, slot, and sequence number in the SEQUENCE op
match a cached succesful reply with N ops, and
- the Nth operation of the compound is a PUTFH, PUTPUBFH,
PUTROOTFH, or RESTOREFH,
then nfsd4_sequence will return 0 and set cstate->status to
nfserr_replay_cache. The current filehandle will not be set. This will
cause us to call check_nfsd_access with first argument NULL.
To nfsd4_compound it looks like we just succesfully executed an
operation that set a filehandle, but the current filehandle is not set.
Fix this by moving the nfserr_replay_cache earlier. There was never any
reason to have it after the encode_op label, since the only case where
he hit that is when opdesc->op_func sets it.
Note that there are two ways we could hit this case:
- a client is resending a previously sent compound that ended
with one of the four PUTFH-like operations, or
- a client is sending a *new* compound that (incorrectly) shares
sessionid, slot, and sequence number with a previously sent
compound, and the length of the previously sent compound
happens to match the position of a PUTFH-like operation in the
new compound.
The second is obviously incorrect client behavior. The first is also
very strange--the only purpose of a PUTFH-like operation is to set the
current filehandle to be used by the following operation, so there's no
point in having it as the last in a compound.
So it's likely this requires a buggy or malicious client to reproduce.
Reported-by: Scott Mayhew <smayhew@redhat.com>
Cc: stable@kernel.vger.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Andrei Vagin pointed out that time to executue propagate_umount can go
non-linear (and take a ludicrious amount of time) when the mount
propogation trees of the mounts to be unmunted by a lazy unmount
overlap.
Make the walk of the mount propagation trees nearly linear by
remembering which mounts have already been visited, allowing
subsequent walks to detect when walking a mount propgation tree or a
subtree of a mount propgation tree would be duplicate work and to skip
them entirely.
Walk the list of mounts whose propgatation trees need to be traversed
from the mount highest in the mount tree to mounts lower in the mount
tree so that odds are higher that the code will walk the largest trees
first, allowing later tree walks to be skipped entirely.
Add cleanup_umount_visitation to remover the code's memory of which
mounts have been visited.
Add the functions last_slave and skip_propagation_subtree to allow
skipping appropriate parts of the mount propagation tree without
needing to change the logic of the rest of the code.
A script to generate overlapping mount propagation trees:
$ cat runs.h
set -e
mount -t tmpfs zdtm /mnt
mkdir -p /mnt/1 /mnt/2
mount -t tmpfs zdtm /mnt/1
mount --make-shared /mnt/1
mkdir /mnt/1/1
iteration=10
if [ -n "$1" ] ; then
iteration=$1
fi
for i in $(seq $iteration); do
mount --bind /mnt/1/1 /mnt/1/1
done
mount --rbind /mnt/1 /mnt/2
TIMEFORMAT='%Rs'
nr=$(( ( 2 ** ( $iteration + 1 ) ) + 1 ))
echo -n "umount -l /mnt/1 -> $nr "
time umount -l /mnt/1
nr=$(cat /proc/self/mountinfo | grep zdtm | wc -l )
time umount -l /mnt/2
$ for i in $(seq 9 19); do echo $i; unshare -Urm bash ./run.sh $i; done
Here are the performance numbers with and without the patch:
mhash | 8192 | 8192 | 1048576 | 1048576
mounts | before | after | before | after
------------------------------------------------
1025 | 0.040s | 0.016s | 0.038s | 0.019s
2049 | 0.094s | 0.017s | 0.080s | 0.018s
4097 | 0.243s | 0.019s | 0.206s | 0.023s
8193 | 1.202s | 0.028s | 1.562s | 0.032s
16385 | 9.635s | 0.036s | 9.952s | 0.041s
32769 | 60.928s | 0.063s | 44.321s | 0.064s
65537 | | 0.097s | | 0.097s
131073 | | 0.233s | | 0.176s
262145 | | 0.653s | | 0.344s
524289 | | 2.305s | | 0.735s
1048577 | | 7.107s | | 2.603s
Andrei Vagin reports fixing the performance problem is part of the
work to fix CVE-2016-6213.
Cc: stable@vger.kernel.org
Fixes: a05964f391 ("[PATCH] shared mounts handling: umount")
Reported-by: Andrei Vagin <avagin@openvz.org>
Reviewed-by: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
While investigating some poor umount performance I realized that in
the case of overlapping mount trees where some of the mounts are locked
the code has been failing to unmount all of the mounts it should
have been unmounting.
This failure to unmount all of the necessary
mounts can be reproduced with:
$ cat locked_mounts_test.sh
mount -t tmpfs test-base /mnt
mount --make-shared /mnt
mkdir -p /mnt/b
mount -t tmpfs test1 /mnt/b
mount --make-shared /mnt/b
mkdir -p /mnt/b/10
mount -t tmpfs test2 /mnt/b/10
mount --make-shared /mnt/b/10
mkdir -p /mnt/b/10/20
mount --rbind /mnt/b /mnt/b/10/20
unshare -Urm --propagation unchaged /bin/sh -c 'sleep 5; if [ $(grep test /proc/self/mountinfo | wc -l) -eq 1 ] ; then echo SUCCESS ; else echo FAILURE ; fi'
sleep 1
umount -l /mnt/b
wait %%
$ unshare -Urm ./locked_mounts_test.sh
This failure is corrected by removing the prepass that marks mounts
that may be umounted.
A first pass is added that umounts mounts if possible and if not sets
mount mark if they could be unmounted if they weren't locked and adds
them to a list to umount possibilities. This first pass reconsiders
the mounts parent if it is on the list of umount possibilities, ensuring
that information of umoutability will pass from child to mount parent.
A second pass then walks through all mounts that are umounted and processes
their children unmounting them or marking them for reparenting.
A last pass cleans up the state on the mounts that could not be umounted
and if applicable reparents them to their first parent that remained
mounted.
While a bit longer than the old code this code is much more robust
as it allows information to flow up from the leaves and down
from the trunk making the order in which mounts are encountered
in the umount propgation tree irrelevant.
Cc: stable@vger.kernel.org
Fixes: 0c56fe3142 ("mnt: Don't propagate unmounts to locked mounts")
Reviewed-by: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
ext4_expand_extra_isize() should clear only space between old and new
size.
Fixes: 6dd4ee7cab # v2.6.23
Cc: stable@vger.kernel.org
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
I've got another report about breaking ext4 by ENOMEM error returned from
ext4_mb_load_buddy() caused by memory shortage in memory cgroup.
This time inside ext4_discard_preallocations().
This patch replaces ext4_error() with ext4_warning() where errors returned
from ext4_mb_load_buddy() are not fatal and handled by caller:
* ext4_mb_discard_group_preallocations() - called before generating ENOSPC,
we'll try to discard other group or return ENOSPC into user-space.
* ext4_trim_all_free() - just stop trimming and return ENOMEM from ioctl.
Some callers cannot handle errors, thus __GFP_NOFAIL is used for them:
* ext4_discard_preallocations()
* ext4_mb_discard_lg_preallocations()
Fixes: adb7ef600c ("ext4: use __GFP_NOFAIL in ext4_free_blocks()")
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
There is an off-by-one error in loop termination conditions in
ext4_find_unwritten_pgoff() since 'end' may index a page beyond end of
desired range if 'endoff' is page aligned. It doesn't have any visible
effects but still it is good to fix it.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Currently, SEEK_HOLE implementation in ext4 may both return that there's
a hole at some offset although that offset already has data and skip
some holes during a search for the next hole. The first problem is
demostrated by:
xfs_io -c "falloc 0 256k" -c "pwrite 0 56k" -c "seek -h 0" file
wrote 57344/57344 bytes at offset 0
56 KiB, 14 ops; 0.0000 sec (2.054 GiB/sec and 538461.5385 ops/sec)
Whence Result
HOLE 0
Where we can see that SEEK_HOLE wrongly returned offset 0 as containing
a hole although we have written data there. The second problem can be
demonstrated by:
xfs_io -c "falloc 0 256k" -c "pwrite 0 56k" -c "pwrite 128k 8k"
-c "seek -h 0" file
wrote 57344/57344 bytes at offset 0
56 KiB, 14 ops; 0.0000 sec (1.978 GiB/sec and 518518.5185 ops/sec)
wrote 8192/8192 bytes at offset 131072
8 KiB, 2 ops; 0.0000 sec (2 GiB/sec and 500000.0000 ops/sec)
Whence Result
HOLE 139264
Where we can see that hole at offsets 56k..128k has been ignored by the
SEEK_HOLE call.
The underlying problem is in the ext4_find_unwritten_pgoff() which is
just buggy. In some cases it fails to update returned offset when it
finds a hole (when no pages are found or when the first found page has
higher index than expected), in some cases conditions for detecting hole
are just missing (we fail to detect a situation where indices of
returned pages are not contiguous).
Fix ext4_find_unwritten_pgoff() to properly detect non-contiguous page
indices and also handle all cases where we got less pages then expected
in one place and handle it properly there.
CC: stable@vger.kernel.org
Fixes: c8c0df241c
CC: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
When a transaction starts, start_this_handle() saves current
PF_MEMALLOC_NOFS value so that it can be restored at journal stop time.
Journal restart is a special case that calls start_this_handle() without
stopping the transaction. start_this_handle() isn't aware that the
original value is already stored so it overwrites it with current value.
For instance, a call sequence like below leaves PF_MEMALLOC_NOFS flag set
at the end:
jbd2_journal_start()
jbd2__journal_restart()
jbd2_journal_stop()
Make jbd2__journal_restart() restore the original value before calling
start_this_handle().
Fixes: 81378da64d ("jbd2: mark the transaction context with the scope GFP_NOFS context")
Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Quota files have special ranking of i_data_sem lock. We inform lockdep
about it when turning on quotas however when turning quotas off, we
don't clear the lockdep subclass from i_data_sem lock and thus when the
inode gets later reused for a normal file or directory, lockdep gets
confused and complains about possible deadlocks. Fix the problem by
resetting lockdep subclass of i_data_sem on quota off.
Cc: stable@vger.kernel.org
Fixes: daf647d2dd
Reported-and-tested-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Pull block fixes from Jens Axboe:
"A small collection of fixes that should go into this cycle.
- a pull request from Christoph for NVMe, which ended up being
manually applied to avoid pulling in newer bits in master. Mostly
fibre channel fixes from James, but also a few fixes from Jon and
Vijay
- a pull request from Konrad, with just a single fix for xen-blkback
from Gustavo.
- a fuseblk bdi fix from Jan, fixing a regression in this series with
the dynamic backing devices.
- a blktrace fix from Shaohua, replacing sscanf() with kstrtoull().
- a request leak fix for drbd from Lars, fixing a regression in the
last series with the kref changes. This will go to stable as well"
* 'for-linus' of git://git.kernel.dk/linux-block:
nvmet: release the sq ref on rdma read errors
nvmet-fc: remove target cpu scheduling flag
nvme-fc: stop queues on error detection
nvme-fc: require target or discovery role for fc-nvme targets
nvme-fc: correct port role bits
nvme: unmap CMB and remove sysfs file in reset path
blktrace: fix integer parse
fuseblk: Fix warning in super_setup_bdi_name()
block: xen-blkback: add null check to avoid null pointer dereference
drbd: fix request leak introduced by locking/atomic, kref: Kill kref_sub()
Pull libnvdimm fixes from Dan Williams:
"A couple of compile fixes.
With the removal of the ->direct_access() method from
block_device_operations in favor of a new dax_device + dax_operations
we broke two configurations.
The CONFIG_BLOCK=n case is fixed by compiling out the block+dax
helpers in the dax core. Configurations with FS_DAX=n EXT4=y / XFS=y
and DAX=m fail due to the helpers the builtin filesystem needs being
in a module, so we stub out the helpers in the FS_DAX=n case."
* 'libnvdimm-for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
dax, xfs, ext4: compile out iomap-dax paths in the FS_DAX=n case
dax: fix false CONFIG_BLOCK dependency
If a malicious user corrupts the refcount btree to cause a cycle between
different levels of the tree, the next mount attempt will deadlock in
the CoW recovery routine while grabbing buffer locks. We can use the
ability to re-grab a buffer that was previous locked to a transaction to
avoid deadlocks, so do that here.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
When moving a merge dir or non-dir with copy up origin into a non-merge
upper dir (a.k.a pure upper dir), we are marking the target parent dir
"impure". ovl_iterate() iterates pure upper dirs directly, because there is
no need to filter out whiteouts and merge dir content with lower dir. But
for the case of an "impure" upper dir, ovl_iterate() will not be able to
iterate the real upper dir directly, because it will need to lookup the
origin inode and use it to fill d_ino.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
On failure to set opaque/redirect xattr on rename, skip setting xattr and
return -EXDEV.
On failure to set opaque xattr when creating a new directory, -EIO is
returned instead of -EOPNOTSUPP.
Any failure to set those xattr will be recorded in super block and
then setting any xattr on upper won't be attempted again.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Mauro says:
This patch series convert the remaining DocBooks to ReST.
The first version was originally
send as 3 patch series:
[PATCH 00/36] Convert DocBook documents to ReST
[PATCH 0/5] Convert more books to ReST
[PATCH 00/13] Get rid of DocBook
The lsm book was added as if it were a text file under
Documentation. The plan is to merge it with another file
under Documentation/security, after both this series and
a security Documentation patch series gets merged.
It also adjusts some Sphinx-pedantic errors/warnings on
some kernel-doc markups.
I also added some patches here to add PDF output for all
existing ReST books.
xattr are needed by overlayfs for setting opaque dir, redirect dir
and copy up origin.
Check at mount time by trying to set the overlay.opaque xattr on the
workdir and if that fails issue a warning message.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Commit 5f7f7543f5 "fuse: Convert to separately allocated bdi" didn't
properly handle fuseblk filesystem. When fuse_bdi_init() is called for
that filesystem type, sb->s_bdi is already initialized (by
set_bdev_super()) to point to block device's bdi and consequently
super_setup_bdi_name() complains about this fact when reseting bdi to
the private one.
Fix the problem by properly dropping bdi reference in fuse_bdi_init()
before creating a private bdi in super_setup_bdi_name().
Fixes: 5f7f7543f5 ("fuse: Convert to separately allocated bdi")
Reported-by: Rakesh Pandit <rakesh@tuxera.com>
Tested-by: Rakesh Pandit <rakesh@tuxera.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
Make sure segno and blkoff read from raw image are valid.
Cc: stable@vger.kernel.org
Signed-off-by: Jin Qian <jinqian@google.com>
[Jaegeuk Kim: adjust minor coding style]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This reverts commit 51f5677777 "nfsd: check for oversized NFSv2/v3
arguments", which breaks support for NFSv3 ACLs.
That patch was actually an earlier draft of a fix for the problem that
was eventually fixed by e6838a29ec "nfsd: check for oversized NFSv2/v3
arguments". But somehow I accidentally left this earlier draft in the
branch that was part of my 2.12 pull request.
Reported-by: Eryu Guan <eguan@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
There were a number of handwaving complaints that one could "possibly"
use inode numbers and extent maps to fingerprint a filesystem hosting
multiple containers and somehow use the information to guess at the
contents of other containers and attack them. Despite the total lack of
any demonstration that this is actually possible, it's easier to
restrict access now and broaden it later, so use the rmapbt fsmap
backends only if the caller has CAP_SYS_ADMIN. Unprivileged users will
just have to make do with only getting the free space and static
metadata placement information.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
By run fsstress long enough time enough in RHEL-7, I find an
assertion failure (harder to reproduce on linux-4.11, but problem
is still there):
XFS: Assertion failed: (iflags & BMV_IF_DELALLOC) != 0, file: fs/xfs/xfs_bmap_util.c
The assertion is in xfs_getbmap() funciton:
if (map[i].br_startblock == DELAYSTARTBLOCK &&
--> map[i].br_startoff <= XFS_B_TO_FSB(mp, XFS_ISIZE(ip)))
ASSERT((iflags & BMV_IF_DELALLOC) != 0);
When map[i].br_startoff == XFS_B_TO_FSB(mp, XFS_ISIZE(ip)), the
startoff is just at EOF. But we only need to make sure delalloc
extents that are within EOF, not include EOF.
Signed-off-by: Zorro Lang <zlang@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reduce stack usage and get rid of compiler warnings by eliminating
unused variables.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
When we're fulfilling a BMAPX request, jump out early if the data fork
is in local format. This prevents us from hitting a debugging check in
bmapi_read and barfing errors back to userspace. The on-disk extent
count check later isn't sufficient for IF_DELALLOC mode because da
extents are in memory and not on disk.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
The delalloc -> real block conversion path uses an incorrect
calculation in the case where the middle part of a delalloc extent
is being converted. This is documented as a rare situation because
XFS generally attempts to maximize contiguity by converting as much
of a delalloc extent as possible.
If this situation does occur, the indlen reservation for the two new
delalloc extents left behind by the conversion of the middle range
is calculated and compared with the original reservation. If more
blocks are required, the delta is allocated from the global block
pool. This delta value can be characterized as the difference
between the new total requirement (temp + temp2) and the currently
available reservation minus those blocks that have already been
allocated (startblockval(PREV.br_startblock) - allocated).
The problem is that the current code does not account for previously
allocated blocks correctly. It subtracts the current allocation
count from the (new - old) delta rather than the old indlen
reservation. This means that more indlen blocks than have been
allocated end up stashed in the remaining extents and free space
accounting is broken as a result.
Fix up the calculation to subtract the allocated block count from
the original extent indlen and thus correctly allocate the
reservation delta based on the difference between the new total
requirement and the unused blocks from the original reservation.
Also remove a bogus assert that contradicts the fact that the new
indlen reservation can be larger than the original indlen
reservation.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The setting of return code ret should be based on the error code
passed into function end_extent_writepage and not on ret. Thanks
to Liu Bo for spotting this mistake in the original fix I submitted.
Detected by CoverityScan, CID#1414312 ("Logically dead code")
Fixes: 5dca6eea91 ("Btrfs: mark mapping with error flag to report errors to userspace")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit b685d3d65a "block: treat REQ_FUA and REQ_PREFLUSH as
synchronous" removed REQ_SYNC flag from WRITE_{FUA|PREFLUSH|...}
definitions. generic_make_request_checks() however strips REQ_FUA and
REQ_PREFLUSH flags from a bio when the storage doesn't report volatile
write cache and thus write effectively becomes asynchronous which can
lead to performance regressions
Fix the problem by making sure all bios which are synchronous are
properly marked with REQ_SYNC.
CC: David Sterba <dsterba@suse.com>
CC: linux-btrfs@vger.kernel.org
Fixes: b685d3d65a
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
Cycle mount btrfs can cause fiemap to return different result.
Like:
# mount /dev/vdb5 /mnt/btrfs
# dd if=/dev/zero bs=16K count=4 oflag=dsync of=/mnt/btrfs/file
# xfs_io -c "fiemap -v" /mnt/btrfs/file
/mnt/test/file:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..127]: 25088..25215 128 0x1
# umount /mnt/btrfs
# mount /dev/vdb5 /mnt/btrfs
# xfs_io -c "fiemap -v" /mnt/btrfs/file
/mnt/test/file:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..31]: 25088..25119 32 0x0
1: [32..63]: 25120..25151 32 0x0
2: [64..95]: 25152..25183 32 0x0
3: [96..127]: 25184..25215 32 0x1
But after above fiemap, we get correct merged result if we call fiemap
again.
# xfs_io -c "fiemap -v" /mnt/btrfs/file
/mnt/test/file:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..127]: 25088..25215 128 0x1
[REASON]
Btrfs will try to merge extent map when inserting new extent map.
btrfs_fiemap(start=0 len=(u64)-1)
|- extent_fiemap(start=0 len=(u64)-1)
|- get_extent_skip_holes(start=0 len=64k)
| |- btrfs_get_extent_fiemap(start=0 len=64k)
| |- btrfs_get_extent(start=0 len=64k)
| | Found on-disk (ino, EXTENT_DATA, 0)
| |- add_extent_mapping()
| |- Return (em->start=0, len=16k)
|
|- fiemap_fill_next_extent(logic=0 phys=X len=16k)
|
|- get_extent_skip_holes(start=0 len=64k)
| |- btrfs_get_extent_fiemap(start=0 len=64k)
| |- btrfs_get_extent(start=16k len=48k)
| | Found on-disk (ino, EXTENT_DATA, 16k)
| |- add_extent_mapping()
| | |- try_merge_map()
| | Merge with previous em start=0 len=16k
| | resulting em start=0 len=32k
| |- Return (em->start=0, len=32K) << Merged result
|- Stripe off the unrelated range (0~16K) of return em
|- fiemap_fill_next_extent(logic=16K phys=X+16K len=16K)
^^^ Causing split fiemap extent.
And since in add_extent_mapping(), em is already merged, in next
fiemap() call, we will get merged result.
[FIX]
Here we introduce a new structure, fiemap_cache, which records previous
fiemap extent.
And will always try to merge current fiemap_cache result before calling
fiemap_fill_next_extent().
Only when we failed to merge current fiemap extent with cached one, we
will call fiemap_fill_next_extent() to submit cached one.
So by this method, we can merge all fiemap extents.
It can also be done in fs/ioctl.c, however the problem is if
fieinfo->fi_extents_max == 0, we have no space to cache previous fiemap
extent.
So I choose to merge it in btrfs.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The filesystem documentation was moved from DocBook to
Documentation/filesystems/. Update it at the sources.
Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
kernel-doc will try to interpret a foo() string, except if
properly escaped.
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
Sphinx require explicit tags in order to use a list of possible
values, otherwise it produces this error:
./fs/eventfd.c:219: WARNING: Option list ends without a blank line; unexpected unindent.
Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
Sphinx gets confused when it finds identation without a
good reason for it and without a preceding blank line:
./fs/mpage.c:347: ERROR: Unexpected indentation.
./fs/namei.c:4303: ERROR: Unexpected indentation.
./fs/fs-writeback.c:2060: ERROR: Unexpected indentation.
No functional changes.
Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
kernel-doc script expects that a function documentation to
be just before the function, otherwise it will be ignored.
So, move the kernel-doc markup to the right place.
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
Pull cifs fixes from Steve French:
"A set of minor cifs fixes"
* 'for-next' of git://git.samba.org/sfrench/cifs-2.6:
[CIFS] Minor cleanup of xattr query function
fs: cifs: transport: Use time_after for time comparison
SMB2: Fix share type handling
cifs: cifsacl: Use a temporary ops variable to reduce code length
Don't delay freeing mids when blocked on slow socket write of request
CIFS: silence lockdep splat in cifs_relock_file()
nfsd4_ops contains function pointers, and marking it as constant avoids
it being able to be used as an attach vector for code injections.
Signed-off-by: Christoph Hellwig <hch@lst.de>
struct svc_procinfo contains function pointers, and marking it as
constant avoids it being able to be used as an attach vector for
code injections.
Signed-off-by: Christoph Hellwig <hch@lst.de>
pc_count is the only writeable memeber of struct svc_procinfo, which is
a good candidate to be const-ified as it contains function pointers.
This patch moves it into out out struct svc_procinfo, and into a
separate writable array that is pointed to by struct svc_version.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Pass union nfsd4_op_u to the op_func callbacks instead of using unsafe
function pointer casts.
It also adds two missing structures to struct nfsd4_op.u to facilitate
this.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Except for a lot of unnecessary casts this typedef only has one user,
so remove the casts and expand it in struct nfsd4_operation.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Pass union nfsd4_op_u to the op_set_currentstateid callbacks instead of
using unsafe function pointer casts.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Given the args union in struct nfsd4_op a name, and pass it to the
op_set_currentstateid callbacks instead of using unsafe function
pointer casts.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Drop the resp argument as it can trivially be derived from the rqstp
argument. With that all functions now have the same prototype, and we
can remove the unsafe casting to kxdrproc_t.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Trond Myklebust <trond.myklebust@primarydata.com>
Drop the argp argument as it can trivially be derived from the rqstp
argument. With that all functions now have the same prototype, and we
can remove the unsafe casting to kxdrproc_t.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Drop the p and resp arguments as they are always NULL or can trivially
be derived from the rqstp argument. With that all functions now have the
same prototype, and we can remove the unsafe casting to kxdrproc_t.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Drop the argp and resp arguments as they can trivially be derived from
the rqstp argument. With that all functions now have the same prototype,
and we can remove the unsafe casting to svc_procfunc as well as the
svc_procfunc typedef itself.
Signed-off-by: Christoph Hellwig <hch@lst.de>
struct rpc_procinfo contains function pointers, and marking it as
constant avoids it being able to be used as an attach vector for
code injections.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Trond Myklebust <trond.myklebust@primarydata.com>
p_count is the only writeable memeber of struct rpc_procinfo, which is
a good candidate to be const-ified as it contains function pointers.
This patch moves it into out out struct rpc_procinfo, and into a
separate writable array that is pointed to by struct rpc_version and
indexed by p_statidx.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Remove double indentation of a few struct rpc_version and
struct rpc_program instance.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Trond Myklebust <trond.myklebust@primarydata.com>
Instead declare all functions with the proper methods signature.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Trond Myklebust <trond.myklebust@primarydata.com>
Declare the p_decode callbacks with the proper prototype instead of
casting to kxdrdproc_t and losing all type safety.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Trond Myklebust <trond.myklebust@primarydata.com>
Declare the p_decode callbacks with the proper prototype instead of
casting to kxdrdproc_t and losing all type safety.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Trond Myklebust <trond.myklebust@primarydata.com>
Declare the p_decode callbacks with the proper prototype instead of
casting to kxdrdproc_t and losing all type safety.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Declare the p_encode callbacks with the proper prototype instead of
casting to kxdreproc_t and losing all type safety.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Declare the p_encode callbacks with the proper prototype instead of
casting to kxdreproc_t and losing all type safety.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Trond Myklebust <trond.myklebust@primarydata.com>
Declare the p_encode callbacks with the proper prototype instead of
casting to kxdreproc_t and losing all type safety.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Trond Myklebust <trond.myklebust@primarydata.com>
We get a link error when EXPORTFS is not enabled:
ERROR: "exportfs_encode_fh" [fs/overlayfs/overlay.ko] undefined!
ERROR: "exportfs_decode_fh" [fs/overlayfs/overlay.ko] undefined!
This adds a Kconfig 'select' statement for overlayfs, the same way that
it is done for the other users of exportfs.
Fixes: 3a1e819b4e ("ovl: store file handle of lower inode on copy up")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Tetsuo reports:
fs/built-in.o: In function `xfs_file_iomap_end':
xfs_iomap.c:(.text+0xe0ef9): undefined reference to `put_dax'
fs/built-in.o: In function `xfs_file_iomap_begin':
xfs_iomap.c:(.text+0xe1a7f): undefined reference to `dax_get_by_host'
make: *** [vmlinux] Error 1
$ grep DAX .config
CONFIG_DAX=m
# CONFIG_DEV_DAX is not set
# CONFIG_FS_DAX is not set
When FS_DAX=n we can/must throw away the dax code in filesystems.
Implement 'fs_' versions of dax_get_by_host() and put_dax() that are
nops in the FS_DAX=n case.
Cc: <linux-xfs@vger.kernel.org>
Cc: <linux-ext4@vger.kernel.org>
Cc: Jan Kara <jack@suse.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Tested-by: Tony Luck <tony.luck@intel.com>
Fixes: ef51042472 ("block, dax: move 'select DAX' from BLOCK to FS_DAX")
Reported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Merge misc fixes from Andrew Morton:
"15 fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
mm, docs: update memory.stat description with workingset* entries
mm: vmscan: scan until it finds eligible pages
mm, thp: copying user pages must schedule on collapse
dax: fix PMD data corruption when fault races with write
dax: fix data corruption when fault races with write
ext4: return to starting transaction in ext4_dax_huge_fault()
mm: fix data corruption due to stale mmap reads
dax: prevent invalidation of mapped DAX entries
Tigran has moved
mm, vmalloc: fix vmalloc users tracking properly
mm/khugepaged: add missed tracepoint for collapse_huge_page_swapin
gcov: support GCC 7.1
mm, vmstat: Remove spurious WARN() during zoneinfo print
time: delete current_fs_time()
hwpoison, memcg: forcibly uncharge LRU pages
Some minor cleanup of cifs query xattr functions (will also make
SMB3 xattr implementation cleaner as well).
Signed-off-by: Steve French <steve.french@primarydata.com>
Use time_after kernel macro for time comparison
that has safety check.
Signed-off-by: Karim Eshapa <karim.eshapa@gmail.com>
Signed-off-by: Steve French <smfrench@gmail.com>
In fs/cifs/smb2pdu.h, we have:
#define SMB2_SHARE_TYPE_DISK 0x01
#define SMB2_SHARE_TYPE_PIPE 0x02
#define SMB2_SHARE_TYPE_PRINT 0x03
Knowing that, with the current code, the SMB2_SHARE_TYPE_PRINT case can
never trigger and printer share would be interpreted as disk share.
So, test the ShareType value for equality instead.
Fixes: faaf946a7d ("CIFS: Add tree connect/disconnect capability for SMB2")
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Acked-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Create an ops variable to store tcon->ses->server->ops and cache
indirections and reduce code size a trivial bit.
$ size fs/cifs/cifsacl.o*
text data bss dec hex filename
5338 136 8 5482 156a fs/cifs/cifsacl.o.new
5371 136 8 5515 158b fs/cifs/cifsacl.o.old
Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Signed-off-by: Steve French <smfrench@gmail.com>
This is based on a patch from Jan Kara that fixed the equivalent race in
the DAX PTE fault path.
Currently DAX PMD read fault can race with write(2) in the following
way:
CPU1 - write(2) CPU2 - read fault
dax_iomap_pmd_fault()
->iomap_begin() - sees hole
dax_iomap_rw()
iomap_apply()
->iomap_begin - allocates blocks
dax_iomap_actor()
invalidate_inode_pages2_range()
- there's nothing to invalidate
grab_mapping_entry()
- we add huge zero page to the radix tree
and map it to page tables
The result is that hole page is mapped into page tables (and thus zeros
are seen in mmap) while file has data written in that place.
Fix the problem by locking exception entry before mapping blocks for the
fault. That way we are sure invalidate_inode_pages2_range() call for
racing write will either block on entry lock waiting for the fault to
finish (and unmap stale page tables after that) or read fault will see
already allocated blocks by write(2).
Fixes: 9f141d6ef6 ("dax: Call ->iomap_begin without entry lock during dax fault")
Link: http://lkml.kernel.org/r/20170510172700.18991-1-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently DAX read fault can race with write(2) in the following way:
CPU1 - write(2) CPU2 - read fault
dax_iomap_pte_fault()
->iomap_begin() - sees hole
dax_iomap_rw()
iomap_apply()
->iomap_begin - allocates blocks
dax_iomap_actor()
invalidate_inode_pages2_range()
- there's nothing to invalidate
grab_mapping_entry()
- we add zero page in the radix tree
and map it to page tables
The result is that hole page is mapped into page tables (and thus zeros
are seen in mmap) while file has data written in that place.
Fix the problem by locking exception entry before mapping blocks for the
fault. That way we are sure invalidate_inode_pages2_range() call for
racing write will either block on entry lock waiting for the fault to
finish (and unmap stale page tables after that) or read fault will see
already allocated blocks by write(2).
Fixes: 9f141d6ef6
Link: http://lkml.kernel.org/r/20170510085419.27601-5-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
DAX will return to locking exceptional entry before mapping blocks for a
page fault to fix possible races with concurrent writes. To avoid lock
inversion between exceptional entry lock and transaction start, start
the transaction already in ext4_dax_huge_fault().
Fixes: 9f141d6ef6
Link: http://lkml.kernel.org/r/20170510085419.27601-4-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently, we didn't invalidate page tables during invalidate_inode_pages2()
for DAX. That could result in e.g. 2MiB zero page being mapped into
page tables while there were already underlying blocks allocated and
thus data seen through mmap were different from data seen by read(2).
The following sequence reproduces the problem:
- open an mmap over a 2MiB hole
- read from a 2MiB hole, faulting in a 2MiB zero page
- write to the hole with write(3p). The write succeeds but we
incorrectly leave the 2MiB zero page mapping intact.
- via the mmap, read the data that was just written. Since the zero
page mapping is still intact we read back zeroes instead of the new
data.
Fix the problem by unconditionally calling invalidate_inode_pages2_range()
in dax_iomap_actor() for new block allocations and by properly
invalidating page tables in invalidate_inode_pages2_range() for DAX
mappings.
Fixes: c6dcf52c23
Link: http://lkml.kernel.org/r/20170510085419.27601-3-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "mm,dax: Fix data corruption due to mmap inconsistency",
v4.
This series fixes data corruption that can happen for DAX mounts when
page faults race with write(2) and as a result page tables get out of
sync with block mappings in the filesystem and thus data seen through
mmap is different from data seen through read(2).
The series passes testing with t_mmap_stale test program from Ross and
also other mmap related tests on DAX filesystem.
This patch (of 4):
dax_invalidate_mapping_entry() currently removes DAX exceptional entries
only if they are clean and unlocked. This is done via:
invalidate_mapping_pages()
invalidate_exceptional_entry()
dax_invalidate_mapping_entry()
However, for page cache pages removed in invalidate_mapping_pages()
there is an additional criteria which is that the page must not be
mapped. This is noted in the comments above invalidate_mapping_pages()
and is checked in invalidate_inode_page().
For DAX entries this means that we can can end up in a situation where a
DAX exceptional entry, either a huge zero page or a regular DAX entry,
could end up mapped but without an associated radix tree entry. This is
inconsistent with the rest of the DAX code and with what happens in the
page cache case.
We aren't able to unmap the DAX exceptional entry because according to
its comments invalidate_mapping_pages() isn't allowed to block, and
unmap_mapping_range() takes a write lock on the mapping->i_mmap_rwsem.
Since we essentially never have unmapped DAX entries to evict from the
radix tree, just remove dax_invalidate_mapping_entry().
Fixes: c6dcf52c23 ("mm: Invalidate DAX radix tree entries only if appropriate")
Link: http://lkml.kernel.org/r/20170510085419.27601-2-jack@suse.cz
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Reported-by: Jan Kara <jack@suse.cz>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: <stable@vger.kernel.org> [4.10+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull libnvdimm fixes from Dan Williams:
"Incremental fixes and a small feature addition on top of the main
libnvdimm 4.12 pull request:
- Geert noticed that tinyconfig was bloated by BLOCK selecting DAX.
The size regression is fixed by moving all dax helpers into the
dax-core and only specifying "select DAX" for FS_DAX and
dax-capable drivers. He also asked for clarification of the
NR_DEV_DAX config option which, on closer look, does not need to be
a config option at all. Mike also throws in a DEV_DAX_PMEM fixup
for good measure.
- Ben's attention to detail on -stable patch submissions caught a
case where the recent fixes to arch_copy_from_iter_pmem() missed a
condition where we strand dirty data in the cache. This is tagged
for -stable and will also be included in the rework of the pmem api
to a proposed {memcpy,copy_user}_flushcache() interface for 4.13.
- Vishal adds a feature that missed the initial pull due to pending
review feedback. It allows the kernel to clear media errors when
initializing a BTT (atomic sector update driver) instance on a pmem
namespace.
- Ross noticed that the dax_device + dax_operations conversion broke
__dax_zero_page_range(). The nvdimm unit tests fail to check this
path, but xfstests immediately trips over it. No excuse for missing
this before submitting the 4.12 pull request.
These all pass the nvdimm unit tests and an xfstests spot check. The
set has received a build success notification from the kbuild robot"
* 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
filesystem-dax: fix broken __dax_zero_page_range() conversion
libnvdimm, btt: ensure that initializing metadata clears poison
libnvdimm: add an atomic vs process context flag to rw_bytes
x86, pmem: Fix cache flushing for iovec write < 8 bytes
device-dax: kill NR_DEV_DAX
block, dax: move "select DAX" from BLOCK to FS_DAX
device-dax: Tell kbuild DEV_DAX_PMEM depends on DEV_DAX
Pull misc vfs updates from Al Viro:
"Making sure that something like a referral point won't end up as pwd
or root.
The main part is the last commit (fixing mntns_install()); that one
fixes a hard-to-hit race. The fchdir() commit is making fchdir(2) a
bit more robust - it should be impossible to get opened files (even
O_PATH ones) for referral points in the first place, so the existing
checks are OK, but checking the same thing as in chdir(2) is just as
cheap.
The path_init() commit removes a redundant check that shouldn't have
been there in the first place"
* 'work.sane_pwd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
make sure that mntns_install() doesn't end up with referral for root
path_init(): don't bother with checking MAY_EXEC for LOOKUP_ROOT
make sure that fchdir() won't accept referral points, etc.
NAND, from Boris:
"""
- some minor fixes/improvements on existing drivers (fsmc, gpio, ifc,
davinci, brcmnand, omap)
- a huge cleanup/rework of the denali driver accompanied with core
fixes/improvements to simplify the driver code
- a complete rewrite of the atmel driver to support new DT bindings
make future evolution easier
- the addition of per-vendor detection/initialization steps to avoid
extending the nand_ids table with more extended-id entries
"""
SPI NOR, from Cyrille:
"""
- fixes in the hisi SPI controller driver.
- fixes in the intel SPI controller driver.
- fixes in the Mediatek SPI controller driver.
- fixes to some SPI flash memories not supported the Chip Erase command.
- add support to some new memory parts (Winbond, Macronix, Micron, ESMT).
- add new driver for the STM32 QSPI controller.
"""
And a few fixes for Gemini and Versatile platforms on physmap-of
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJZE86yAAoJEFySrpd9RFgtlOoP/1o1s8dlKdd4TazdoxBTL2wy
C4wPkqPWyfREcD5ZUYJgr6ENI2OnEwcAxAt2CXnqegx+ZIPToBW4/WK9gj/TNLRx
AfSOz+EPPzo5uZwJPnfocgIFYuhsspymvmISwv66kPbjfkrSjo1l/K9nem3gh7an
IkQdVVq8brvxNeDZOAzbsT2Y5DZNfs00g1jLXkcQrpfM0sWKcbHIUa0BTWy4WKGV
ElTr+xh7QHh/Pd9/A5znd3xX54w5+YR/xe38jSBfTb0vEgw/RIfhIcnvxQ8G/7Se
jE0+8GR5ZJGKwA9Xk5nFzS2G3uECMFNS75KfxkZ0LlEE6ivUvpDbokCbIU4bDOCt
/8bWQf9AGA3gLHGgNUQTSt5HrkBXTGp917jtAZbI/y2MzTkLw3aAZ/m/j37vv9ON
ezeGRO6VWK3bcimLFrt6KO5emYstmm4Tp4rRe3jakH7eyTlINDsecKtuMo2xVzyZ
kK3tnDMdEntECAiKh3ndRdAUL3fs+/IdzWTAxnF9VQFQs1YxiZ1K8kY/zcN+rzbn
CVkEhdm+tdDBx8XgOdfnOTGRAJ07dGOoDhLPR4/egC/ta6GIRkHQjFSwsW7bD9p9
phHH6nQX9Bpza1JV/xvljezoHjvZkny4UhRpLgYMowb41DXv7os7ZV+g7kf5sd0i
mGzCH46j0DmWQ1u5/Q6j
=dxj5
-----END PGP SIGNATURE-----
Merge tag 'for-linus-20170510' of git://git.infradead.org/linux-mtd
Pull MTD updates from Brian Norris:
"NAND, from Boris:
- some minor fixes/improvements on existing drivers (fsmc, gpio, ifc,
davinci, brcmnand, omap)
- a huge cleanup/rework of the denali driver accompanied with core
fixes/improvements to simplify the driver code
- a complete rewrite of the atmel driver to support new DT bindings
make future evolution easier
- the addition of per-vendor detection/initialization steps to avoid
extending the nand_ids table with more extended-id entries
SPI NOR, from Cyrille:
- fixes in the hisi, intel and Mediatek SPI controller drivers
- fixes to some SPI flash memories not supporting the Chip Erase
command.
- add support to some new memory parts (Winbond, Macronix, Micron,
ESMT).
- add new driver for the STM32 QSPI controller
And a few fixes for Gemini and Versatile platforms on physmap-of"
* tag 'for-linus-20170510' of git://git.infradead.org/linux-mtd: (100 commits)
MAINTAINERS: Update NAND subsystem git repositories
mtd: nand: gpio: update binding
mtd: nand: add ooblayout for old hamming layout
mtd: oxnas_nand: Allocating more than necessary in probe()
dt-bindings: mtd: Document the STM32 QSPI bindings
mtd: mtk-nor: set controller's address width according to nor flash
mtd: spi-nor: add driver for STM32 quad spi flash controller
mtd: nand: brcmnand: Check flash #WP pin status before nand erase/program
mtd: nand: davinci: add comment on NAND subpage write status on keystone
mtd: nand: omap2: Fix partition creation via cmdline mtdparts
mtd: nand: NULL terminate a of_device_id table
mtd: nand: Fix a couple error codes
mtd: nand: allow drivers to request minimum alignment for passed buffer
mtd: nand: allocate aligned buffers if NAND_OWN_BUFFERS is unset
mtd: nand: denali: allow to override revision number
mtd: nand: denali_dt: use pdev instead of ofdev for platform_device
mtd: nand: denali_dt: remove dma-mask DT property
mtd: nand: denali: support 64bit capable DMA engine
mtd: nand: denali_dt: enable HW_ECC_FIXUP for Altera SOCFPGA variant
mtd: nand: denali: support HW_ECC_FIXUP capability
...
The conversion of __dax_zero_page_range() to 'struct dax_operations'
caused it to frequently fail. The mistake was treating the @size
parameter as a dax mapping length rather than just a length of the
clear_pmem() operation. The dax mapping length is assumed to be hard
coded as PAGE_SIZE.
Without this fix any page unaligned zeroing request will trigger a
-EINVAL return from bdev_dax_pgoff().
Cc: Jan Kara <jack@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Reported-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Tested-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Fixes: cccbce6715 ("filesystem-dax: convert to dax_direct_access()")
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
-----BEGIN PGP SIGNATURE-----
iQIVAwUAWPiW6vSw1s6N8H32AQLOrw/+NTqGf7bjq+64YKS6NfR0XDgE+wNJltGO
ck7zJW3NHIg76RNu8s0I9xg5aVmwizz3Z5DGROZquaolnezux4tQihZ3AFyxIzLc
+Y3WHYagcML7yFfjl/WznCLRD5EW3yPln4lCvQO0nW/xICRYeRI057JaIbi2Dtek
BhcXt3c4AjXDLdYJkgtHV3p2R2mt8hcdFdWqqx6s7JaIThZNRGNzxAgtbcB9k5IW
HVG9ZEIL73VBYWHrYivzjHYF5rBnNCPt87eOwDQeTOSkhv8te+u9k+bH8vxZw1T0
XUtDrLBndKiuVo2GUfLkkF8LItx3Q9eLCJYy0joaIliyPqTEsPx9KjQ+Af0cxS9s
ZPCZ5SYf96stKmDeL5xaMfrAmeyVHJ4lc4JTOqdzbIT8blsOSfYO/03p0ALShSDv
/RQLaKGlf8Bjoy8PwKFcXb4sIDufcd/U1Av/EMFXxOfgN/u2JUkGKq6EaIM5B68L
fHPje+aR9VNELPmPjwNOWtmN4I79EH3EItQf7zv0KG+UeKhcHLx/EAcSJ3ZRKEkH
Lathg7pPOEJGArPiVO79TZzBG01ADn1aiwv65XObMzNZ+54xI/mN/Y1DNF/kL5jU
XzvNzEjFt8mwMIZGVNdAt4+pDyMfIZGZSyUkSRKFnaQZMIvQrfQIU9RLBYLX5eOx
+/p0VkIwDpg=
=lbS7
-----END PGP SIGNATURE-----
Merge tag 'hwparam-20170420' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
Pull hw lockdown support from David Howells:
"Annotation of module parameters that configure hardware resources
including ioports, iomem addresses, irq lines and dma channels.
This allows a future patch to prohibit the use of such module
parameters to prevent that hardware from being abused to gain access
to the running kernel image as part of locking the kernel down under
UEFI secure boot conditions.
Annotations are made by changing:
module_param(n, t, p)
module_param_named(n, v, t, p)
module_param_array(n, t, m, p)
to:
module_param_hw(n, t, hwtype, p)
module_param_hw_named(n, v, t, hwtype, p)
module_param_hw_array(n, t, hwtype, m, p)
where the module parameter refers to a hardware setting
hwtype specifies the type of the resource being configured. This can
be one of:
ioport Module parameter configures an I/O port
iomem Module parameter configures an I/O mem address
ioport_or_iomem Module parameter could be either (runtime set)
irq Module parameter configures an I/O port
dma Module parameter configures a DMA channel
dma_addr Module parameter configures a DMA buffer address
other Module parameter configures some other value
Note that the hwtype is compile checked, but not currently stored (the
lockdown code probably won't require it). It is, however, there for
future use.
A bonus is that the hwtype can also be used for grepping.
The intention is for the kernel to ignore or reject attempts to set
annotated module parameters if lockdown is enabled. This applies to
options passed on the boot command line, passed to insmod/modprobe or
direct twiddling in /sys/module/ parameter files.
The module initialisation then needs to handle the parameter not being
set, by (1) giving an error, (2) probing for a value or (3) using a
reasonable default.
What I can't do is just reject a module out of hand because it may
take a hardware setting in the module parameters. Some important
modules, some ipmi stuff for instance, both probe for hardware and
allow hardware to be manually specified; if the driver is aborts with
any error, you don't get any ipmi hardware.
Further, trying to do this entirely in the module initialisation code
doesn't protect against sysfs twiddling.
[!] Note that in and of itself, this series of patches should have no
effect on the the size of the kernel or code execution - that is
left to a patch in the next series to effect. It does mark
annotated kernel parameters with a KERNEL_PARAM_FL_HWPARAM flag in
an already existing field"
* tag 'hwparam-20170420' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs: (38 commits)
Annotate hardware config module parameters in sound/pci/
Annotate hardware config module parameters in sound/oss/
Annotate hardware config module parameters in sound/isa/
Annotate hardware config module parameters in sound/drivers/
Annotate hardware config module parameters in fs/pstore/
Annotate hardware config module parameters in drivers/watchdog/
Annotate hardware config module parameters in drivers/video/
Annotate hardware config module parameters in drivers/tty/
Annotate hardware config module parameters in drivers/staging/vme/
Annotate hardware config module parameters in drivers/staging/speakup/
Annotate hardware config module parameters in drivers/staging/media/
Annotate hardware config module parameters in drivers/scsi/
Annotate hardware config module parameters in drivers/pcmcia/
Annotate hardware config module parameters in drivers/pci/hotplug/
Annotate hardware config module parameters in drivers/parport/
Annotate hardware config module parameters in drivers/net/wireless/
Annotate hardware config module parameters in drivers/net/wan/
Annotate hardware config module parameters in drivers/net/irda/
Annotate hardware config module parameters in drivers/net/hamradio/
Annotate hardware config module parameters in drivers/net/ethernet/
...
bugfixes.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJZE2UeAAoJECebzXlCjuG+St8P/0vG+ps9sY012E6Wh9gy4Ev4
BtxG/c3CtcxrbNzW+cFhdEloBGtC0VvcrKNCozJTK4LdaPYErkyRBpjgXvIggT9I
GWY4ftpH3eJ6uByN9Okgc3/1la2poDflJO/nYhdRed3YHOnXTtx/746tu1xAnVCV
tFtDGrbJZTprt5c3zETtdtquCUSy2aMT5ZbrdU3yWBCwQMNSIufN3an8epfB++xx
Ct+G0HTRffcWAdYuLT0N1HKqm8pkncdNMFpm7mVw0hMCRy552G3fuj8LtkhVTvKE
1KN3zXY4jhaUYWD5Yt6AJcpLEro65b8swYk4e9FP2TNUpCmuRdXT9cb9vE8YztxC
8s4N23RHaEx9I6pC3OU64a2HfhiQM/oOIvjlhTBsjojXsQcqZFD1vsoSYA8Byl0w
m9EQWqPqge4m6yEYl7uAyL6xSthbrhcU1Ks5jvNXGcWzEQj7BATnynJANsfZ+y6r
ZoVcsRNX49m1BG+p9br+9DFffPiNFUMqxbfr73L9HRep3OsPeFKazFG0bKd3hOqA
E6L/AnBd9soSqTuTvbisWrGWbomhtd5G/fAa1uHrWTPHMXUWCmkguiau51FNfcHu
xcJlBBVCvUmmd5u3wF6QeiyjPs4KEBzQzsOUsWKHRxDBp6s+5PX/lHuXRBlDP+fN
TQq0KbvBtea1OyMaRtoV
=Rtl/
-----END PGP SIGNATURE-----
Merge tag 'nfsd-4.12' of git://linux-nfs.org/~bfields/linux
Pull nfsd updates from Bruce Fields:
"Another RDMA update from Chuck Lever, and a bunch of miscellaneous
bugfixes"
* tag 'nfsd-4.12' of git://linux-nfs.org/~bfields/linux: (26 commits)
nfsd: Fix up the "supattr_exclcreat" attributes
nfsd: encoders mustn't use unitialized values in error cases
nfsd: fix undefined behavior in nfsd4_layout_verify
lockd: fix lockd shutdown race
NFSv4: Fix callback server shutdown
SUNRPC: Refactor svc_set_num_threads()
NFSv4.x/callback: Create the callback service through svc_create_pooled
lockd: remove redundant check on block
svcrdma: Clean out old XDR encoders
svcrdma: Remove the req_map cache
svcrdma: Remove unused RDMA Write completion handler
svcrdma: Reduce size of sge array in struct svc_rdma_op_ctxt
svcrdma: Clean up RPC-over-RDMA backchannel reply processing
svcrdma: Report Write/Reply chunk overruns
svcrdma: Clean up RDMA_ERROR path
svcrdma: Use rdma_rw API in RPC reply path
svcrdma: Introduce local rdma_rw API helpers
svcrdma: Clean up svc_rdma_get_inv_rkey()
svcrdma: Add helper to save pages under I/O
svcrdma: Eliminate RPCRDMA_SQ_DEPTH_MULT
...
Highlights include:
Stable bugfixes:
- Fix use after free in write error path
- Use GFP_NOIO for two allocations in writeback
- Fix a hang in OPEN related to server reboot
- Check the result of nfs4_pnfs_ds_connect
- Fix an rcu lock leak
Features:
- Removal of the unmaintained and unused OSD pNFS layout
- Cleanup and removal of lots of unnecessary dprintk()s
- Cleanup and removal of some memory failure paths now that
GFP_NOFS is guaranteed to never fail.
- Remove the v3-only data server limitation on pNFS/flexfiles
Bugfixes:
- RPC/RDMA connection handling bugfixes
- Copy offload: fixes to ensure the copied data is COMMITed to disk.
- Readdir: switch back to using the ->iterate VFS interface
- File locking fixes from Ben Coddington
- Various use-after-free and deadlock issues in pNFS
- Write path bugfixes
-----BEGIN PGP SIGNATURE-----
iQIbBAABAgAGBQJZE0KiAAoJEGcL54qWCgDy/moP93wZ+cGnN5sC+nsirqj4eUiM
BQKKonweNQIoYRwp5B9jLTxsUMIxRasV5W3BbqEm4PUtBYXfqQ7SfLv7RboKbd4M
RJB9PS+sjx3Fxf65mhveKziwUFLvQCQ3+we0TpUga6+7SBiGlgPKBfisk7frC0nt
BbYBuGaWXMPxO0BnR8adNwqiGINPDSzB+8sgjiT8zkZLm4lrew2eV7TDvwVOguD+
S2vLPGhg1F9wu8aG731MgiSNaeCgsBP6I5D29fTTD7z1DCNMQXOoHcX8k4KwwIDB
sHRR0tVBsg+1B7WdH4y41GQ03rn3o2DHeJB5cdYGaEu4lx7CecCzt0o0dfAkNizT
5LxbQxIHPNYMeZmP2T0oD41zQyfjKqrdRSPnXi3dPD98NwaM1Lqv+Kzb/eXzupXp
vJ7859PQCa3KjQ1IFhwdXTmh53J1c8SzEDpzz7WX0R0saRyxeIJsm30MmdPqKu7Z
notjsXxrTmjIhC+0vFLey1kejFDh+b0gT6UIwoMdx39VL9AM6DVL7HsrU1kEwCdf
f8otaLcm0WoUaseF+cMtfRNGEqCMxPywwz7mEKlGiVZgyAM8VfzH+s5j6/u6ncwS
ASwRclwwPAZN97rzl0exZxuaRwFZd7oFT1zrviPWvv+0SUPuy258J6QpolUSavgi
Qh7f3QR65K+QX9QbO1g=
=7Nm2
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-4.12-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client updates from Trond Myklebust:
"Highlights include:
Stable bugfixes:
- Fix use after free in write error path
- Use GFP_NOIO for two allocations in writeback
- Fix a hang in OPEN related to server reboot
- Check the result of nfs4_pnfs_ds_connect
- Fix an rcu lock leak
Features:
- Removal of the unmaintained and unused OSD pNFS layout
- Cleanup and removal of lots of unnecessary dprintk()s
- Cleanup and removal of some memory failure paths now that GFP_NOFS
is guaranteed to never fail.
- Remove the v3-only data server limitation on pNFS/flexfiles
Bugfixes:
- RPC/RDMA connection handling bugfixes
- Copy offload: fixes to ensure the copied data is COMMITed to disk.
- Readdir: switch back to using the ->iterate VFS interface
- File locking fixes from Ben Coddington
- Various use-after-free and deadlock issues in pNFS
- Write path bugfixes"
* tag 'nfs-for-4.12-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (89 commits)
pNFS/flexfiles: Always attempt to call layoutstats when flexfiles is enabled
NFSv4.1: Work around a Linux server bug...
NFS append COMMIT after synchronous COPY
NFSv4: Fix exclusive create attributes encoding
NFSv4: Fix an rcu lock leak
nfs: use kmap/kunmap directly
NFS: always treat the invocation of nfs_getattr as cache hit when noac is on
Fix nfs_client refcounting if kmalloc fails in nfs4_proc_exchange_id and nfs4_proc_async_renew
NFSv4.1: RECLAIM_COMPLETE must handle NFS4ERR_CONN_NOT_BOUND_TO_SESSION
pNFS: Fix NULL dereference in pnfs_generic_alloc_ds_commits
pNFS: Fix a typo in pnfs_generic_alloc_ds_commits
pNFS: Fix a deadlock when coalescing writes and returning the layout
pNFS: Don't clear the layout return info if there are segments to return
pNFS: Ensure we commit the layout if it has been invalidated
pNFS: Don't send COMMITs to the DSes if the server invalidated our layout
pNFS/flexfiles: Fix up the ff_layout_write_pagelist failure path
pNFS: Ensure we check layout validity before marking it for return
NFS4.1 handle interrupted slot reuse from ERR_DELAY
NFSv4: check return value of xdr_inline_decode
nfs/filelayout: fix NULL pointer dereference in fl_pnfs_update_layout()
...
If an NFSv4 client asks us for the supattr_exclcreat, then we must
not return attributes that are unsupported by this minor version.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Fixes: 75976de655 ("NFSD: Return word2 bitmask if setting security..,")
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
In error cases, lgp->lg_layout_type may be out of bounds; so we
shouldn't be using it until after the check of nfserr.
This was seen to crash nfsd threads when the server receives a LAYOUTGET
request with a large layout type.
GETDEVICEINFO has the same problem.
Reported-by: Ari Kauppi <Ari.Kauppi@synopsys.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Pull RCU updates from Ingo Molnar:
"The main changes are:
- Debloat RCU headers
- Parallelize SRCU callback handling (plus overlapping patches)
- Improve the performance of Tree SRCU on a CPU-hotplug stress test
- Documentation updates
- Miscellaneous fixes"
* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (74 commits)
rcu: Open-code the rcu_cblist_n_lazy_cbs() function
rcu: Open-code the rcu_cblist_n_cbs() function
rcu: Open-code the rcu_cblist_empty() function
rcu: Separately compile large rcu_segcblist functions
srcu: Debloat the <linux/rcu_segcblist.h> header
srcu: Adjust default auto-expediting holdoff
srcu: Specify auto-expedite holdoff time
srcu: Expedite first synchronize_srcu() when idle
srcu: Expedited grace periods with reduced memory contention
srcu: Make rcutorture writer stalls print SRCU GP state
srcu: Exact tracking of srcu_data structures containing callbacks
srcu: Make SRCU be built by default
srcu: Fix Kconfig botch when SRCU not selected
rcu: Make non-preemptive schedule be Tasks RCU quiescent state
srcu: Expedite srcu_schedule_cbs_snp() callback invocation
srcu: Parallelize callback handling
kvm: Move srcu_struct fields to end of struct kvm
rcu: Fix typo in PER_RCU_NODE_PERIOD header comment
rcu: Use true/false in assignment to bool
rcu: Use bool value directly
...
Pull overlayfs update from Miklos Szeredi:
"The biggest part of this is making st_dev/st_ino on the overlay behave
like a normal filesystem (i.e. st_ino doesn't change on copy up,
st_dev is the same for all files and directories). Currently this only
works if all layers are on the same filesystem, but future work will
move the general case towards more sane behavior.
There are also miscellaneous fixes, including fixes to handling
append-only files. There's a small change in the VFS, but that only
has an effect on overlayfs, since otherwise file->f_path.dentry->inode
and file_inode(file) are always the same"
* 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
ovl: update documentation w.r.t. constant inode numbers
ovl: persistent inode numbers for upper hardlinks
ovl: merge getattr for dir and nondir
ovl: constant st_ino/st_dev across copy up
ovl: persistent inode number for directories
ovl: set the ORIGIN type flag
ovl: lookup non-dir copy-up-origin by file handle
ovl: use an auxiliary var for overlay root entry
ovl: store file handle of lower inode on copy up
ovl: check if all layers are on the same fs
ovl: do not set overlay.opaque on non-dir create
ovl: check IS_APPEND() on real upper inode
vfs: ftruncate check IS_APPEND() on real upper inode
ovl: Use designated initializers
ovl: lockdep annotate of nested stacked overlayfs inode lock
Pull fuse updates from Miklos Szeredi:
"Support for pid namespaces from Seth and refcount_t work from Elena"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
fuse: Add support for pid namespaces
fuse: convert fuse_conn.count from atomic_t to refcount_t
fuse: convert fuse_req.count from atomic_t to refcount_t
fuse: convert fuse_file.count from atomic_t to refcount_t
lock transfers from myself and the long awaited -ENOSPC handling series
from Jeff. The former will allow rbd users to take advantage of
exclusive lock's built-in blacklist/break-lock functionality while
staying in control of who owns the lock. With the latter in place, we
will abort filesystem writes on -ENOSPC instead of having them block
indefinitely.
Beyond that we've got the usual pile of filesystem fixes from Zheng,
some refcount_t conversion patches from Elena and a patch for an
ancient open() flags handling bug from Alexander.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQEcBAABCAAGBQJZEt/kAAoJEEp/3jgCEfOLpzAIAIld0N06DuHKG2F9mHEnLeGl
Y60BZ3Ajo32i9qPT/u9ntI99ZMlkuHcNWg6WpCCh8umbwk2eiAKRP/KcfGcWmmp9
EHj9COCmBR9TRM1pNS1lSMzljDnxf9sQmbIO9cwMQBUya5g19O0OpApzxF1YQhCR
V9B/FYV5IXELC3b/NH45oeDAD9oy/WgwbhQ2feTBQJmzIVJx+Je9hdhR1PH1rI06
ysyg3VujnUi/hoDhvPTBznNOxnHx/HQEecHH8b01MkbaCgxPH88jsUK/h7PYF3Gh
DE/sCN69HXeu1D/al3zKoZdahsJ5GWkj9Q+vvBoQJm+ZPsndC+qpgSj761n9v38=
=vamy
-----END PGP SIGNATURE-----
Merge tag 'ceph-for-4.12-rc1' of git://github.com/ceph/ceph-client
Pull ceph updates from Ilya Dryomov:
"The two main items are support for disabling automatic rbd exclusive
lock transfers from myself and the long awaited -ENOSPC handling
series from Jeff.
The former will allow rbd users to take advantage of exclusive lock's
built-in blacklist/break-lock functionality while staying in control
of who owns the lock. With the latter in place, we will abort
filesystem writes on -ENOSPC instead of having them block
indefinitely.
Beyond that we've got the usual pile of filesystem fixes from Zheng,
some refcount_t conversion patches from Elena and a patch for an
ancient open() flags handling bug from Alexander"
* tag 'ceph-for-4.12-rc1' of git://github.com/ceph/ceph-client: (31 commits)
ceph: fix memory leak in __ceph_setxattr()
ceph: fix file open flags on ppc64
ceph: choose readdir frag based on previous readdir reply
rbd: exclusive map option
rbd: return ResponseMessage result from rbd_handle_request_lock()
rbd: kill rbd_is_lock_supported()
rbd: support updating the lock cookie without releasing the lock
rbd: store lock cookie
rbd: ignore unlock errors
rbd: fix error handling around rbd_init_disk()
rbd: move rbd_unregister_watch() call into rbd_dev_image_release()
rbd: move rbd_dev_destroy() call out of rbd_dev_image_release()
ceph: when seeing write errors on an inode, switch to sync writes
Revert "ceph: SetPageError() for writeback pages if writepages fails"
ceph: handle epoch barriers in cap messages
libceph: add an epoch_barrier field to struct ceph_osd_client
libceph: abort already submitted but abortable requests when map or pool goes full
libceph: allow requests to return immediately on full conditions if caller wishes
libceph: remove req->r_replay_version
ceph: make seeky readdir more efficient
...
Pull btrfs updates from Chris Mason:
"This has fixes and cleanups Dave Sterba collected for the merge
window.
The biggest functional fixes are between btrfs raid5/6 and scrub, and
raid5/6 and device replacement. Some of our pending qgroup fixes are
included as well while I bash on the rest in testing.
We also have the usual set of cleanups, including one that makes
__btrfs_map_block() much more maintainable, and conversions from
atomic_t to refcount_t"
* 'for-linus-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (71 commits)
btrfs: fix the gfp_mask for the reada_zones radix tree
Btrfs: fix reported number of inode blocks
Btrfs: send, fix file hole not being preserved due to inline extent
Btrfs: fix extent map leak during fallocate error path
Btrfs: fix incorrect space accounting after failure to insert inline extent
Btrfs: fix invalid attempt to free reserved space on failure to cow range
btrfs: Handle delalloc error correctly to avoid ordered extent hang
btrfs: Fix metadata underflow caused by btrfs_reloc_clone_csum error
btrfs: check if the device is flush capable
btrfs: delete unused member nobarriers
btrfs: scrub: Fix RAID56 recovery race condition
btrfs: scrub: Introduce full stripe lock for RAID56
btrfs: Use ktime_get_real_ts for root ctime
Btrfs: handle only applicable errors returned by btrfs_get_extent
btrfs: qgroup: Fix qgroup corruption caused by inode_cache mount option
btrfs: use q which is already obtained from bdev_get_queue
Btrfs: switch to div64_u64 if with a u64 divisor
Btrfs: update scrub_parity to use u64 stripe_len
Btrfs: enable repair during read for raid56 profile
btrfs: use clear_page where appropriate
...
When processing responses, and in particular freeing mids (DeleteMidQEntry),
which is very important since it also frees the associated buffers (cifs_buf_release),
we can block a long time if (writes to) socket is slow due to low memory or networking
issues.
We can block in send (smb request) waiting for memory, and be blocked in processing
responess (which could free memory if we let it) - since they both grab the
server->srv_mutex.
In practice, in the DeleteMidQEntry case - there is no reason we need to
grab the srv_mutex so remove these around DeleteMidQEntry, and it allows
us to free memory faster.
Signed-off-by: Steve French <steve.french@primarydata.com>
Acked-by: Pavel Shilovsky <pshilov@microsoft.com>
cifs_relock_file() can perform a down_write() on the inode's lock_sem even
though it was already performed in cifs_strict_readv(). Lockdep complains
about this. AFAICS, there is no problem here, and lockdep just needs to be
told that this nesting is OK.
=============================================
[ INFO: possible recursive locking detected ]
4.11.0+ #20 Not tainted
---------------------------------------------
cat/701 is trying to acquire lock:
(&cifsi->lock_sem){++++.+}, at: cifs_reopen_file+0x7a7/0xc00
but task is already holding lock:
(&cifsi->lock_sem){++++.+}, at: cifs_strict_readv+0x177/0x310
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(&cifsi->lock_sem);
lock(&cifsi->lock_sem);
*** DEADLOCK ***
May be due to missing lock nesting notation
1 lock held by cat/701:
#0: (&cifsi->lock_sem){++++.+}, at: cifs_strict_readv+0x177/0x310
stack backtrace:
CPU: 0 PID: 701 Comm: cat Not tainted 4.11.0+ #20
Call Trace:
dump_stack+0x85/0xc2
__lock_acquire+0x17dd/0x2260
? trace_hardirqs_on_thunk+0x1a/0x1c
? preempt_schedule_irq+0x6b/0x80
lock_acquire+0xcc/0x260
? lock_acquire+0xcc/0x260
? cifs_reopen_file+0x7a7/0xc00
down_read+0x2d/0x70
? cifs_reopen_file+0x7a7/0xc00
cifs_reopen_file+0x7a7/0xc00
? printk+0x43/0x4b
cifs_readpage_worker+0x327/0x8a0
cifs_readpage+0x8c/0x2a0
generic_file_read_iter+0x692/0xd00
cifs_strict_readv+0x29f/0x310
generic_file_splice_read+0x11c/0x1c0
do_splice_to+0xa5/0xc0
splice_direct_to_actor+0xfa/0x350
? generic_pipe_buf_nosteal+0x10/0x10
do_splice_direct+0xb5/0xe0
do_sendfile+0x278/0x3a0
SyS_sendfile64+0xc4/0xe0
entry_SYSCALL_64_fastpath+0x1f/0xbe
Signed-off-by: Rabin Vincent <rabinv@axis.com>
Acked-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <smfrench@gmail.com>
UBSAN: Undefined behaviour in fs/nfsd/nfs4proc.c:1262:34
shift exponent 128 is too large for 32-bit type 'int'
Depending on compiler+architecture, this may cause the check for
layout_type to succeed for overly large values (which seems to be the
case with amd64). The large value will be later used in de-referencing
nfsd4_layout_ops for function pointers.
Reported-by: Jani Tuovila <tuovila@synopsys.com>
Signed-off-by: Ari Kauppi <ari@synopsys.com>
[colin.king@canonical.com: use LAYOUT_TYPE_MAX instead of 32]
Cc: stable@vger.kernel.org
Reviewed-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Layoutstats is always desirable when using the flexfiles driver, so
we should enable it if that driver is being loaded. It is safe to do
so, because even when the mount specifies NFSv4.1, we will turn it
off if the server tells us it is unsupported.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
It turns out the Linux server has a bug in its implementation of
supattr_exclcreat; it returns the set of all attributes, whether
or not they are supported by minor version 1.
In order to avoid a regression, we therefore apply the supported_attrs
as a mask on top of whatever the server sent us.
Reported-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Pull misc vfs updates from Al Viro:
"Assorted bits and pieces from various people. No common topic in this
pile, sorry"
* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fs/affs: add rename exchange
fs/affs: add rename2 to prepare multiple methods
Make stat/lstat/fstatat pass AT_NO_AUTOMOUNT to vfs_statx()
fs: don't set *REFERENCED on single use objects
fs: compat: Remove warning from COMPATIBLE_IOCTL
remove pointless extern of atime_need_update_rcu()
fs: completely ignore unknown open flags
fs: add a VALID_OPEN_FLAGS
fs: remove _submit_bh()
fs: constify tree_descr arrays passed to simple_fill_super()
fs: drop duplicate header percpu-rwsem.h
fs/affs: bugfix: Write files greater than page size on OFS
fs/affs: bugfix: enable writes on OFS disks
fs/affs: remove node generation check
fs/affs: import amigaffs.h
fs/affs: bugfix: make symbolic links work again
We fixed the bugs in it, but it's still an ugly interface, so let's see
if anybody actually depends on it. It's entirely possible that nothing
actually requires the whole "punch through read-only mappings"
semantics.
For example, gdb definitely uses the /proc/<pid>/mem interface, but it
looks like it mainly does it for regular reads of the target (that don't
need FOLL_FORCE), and looking at the gdb source code seems to fall back
on the traditional ptrace(PTRACE_POKEDATA) interface if it needs to.
If this breaks something, I do have a (more complex) version that only
enables FOLL_FORCE when somebody has PTRACE_ATTACH'ed to the target,
like the comment here used to say ("Maybe we should limit FOLL_FORCE to
actual ptrace users?").
Cc: Kees Cook <keescook@chromium.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Eric Biederman <ebiederm@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge more updates from Andrew Morton:
- the rest of MM
- various misc things
- procfs updates
- lib/ updates
- checkpatch updates
- kdump/kexec updates
- add kvmalloc helpers, use them
- time helper updates for Y2038 issues. We're almost ready to remove
current_fs_time() but that awaits a btrfs merge.
- add tracepoints to DAX
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (114 commits)
drivers/staging/ccree/ssi_hash.c: fix build with gcc-4.4.4
selftests/vm: add a test for virtual address range mapping
dax: add tracepoint to dax_insert_mapping()
dax: add tracepoint to dax_writeback_one()
dax: add tracepoints to dax_writeback_mapping_range()
dax: add tracepoints to dax_load_hole()
dax: add tracepoints to dax_pfn_mkwrite()
dax: add tracepoints to dax_iomap_pte_fault()
mtd: nand: nandsim: convert to memalloc_noreclaim_*()
treewide: convert PF_MEMALLOC manipulations to new helpers
mm: introduce memalloc_noreclaim_{save,restore}
mm: prevent potential recursive reclaim due to clearing PF_MEMALLOC
mm/huge_memory.c: deposit a pgtable for DAX PMD faults when required
mm/huge_memory.c: use zap_deposited_table() more
time: delete CURRENT_TIME_SEC and CURRENT_TIME
gfs2: replace CURRENT_TIME with current_time
apparmorfs: replace CURRENT_TIME with current_time()
lustre: replace CURRENT_TIME macro
fs: ubifs: replace CURRENT_TIME_SEC with current_time
fs: ufs: use ktime_get_real_ts64() for birthtime
...
Add a tracepoint to dax_insert_mapping(), following the same logging
conventions as the rest of DAX. This tracepoint, along with the one in
dax_load_hole(), lets us know how a DAX PTE fault was serviced.
Here is an example DAX fault that inserts a PTE mapping:
small-1126 [007] ....
145.451604: dax_pte_fault: dev 259:0 ino 0x1003 shared WRITE|ALLOW_RETRY|KILLABLE|USER address 0x10420000 pgoff 0x220
small-1126 [007] ....
145.452317: dax_insert_mapping: dev 259:0 ino 0x1003 shared write address 0x10420000 radix_entry 0x100006
small-1126 [007] ....
145.452399: dax_pte_fault_done: dev 259:0 ino 0x1003 shared WRITE|ALLOW_RETRY|KILLABLE|USER address 0x10420000 pgoff 0x220 MAJOR|NOPAGE
Link: http://lkml.kernel.org/r/20170221195116.13278-7-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add a tracepoint to dax_writeback_one(), following the same logging
conventions as the rest of DAX.
Here is an example range writeback which ends up flushing one PMD and
one PTE:
test-1265 [003] ....
496.615250: dax_writeback_range: dev 259:0 ino 0x1003 pgoff 0x0-0x7ffffffffffff
test-1265 [003] ....
496.616263: dax_writeback_one: dev 259:0 ino 0x1003 pgoff 0x0 pglen 0x200
test-1265 [003] ....
496.616270: dax_writeback_one: dev 259:0 ino 0x1003 pgoff 0x305 pglen 0x1
test-1265 [003] ....
496.616272: dax_writeback_range_done: dev 259:0 ino 0x1003 pgoff 0x0-0x7ffffffffffff
[akpm@linux-foundation.org: struct blk_dax_ctl has disappeared]
Link: http://lkml.kernel.org/r/20170221195116.13278-6-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add tracepoints to dax_writeback_mapping_range(), following the same
logging conventions as the rest of DAX.
Here is an example writeback call:
msync-1085 [006] ....
200.902565: dax_writeback_range: dev 259:0 ino 0x1003 pgoff 0x200-0x2ff
msync-1085 [006] ....
200.902579: dax_writeback_range_done: dev 259:0 ino 0x1003 pgoff 0x200-0x2ff
[ross.zwisler@linux.intel.com: fix regression in dax_writeback_mapping_range()]
Link: http://lkml.kernel.org/r/20170314215358.31451-1-ross.zwisler@linux.intel.com
Link: http://lkml.kernel.org/r/20170221195116.13278-5-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add tracepoints to dax_load_hole(), following the same logging conventions
as the rest of DAX.
Here is the logging generated by a PTE read from a hole:
read-1075 [002] ....
62.362108: dax_pte_fault: dev 259:0 ino 0x1003 shared ALLOW_RETRY|KILLABLE|USER address 0x10480000 pgoff 0x280
read-1075 [002] ....
62.362140: dax_load_hole: dev 259:0 ino 0x1003 shared ALLOW_RETRY|KILLABLE|USER address 0x10480000 pgoff 0x280 NOPAGE
read-1075 [002] ....
62.362141: dax_pte_fault_done: dev 259:0 ino 0x1003 shared ALLOW_RETRY|KILLABLE|USER address 0x10480000 pgoff 0x280 NOPAGE
Link: http://lkml.kernel.org/r/20170221195116.13278-4-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add tracepoints to dax_pfn_mkwrite(), following the same logging
conventions as the rest of DAX.
Here is an example PTE fault followed by a pfn_mkwrite:
small_aligned-1094 [002] ....
374.084998: dax_pte_fault: dev 259:0 ino 0x1003 shared WRITE|ALLOW_RETRY|KILLABLE|USER address 0x10400000 pgoff 0x200
small_aligned-1094 [002] ....
374.085145: dax_pte_fault_done: dev 259:0 ino 0x1003 shared WRITE|ALLOW_RETRY|KILLABLE|USER address 0x10400000 pgoff 0x200 MAJOR|NOPAGE
small_aligned-1094 [002] ....
374.085165: dax_pfn_mkwrite: dev 259:0 ino 0x1003 shared WRITE|MKWRITE|ALLOW_RETRY|KILLABLE|USER address 0x10400000 pgoff 0x200 NOPAGE
Link: http://lkml.kernel.org/r/20170221195116.13278-3-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "second round of tracepoints for DAX".
This second round of DAX tracepoint patches adds tracing to the PTE
fault path (dax_iomap_pte_fault(), dax_pfn_mkwrite(), dax_load_hole(),
dax_insert_mapping()) and to the writeback path
(dax_writeback_mapping_range(), dax_writeback_one()).
The purpose of this tracing is to give us a high level view of what DAX
is doing, whether faults are being serviced by PMDs or PTEs, and by real
storage or by zero pages covering holes.
I do have some patches nearly ready which also add tracing to
grab_mapping_entry() and dax_insert_mapping_entry(). These are more
targeted at logging how we are interacting with the radix tree, how we
use empty entries for locking, whether we "downgrade" huge zero pages to
4k PTE sized allocations, etc. In the end it seemed to me that this
might be too detailed to have as constantly present tracepoints, but if
anyone sees value in having tracepoints like this in the DAX code
permanently (Jan?), please let me know and I'll add those last two
patches.
All these tracepoints were done to be consistent with the style of the
XFS tracepoints and with the existing DAX PMD tracepoints.
This patch (of 6):
Add tracepoints to dax_iomap_pte_fault(), following the same logging
conventions as the rest of DAX.
Here is an example fault that initially tries to be serviced by the PMD
fault handler but which falls back to PTEs because the VMA isn't large
enough to hold a PMD:
small-1086 [005] ....
71.140014: xfs_filemap_huge_fault: dev 259:0 ino 0x1003
small-1086 [005] ....
71.140027: dax_pmd_fault: dev 259:0 ino 0x1003 shared WRITE|ALLOW_RETRY|KILLABLE|USER address 0x10420000 vm_start 0x10200000 vm_end 0x10500000 pgoff 0x220 max_pgoff 0x1400
small-1086 [005] ....
71.140028: dax_pmd_fault_done: dev 259:0 ino 0x1003 shared WRITE|ALLOW_RETRY|KILLABLE|USER address 0x10420000 vm_start 0x10200000 vm_end 0x10500000 pgoff 0x220 max_pgoff 0x1400 FALLBACK
small-1086 [005] ....
71.140035: dax_pte_fault: dev 259:0 ino 0x1003 shared WRITE|ALLOW_RETRY|KILLABLE|USER address 0x10420000 pgoff 0x220
small-1086 [005] ....
71.140396: dax_pte_fault_done: dev 259:0 ino 0x1003 shared WRITE|ALLOW_RETRY|KILLABLE|USER address 0x10420000 pgoff 0x220 MAJOR|NOPAGE
Link: http://lkml.kernel.org/r/20170221195116.13278-2-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
CURRENT_TIME_SEC is not y2038 safe. current_time() will be transitioned
to use 64 bit time along with vfs in a separate patch. There is no plan
to transition CURRENT_TIME_SEC to use y2038 safe time interfaces.
current_time() returns timestamps according to the granularities set in
the inode's super_block. The granularity check to call
current_fs_time() or CURRENT_TIME_SEC is not required.
Use current_time() directly to update inode timestamp. Use
timespec_trunc during file system creation, before the first inode is
created.
Link: http://lkml.kernel.org/r/1491613030-11599-9-git-send-email-deepa.kernel@gmail.com
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Cc: Richard Weinberger <richard@nod.at>
Cc: Artem Bityutskiy <dedekind1@gmail.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
CURRENT_TIME is not y2038 safe. Replace it with ktime_get_real_ts64().
Inode time formats are already 64 bit long and accommodates time64_t.
Link: http://lkml.kernel.org/r/1491613030-11599-6-git-send-email-deepa.kernel@gmail.com
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: Evgeniy Dushistov <dushistov@mail.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
CURRENT_TIME is not y2038 safe. The macro will be deleted and all the
references to it will be replaced by ktime_get_* apis.
struct timespec is also not y2038 safe. Retain timespec for timestamp
representation here as ceph uses it internally everywhere. These
references will be changed to use struct timespec64 in a separate patch.
The current_fs_time() api is being changed to use vfs struct inode* as
an argument instead of struct super_block*.
Set the new mds client request r_stamp field using ktime_get_real_ts()
instead of using current_fs_time().
Also, since r_stamp is used as mtime on the server, use timespec_trunc()
to truncate the timestamp, using the right granularity from the
superblock.
This api will be transitioned to be y2038 safe along with vfs.
Link: http://lkml.kernel.org/r/1491613030-11599-5-git-send-email-deepa.kernel@gmail.com
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
M: Ilya Dryomov <idryomov@gmail.com>
M: "Yan, Zheng" <zyan@redhat.com>
M: Sage Weil <sage@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
CURRENT_TIME macro is not y2038 safe on 32 bit systems.
The patch replaces all the uses of CURRENT_TIME by current_time() for
filesystem times, and ktime_get_* functions for authentication
timestamps and timezone calculations.
This is also in preparation for the patch that transitions vfs
timestamps to use 64 bit time and hence make them y2038 safe.
CURRENT_TIME macro will be deleted before merging the aforementioned
change.
The inode timestamps read from the server are assumed to have correct
granularity and range.
The patch also assumes that the difference between server and client
times lie in the range INT_MIN..INT_MAX. This is valid because this is
the difference between current times between server and client, and the
largest timezone difference is in the range of one day.
All cifs timestamps currently use timespec representation internally.
Authentication and timezone timestamps can also be transitioned into
using timespec64 when all other timestamps for cifs is transitioned to
use timespec64.
Link: http://lkml.kernel.org/r/1491613030-11599-4-git-send-email-deepa.kernel@gmail.com
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Cc: Steve French <sfrench@samba.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
CURRENT_TIME_SEC is not y2038 safe.
Replace use of CURRENT_TIME_SEC with ktime_get_real_seconds in segment
timestamps used by GC algorithm including the segment mtime timestamps.
Link: http://lkml.kernel.org/r/1491613030-11599-2-git-send-email-deepa.kernel@gmail.com
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit afddba49d1 ("fs: introduce write_begin, write_end, and
perform_write aops") introduced AOP_FLAG_UNINTERRUPTIBLE flag which was
checked in pagecache_write_begin(), but that check was removed by
4e02ed4b4a ("fs: remove prepare_write/commit_write").
Between these two commits, commit d9414774dc ("cifs: Convert cifs to
new aops.") added a check in cifs_write_begin(), but that check was soon
removed by commit a98ee8c1c7 ("[CIFS] fix regression in
cifs_write_begin/cifs_write_end").
Therefore, AOP_FLAG_UNINTERRUPTIBLE flag is checked nowhere. Let's
remove this flag. This patch has no functionality changes.
Link: http://lkml.kernel.org/r/1489294781-53494-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Nick Piggin <npiggin@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix typos and add the following to the scripts/spelling.txt:
intialisation||initialisation
intialised||initialised
intialise||initialise
This commit does not intend to change the British spelling itself.
Link: http://lkml.kernel.org/r/1481573103-11329-18-git-send-email-yamada.masahiro@socionext.com
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
__vmalloc* allows users to provide gfp flags for the underlying
allocation. This API is quite popular
$ git grep "=[[:space:]]__vmalloc\|return[[:space:]]*__vmalloc" | wc -l
77
The only problem is that many people are not aware that they really want
to give __GFP_HIGHMEM along with other flags because there is really no
reason to consume precious lowmemory on CONFIG_HIGHMEM systems for pages
which are mapped to the kernel vmalloc space. About half of users don't
use this flag, though. This signals that we make the API unnecessarily
too complex.
This patch simply uses __GFP_HIGHMEM implicitly when allocating pages to
be mapped to the vmalloc space. Current users which add __GFP_HIGHMEM
are simplified and drop the flag.
Link: http://lkml.kernel.org/r/20170307141020.29107-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: Cristopher Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There are many code paths opencoding kvmalloc. Let's use the helper
instead. The main difference to kvmalloc is that those users are
usually not considering all the aspects of the memory allocator. E.g.
allocation requests <= 32kB (with 4kB pages) are basically never failing
and invoke OOM killer to satisfy the allocation. This sounds too
disruptive for something that has a reasonable fallback - the vmalloc.
On the other hand those requests might fallback to vmalloc even when the
memory allocator would succeed after several more reclaim/compaction
attempts previously. There is no guarantee something like that happens
though.
This patch converts many of those places to kv[mz]alloc* helpers because
they are more conservative.
Link: http://lkml.kernel.org/r/20170306103327.2766-2-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> # Xen bits
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Andreas Dilger <andreas.dilger@intel.com> # Lustre
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com> # KVM/s390
Acked-by: Dan Williams <dan.j.williams@intel.com> # nvdim
Acked-by: David Sterba <dsterba@suse.com> # btrfs
Acked-by: Ilya Dryomov <idryomov@gmail.com> # Ceph
Acked-by: Tariq Toukan <tariqt@mellanox.com> # mlx4
Acked-by: Leon Romanovsky <leonro@mellanox.com> # mlx5
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Anton Vorontsov <anton@enomsg.org>
Cc: Colin Cross <ccross@android.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Santosh Raspatur <santosh@chelsio.com>
Cc: Hariprasad S <hariprasad@chelsio.com>
Cc: Yishai Hadas <yishaih@mellanox.com>
Cc: Oleg Drokin <oleg.drokin@intel.com>
Cc: "Yan, Zheng" <zyan@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
getxattr uses vmalloc to allocate memory if kzalloc fails. This is
filled by vfs_getxattr and then copied to the userspace. vmalloc,
however, doesn't zero out the memory so if the specific implementation
of the xattr handler is sloppy we can theoretically expose a kernel
memory. There is no real sign this is really the case but let's make
sure this will not happen and use vzalloc instead.
Fixes: 779302e678 ("fs/xattr.c:getxattr(): improve handling of allocation failures")
Link: http://lkml.kernel.org/r/20170306103327.2766-1-mhocko@kernel.org
Acked-by: Kees Cook <keescook@chromium.org>
Reported-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: <stable@vger.kernel.org> [3.6+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "kvmalloc", v5.
There are many open coded kmalloc with vmalloc fallback instances in the
tree. Most of them are not careful enough or simply do not care about
the underlying semantic of the kmalloc/page allocator which means that
a) some vmalloc fallbacks are basically unreachable because the kmalloc
part will keep retrying until it succeeds b) the page allocator can
invoke a really disruptive steps like the OOM killer to move forward
which doesn't sound appropriate when we consider that the vmalloc
fallback is available.
As it can be seen implementing kvmalloc requires quite an intimate
knowledge if the page allocator and the memory reclaim internals which
strongly suggests that a helper should be implemented in the memory
subsystem proper.
Most callers, I could find, have been converted to use the helper
instead. This is patch 6. There are some more relying on __GFP_REPEAT
in the networking stack which I have converted as well and Eric Dumazet
was not opposed [2] to convert them as well.
[1] http://lkml.kernel.org/r/20170130094940.13546-1-mhocko@kernel.org
[2] http://lkml.kernel.org/r/1485273626.16328.301.camel@edumazet-glaptop3.roam.corp.google.com
This patch (of 9):
Using kmalloc with the vmalloc fallback for larger allocations is a
common pattern in the kernel code. Yet we do not have any common helper
for that and so users have invented their own helpers. Some of them are
really creative when doing so. Let's just add kv[mz]alloc and make sure
it is implemented properly. This implementation makes sure to not make
a large memory pressure for > PAGE_SZE requests (__GFP_NORETRY) and also
to not warn about allocation failures. This also rules out the OOM
killer as the vmalloc is a more approapriate fallback than a disruptive
user visible action.
This patch also changes some existing users and removes helpers which
are specific for them. In some cases this is not possible (e.g.
ext4_kvmalloc, libcfs_kvzalloc) because those seems to be broken and
require GFP_NO{FS,IO} context which is not vmalloc compatible in general
(note that the page table allocation is GFP_KERNEL). Those need to be
fixed separately.
While we are at it, document that __vmalloc{_node} about unsupported gfp
mask because there seems to be a lot of confusion out there.
kvmalloc_node will warn about GFP_KERNEL incompatible (which are not
superset) flags to catch new abusers. Existing ones would have to die
slowly.
[sfr@canb.auug.org.au: f2fs fixup]
Link: http://lkml.kernel.org/r/20170320163735.332e64b7@canb.auug.org.au
Link: http://lkml.kernel.org/r/20170306103032.2540-2-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Reviewed-by: Andreas Dilger <adilger@dilger.ca> [ext4 part]
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
pid_ns_for_children set by a task is known only to the task itself, and
it's impossible to identify it from outside.
It's a big problem for checkpoint/restore software like CRIU, because it
can't correctly handle tasks, that do setns(CLONE_NEWPID) in proccess of
their work.
This patch solves the problem, and it exposes pid_ns_for_children to ns
directory in standard way with the name "pid_for_children":
~# ls /proc/5531/ns -l | grep pid
lrwxrwxrwx 1 root root 0 Jan 14 16:38 pid -> pid:[4026531836]
lrwxrwxrwx 1 root root 0 Jan 14 16:38 pid_for_children -> pid:[4026532286]
Link: http://lkml.kernel.org/r/149201123914.6007.2187327078064239572.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Andrei Vagin <avagin@virtuozzo.com>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul Moore <paul@paul-moore.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Serge Hallyn <serge@hallyn.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "Expose task pid_ns_for_children to userspace".
pid_ns_for_children set by a task is known only to the task itself, and
it's impossible to identify it from outside.
It's a big problem for checkpoint/restore software like CRIU, because it
can't correctly handle tasks, that do setns(CLONE_NEWPID) in proccess of
their work. If they have a custom pid_ns_for_children before dump, they
must have the same ns after restore. Otherwise, restored task bumped
into enviroment it does not expect.
This patchset solves the problem. It exposes pid_ns_for_children to ns
directory in standard way with the name "pid_for_children":
~# ls /proc/5531/ns -l | grep pid
lrwxrwxrwx 1 root root 0 Jan 14 16:38 pid -> pid:[4026531836]
lrwxrwxrwx 1 root root 0 Jan 14 16:38 pid_for_children -> pid:[4026532286]
This patch (of 2):
Make possible to have link content prefix yyy different from the link
name xxx:
$ readlink /proc/[pid]/ns/xxx
yyy:[4026531838]
This will be used in next patch.
Link: http://lkml.kernel.org/r/149201120318.6007.7362655181033883000.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Andrei Vagin <avagin@virtuozzo.com>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul Moore <paul@paul-moore.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Serge Hallyn <serge@hallyn.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Prepare to mark sensitive kernel structures for randomization by making
sure they're using designated initializers. These were identified
during allyesconfig builds of x86, arm, and arm64, with most initializer
fixes extracted from grsecurity.
Link: http://lkml.kernel.org/r/20170329210419.GA40066@beast
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Coccinelle emits this warning:
WARNING: casting value returned by memory allocation function to (struct proc_inode *) is useless.
Remove unnecessary cast.
Link: http://lkml.kernel.org/r/1487745720-16967-1-git-send-email-me@tobin.cc
Signed-off-by: Tobin C. Harding <me@tobin.cc>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Instead of messing with the commit path which has been causing issues,
add a COMMIT op after the COPY and ask for stable copies in the first
space.
It saves a round trip, since after the COPY, the client sends a COMMIT
anyway.
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
As reported by David Jeffery: "a signal was sent to lockd while lockd
was shutting down from a request to stop nfs. The signal causes lockd
to call restart_grace() which puts the lockd_net structure on the grace
list. If this signal is received at the wrong time, it will occur after
lockd_down_net() has called locks_end_grace() but before
lockd_down_net() stops the lockd thread. This leads to lockd putting
the lockd_net structure back on the grace list, then exiting without
anything removing it from the list."
So, perform the final locks_end_grace() from the the lockd thread; this
ensures it's serialized with respect to restart_grace().
Reported-by: David Jeffery <djeffery@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
In this round, we've focused on enhancing performance with regards to block
allocation, GC, and discard/in-place-update IO controls. There are a bunch
of clean-ups as well as minor bug fixes.
= Enhancement
- disable heap-based allocation by default
- issue small-sized discard commands by default
- change the policy of data hotness for logging
- distinguish IOs in terms of size and wbc type
- start SSR earlier to avoid foreground GC
- enhance data structures managing discard commands
- enhance in-place update flow
- add some more fault injection routines
- secure one more xattr entry
= Bug fix
- calculate victim cost for GC correctly
- remain correct victim segment number for GC
- race condition in nid allocator and initializer
- stale pointer produced by atomic_writes
- fix missing REQ_SYNC for flush commands
- handle missing errors in more corner cases
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJZEKXrAAoJEEAUqH6CSFDSJJ8P/1Zy0NS9TM/PFtT7Sevb6vgC
LcKLtX1bVhUuX9wAt5Q6BZ9927tCQPt5vLEYUxtniqEQaC0fsJAMbRYot+gR/dvN
4bGgv1TeVST5pKbmctzhAL30PvZ1w4QS6dLvPMm2sPQSrPKGUGt0J8wPiHHZuvH4
pygKzDxbrIJTeMhLm9tgFg7dWTJXV3VDb57WpA1AM1LAFVsIPF4vZnryLv3GsRmY
eGRxgZEtt/90hCRbEcPirPZrtpv/O5f12K4Vp/NPw+4XGMEk+nTYndq6rlUWVNjg
iPEDuxONyk/yb274SqB6sbNDuxHOqn7stGJepdUpSbprIsLZ0RmMaYWjSNsLU3Vh
p4fAzRqvfSqAHCt0FEL/vT8M9ST5xQRVr9P/l0kDK5Ww95RROd05bEaGm/sKc7NB
PHiWUoMIFFmuVsoCi6sM0AKps53ZGON8GEUyVKyM7NWTw1oWLPWifGMthEkysmwm
08SdU5+XqbCeyMPAA2GURqMA5A8ssuA8+F0Citf4JPckQHPPj5pAydmx2wVlfBlc
/bneR7T/8OsUbxgG8JSbdHUiPcjb20F0GTxSOTXiV/AaZAMCtyETnw64K2V6E0n7
uraKcYYhypyphCj/IYc4vnQ3dCu3U2/NvTYEVX8DBvboN38/JVqmNWgQx9g+tLzj
+r5s7PqTDuXv5Cfzc5NC
=SBUb
-----END PGP SIGNATURE-----
Merge tag 'for-f2fs-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
Pull f2fs updates from Jaegeuk Kim:
"In this round, we've focused on enhancing performance with regards to
block allocation, GC, and discard/in-place-update IO controls. There
are a bunch of clean-ups as well as minor bug fixes.
Enhancements:
- disable heap-based allocation by default
- issue small-sized discard commands by default
- change the policy of data hotness for logging
- distinguish IOs in terms of size and wbc type
- start SSR earlier to avoid foreground GC
- enhance data structures managing discard commands
- enhance in-place update flow
- add some more fault injection routines
- secure one more xattr entry
Bug fixes:
- calculate victim cost for GC correctly
- remain correct victim segment number for GC
- race condition in nid allocator and initializer
- stale pointer produced by atomic_writes
- fix missing REQ_SYNC for flush commands
- handle missing errors in more corner cases"
* tag 'for-f2fs-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (111 commits)
f2fs: fix a mount fail for wrong next_scan_nid
f2fs: enhance scalability of trace macro
f2fs: relocate inode_{,un}lock in F2FS_IOC_SETFLAGS
f2fs: Make flush bios explicitely sync
f2fs: show available_nids in f2fs/status
f2fs: flush dirty nats periodically
f2fs: introduce CP_TRIMMED_FLAG to avoid unneeded discard
f2fs: allow cpc->reason to indicate more than one reason
f2fs: release cp and dnode lock before IPU
f2fs: shrink size of struct discard_cmd
f2fs: don't hold cmd_lock during waiting discard command
f2fs: nullify fio->encrypted_page for each writes
f2fs: sanity check segment count
f2fs: introduce valid_ipu_blkaddr to clean up
f2fs: lookup extent cache first under IPU scenario
f2fs: reconstruct code to write a data page
f2fs: introduce __wait_discard_cmd
f2fs: introduce __issue_discard_cmd
f2fs: enable small discard by default
f2fs: delay awaking discard thread
...
Change 'convert' to 'converts'
Change 'UBIFS' to 'UBIFS inode flags'
Signed-off-by: Rock Lee <rockdotlee@gmail.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
Assigning a value of a variable to itself is not useful.
Signed-off-by: Stefan Agner <stefan@agner.ch>
Signed-off-by: Richard Weinberger <richard@nod.at>
The check for the bad node type of sb->type is checking sa->type
and not sb-type. This looks like a cut and paste error. Fix this.
Detected by PVS-Studio, warning: V581
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
When write syscall is called, every time security label is searched to
determine that file's privileges should be changed.
If LSM(Linux Security Model) is not used, this is useless.
So introduce CONFIG_UBIFS_SECURITY to disable security labels. it's default
value is "y".
Signed-off-by: Hyunchul Lee <cheol.lee@lge.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAlkPYHkACgkQ8vlZVpUN
gaM97ggAlOm8n/tlbcdonX/+HHjlnqcy5uYD7A9AH/JordpRzy4eqcMbxMG39p1R
DBtjo9Y0i3iFEGajRc0h7KXDLeTBUQ/JZpR8H60MFfAQHnTowuI91eb3/6QeZiHh
CN/2KKzpYitPIEUfEHnVeYKOfvrzR7je5hrEiAwEkPeKv7XyrNVM0LHQ/jKpbQwg
ntIzHvxjQyo8plx/m5S4Yew7tqjYpNiq4plmyk/Vxtw2FmB/FC76UxYeadoB3EI5
etw+bCORB0tFZO27o56kXywg+mDcp7HEtVvq9LG28oEuBDAVKNoeKEvV7SiOBlZp
+HnqIz5Hx1UTxOlTAc10IjvEhriEuw==
=qCDl
-----END PGP SIGNATURE-----
Merge tag 'fscrypt_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/fscrypt
Pull fscrypt updates from Ted Ts'o:
"Only bug fixes and cleanups for this merge window"
* tag 'fscrypt_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/fscrypt:
fscrypt: correct collision claim for digested names
MAINTAINERS: fscrypt: update mailing list, patchwork, and git
ext4: clean up ext4_match() and callers
f2fs: switch to using fscrypt_match_name()
ext4: switch to using fscrypt_match_name()
fscrypt: introduce helper function for filename matching
fscrypt: avoid collisions when presenting long encrypted filenames
f2fs: check entire encrypted bigname when finding a dentry
ubifs: check for consistent encryption contexts in ubifs_lookup()
f2fs: sync f2fs_lookup() with ext4_lookup()
ext4: remove "nokey" check from ext4_lookup()
fscrypt: fix context consistency check when key(s) unavailable
fscrypt: Remove __packed from fscrypt_policy
fscrypt: Move key structure and constants to uapi
fscrypt: remove fscrypt_symlink_data_len()
fscrypt: remove unnecessary checks for NULL operations
file systems and for random write workloads into a preallocated file;
bug fixes and cleanups.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAlkPYB8ACgkQ8vlZVpUN
gaP1HwgApoMQGegtRIbCZKUzKBJ2S6vwIoPAMz62JuwngOyWygJ1T1TliKTitG04
XvijKpUHtEggMO/ZsUOCoyr2LzJlpVvvrJZsavEubO12LKreYMpvNraZF1GACYTb
lIZpdWkpcEz5WnPV/PXW/dEMcSMhnKe8tbmHXMyAouSC6a55F5Wp456KF/plqkHU
zkWTCDbEOtHThzpL8cthUL71ji62I3Op5jn/qOfKCm6/JtUlw5pYjWkRUNqqjSQE
uQqMpqLxI/VjOdEiBPxEF6A+ZudZmoBQKY15ibWCcHUPFOPqk4RdYz6VivRI7zrg
KrrKcdFT29MtKnRfAAoJcc0nJ4e1Iw==
=il74
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 updates from Ted Ts'o:
- add GETFSMAP support
- some performance improvements for very large file systems and for
random write workloads into a preallocated file
- bug fixes and cleanups.
* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
jbd2: cleanup write flags handling from jbd2_write_superblock()
ext4: mark superblock writes synchronous for nobarrier mounts
ext4: inherit encryption xattr before other xattrs
ext4: replace BUG_ON with WARN_ONCE in ext4_end_bio()
ext4: avoid unnecessary transaction stalls during writeback
ext4: preload block group descriptors
ext4: make ext4_shutdown() static
ext4: support GETFSMAP ioctls
vfs: add common GETFSMAP ioctl definitions
ext4: evict inline data when writing to memory map
ext4: remove ext4_xattr_check_entry()
ext4: rename ext4_xattr_check_names() to ext4_xattr_check_entries()
ext4: merge ext4_xattr_list() into ext4_listxattr()
ext4: constify static data that is never modified
ext4: trim return value and 'dir' argument from ext4_insert_dentry()
jbd2: fix dbench4 performance regression for 'nobarrier' mounts
jbd2: Fix lockdep splat with generic/270 test
mm: retry writepages() on ENOMEM when doing an data integrity writeback
For configurations that do not enable DAX filesystems or drivers, do not
require the DAX core to be built.
Given that the 'direct_access' method has been removed from
'block_device_operations', we can also go ahead and remove the
block-related dax helper functions from fs/block_dev.c to
drivers/dax/super.c. This keeps dax details out of the block layer and
lets the DAX core be built as a module in the FS_DAX=n case.
Filesystems need to include dax.h to call bdev_dax_supported().
Cc: linux-xfs@vger.kernel.org
Cc: Jens Axboe <axboe@kernel.dk>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.com>
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
When using NFS4_CREATE_EXCLUSIVE4_1 mode, the client will overestimate the
amount of space that it needs for the attributes because it does so
before checking whether or not the server supports a given attribute.
Fix by checking the attribute mask earlier.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
The intention in the original patch was to release the lock when
we put the inode, however something got screwed up.
Reported-by: Jason Yan <yanaijie@huawei.com>
Fixes: 7b410d9ce4 ("pNFS: Delay getting the layout header in..")
Cc: stable@vger.kernel.org # v4.10+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Pull cifs fixes from Steve French:
"Various fixes for stable for CIFS/SMB3 especially for better
interoperability for SMB3 to Macs.
It also includes Pavel's improvements to SMB3 async i/o support
(which is much faster now)"
* 'for-next' of git://git.samba.org/sfrench/cifs-2.6:
CIFS: add misssing SFM mapping for doublequote
SMB3: Work around mount failure when using SMB3 dialect to Macs
cifs: fix CIFS_IOC_GET_MNT_INFO oops
CIFS: fix mapping of SFM_SPACE and SFM_PERIOD
CIFS: fix oplock break deadlocks
cifs: fix CIFS_ENUMERATE_SNAPSHOTS oops
cifs: fix leak in FSCTL_ENUM_SNAPS response handling
Set unicode flag on cifs echo request to avoid Mac error
CIFS: Add asynchronous write support through kernel AIO
CIFS: Add asynchronous read support through kernel AIO
CIFS: Add asynchronous context to support kernel AIO
cifs: fix IPv6 link local, with scope id, address parsing
cifs: small underflow in cnvrtDosUnixTm()
- various code cleanups
- introduce GETFSMAP ioctl
- various refactoring
- avoid dio reads past eof
- fix memory corruption and other errors with fragmented directory blocks
- fix accidental userspace memory corruptions
- publish fs uuid in superblock
- make fstrim terminatable
- fix race between quotaoff and in-core inode creation
- Avoid use-after-free when finishing up w/ buffer heads
- Reserve enough space to handle bmap tree resizing during cow remap
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABCgAGBQJZDfIzAAoJEPh/dxk0SrTrsEgP/3TjYbaqsad2e6KqtZwqN/Qx
DUljUxReZl4rgnAaFD55XOPYWGZ2bBGNtAQlAR7/JYZuZs6obbBrqUukS19jPVi7
SeQdknnU3yTq17LrwEeeQUOhem28GHxYtQYazdgNoTigZXABeXWzi53HzvPw5+Ci
3a+zB1clu3cycKsD+UAhz/m0Z40ckjDMsDueJMOACiax+vPjlzSu36H9wzlF/h0R
nq7VGSDZy6aS3H75PDjWVxoJGUSdO7jHYxwQflkk6wxrcmTCLZxuiDeSANOZ2KxM
y8qTln6hqxalQSH9r6n84/XrQstYWfdLqwngIL5wMSvN6UbuFyNQKuouEkWs6EEZ
4cuSqfihT7o5VcIpYiq1ZDgNzzpmDDMMeho4J9WBvm5Qt5hgPCo3gzweE/C6Sscs
m+V1NvLd+kBiHoMhYPB8/lm4nXa/wT1Y3TtHc+8A/qkZKAwoOdxWKNIY58jfmdzb
Rvv0LKi+6W5zanzXlNs3NXJBwZAeHuHXKY3UJT4BAWfjdtS6QvIf1Bcpj9ApyqE2
oOnNMRhF+wSS9dSFoPXkRjzIyoR5CoOylB0KYV9OYELYPDLczwbvtX/9+tjDEol9
odCZyyzJtKxYQbwf2TQ/ZqXQV4vw6lWOB7G4Itx7yv0Taa9vQ7cxSX2MnE7TA/pW
IQKsE6C2I24Bfr2oPfms
=oKCc
-----END PGP SIGNATURE-----
Merge tag 'xfs-4.12-merge-7' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs updates from Darrick Wong:
"Here are the XFS changes for 4.12. The big new feature for this
release is the new space mapping ioctl that we've been discussing
since LSF2016, but other than that most of the patches are larger bug
fixes, memory corruption prevention, and other cleanups.
Summary:
- various code cleanups
- introduce GETFSMAP ioctl
- various refactoring
- avoid dio reads past eof
- fix memory corruption and other errors with fragmented directory blocks
- fix accidental userspace memory corruptions
- publish fs uuid in superblock
- make fstrim terminatable
- fix race between quotaoff and in-core inode creation
- avoid use-after-free when finishing up w/ buffer heads
- reserve enough space to handle bmap tree resizing during cow remap"
* tag 'xfs-4.12-merge-7' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (53 commits)
xfs: fix use-after-free in xfs_finish_page_writeback
xfs: reserve enough blocks to handle btree splits when remapping
xfs: wait on new inodes during quotaoff dquot release
xfs: update ag iterator to support wait on new inodes
xfs: support ability to wait on new inodes
xfs: publish UUID in struct super_block
xfs: Allow user to kill fstrim process
xfs: better log intent item refcount checking
xfs: fix up quotacheck buffer list error handling
xfs: remove xfs_trans_ail_delete_bulk
xfs: don't use bool values in trace buffers
xfs: fix getfsmap userspace memory corruption while setting OF_LAST
xfs: fix __user annotations for xfs_ioc_getfsmap
xfs: corruption needs to respect endianess too!
xfs: use NULL instead of 0 to initialize a pointer in xfs_ioc_getfsmap
xfs: use NULL instead of 0 to initialize a pointer in xfs_getfsmap
xfs: simplify validation of the unwritten extent bit
xfs: remove unused values from xfs_exntst_t
xfs: remove the unused XFS_MAXLINK_1 define
xfs: more do_div cleanups
...
Pull block fixes and updates from Jens Axboe:
"Some fixes and followup features/changes that should go in, in this
merge window. This contains:
- Two fixes for lightnvm from Javier, fixing problems in the new code
merge previously in this merge window.
- A fix from Jan for the backing device changes, fixing an issue in
NFS that causes a failure to mount on certain setups.
- A change from Christoph, cleaning up the blk-mq init and exit
request paths.
- Remove elevator_change(), which is now unused. From Bart.
- A fix for queue operation invocation on a dead queue, from Bart.
- A series fixing up mtip32xx for blk-mq scheduling, removing a
bandaid we previously had in place for this. From me.
- A regression fix for this series, fixing a case where we wait on
workqueue flushing from an invalid (non-blocking) context. From me.
- A fix/optimization from Ming, ensuring that we don't both quiesce
and freeze a queue at the same time.
- A fix from Peter on lock ordering for CPU hotplug. Not a real
problem right now, but will be once the CPU hotplug rework goes in.
- A series from Omar, cleaning up out blk-mq debugfs support, and
adding support for exporting info from schedulers in debugfs as
well. This is really useful in debugging stalls or livelocks. From
Omar"
* 'for-linus' of git://git.kernel.dk/linux-block: (28 commits)
mq-deadline: add debugfs attributes
kyber: add debugfs attributes
blk-mq-debugfs: allow schedulers to register debugfs attributes
blk-mq: untangle debugfs and sysfs
blk-mq: move debugfs declarations to a separate header file
blk-mq: Do not invoke queue operations on a dead queue
blk-mq-debugfs: get rid of a bunch of boilerplate
blk-mq-debugfs: rename hw queue directories from <n> to hctx<n>
blk-mq-debugfs: don't open code strstrip()
blk-mq-debugfs: error on long write to queue "state" file
blk-mq-debugfs: clean up flag definitions
blk-mq-debugfs: separate flags with |
nfs: Fix bdi handling for cloned superblocks
block/mq: Cure cpu hotplug lock inversion
lightnvm: fix bad back free on error path
lightnvm: create cmd before allocating request
blk-mq: don't use sync workqueue flushing from drivers
mtip32xx: convert internal commands to regular block infrastructure
mtip32xx: cleanup internal tag assumptions
block: don't call blk_mq_quiesce_queue() after queue is frozen
...
* Region media error reporting: A libnvdimm region device is the parent
to one or more namespaces. To date, media errors have been reported via
the "badblocks" attribute attached to pmem block devices for namespaces
in "raw" or "memory" mode. Given that namespaces can be in "device-dax"
or "btt-sector" mode this new interface reports media errors
generically, i.e. independent of namespace modes or state. This
subsequently allows userspace tooling to craft "ACPI 6.1 Section
9.20.7.6 Function Index 4 - Clear Uncorrectable Error" requests and
submit them via the ioctl path for NVDIMM root bus devices.
* Introduce 'struct dax_device' and 'struct dax_operations': Prompted by
a request from Linus and feedback from Christoph this allows for dax
capable drivers to publish their own custom dax operations. This fixes
the broken assumption that all dax operations are related to a
persistent memory device, and makes it easier for other architectures
and platforms to add customized persistent memory support.
* 'libnvdimm' core updates: A new "deep_flush" sysfs attribute is
available for storage appliance applications to manually trigger memory
controllers to drain write-pending buffers that would otherwise be
flushed automatically by the platform ADR (asynchronous-DRAM-refresh)
mechanism at a power loss event. Support for "locked" DIMMs is included
to prevent namespaces from surfacing when the namespace label data area
is locked. Finally, fixes for various reported deadlocks and crashes,
also tagged for -stable.
* ACPI / nfit driver updates: General updates of the nfit driver to add
DSM command overrides, ACPI 6.1 health state flags support, DSM payload
debug available by default, and various fixes.
Acknowledgements that came after the branch was pushed:
commmit 565851c972 "device-dax: fix sysfs attribute deadlock"
Tested-by: Yi Zhang <yizhan@redhat.com>
commit 23f4984483 "libnvdimm: rework region badblocks clearing"
Tested-by: Toshi Kani <toshi.kani@hpe.com>
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJZDONJAAoJEB7SkWpmfYgC3SsP/2KrLvTUcz646ViuPOgZ2cC4
W6wAx6cvDSt+H52kLnFEsYoFt7WAj20ggPirb/Bc5jkGlvwE0lT9Xtmso9GpVkYT
J9ZJ9pP/4YaAD3II1gmTwaUjYi0FxoOdx3Eb92yuWkO/8ylz4b2Nu3cBpYwyziGQ
nIfEVwDXRLE86u6x0bWuf6TlVuvsbdiAI55CDqDMVQC6xIOLbSez7b8QIHlpiKEb
Mw+xqdQva0esoreZEOXEhWNO+qtfILx8/ceBEGTNMp4e/JjZ2FbrSNplM+9bH5k7
ywqP8lW+mBEw0fmBBkYoVG/xyesiiBb55JLnbi8Ew+7IUxw8a3iV7wftRi62lHcK
zAjsHe4L+MansgtZsCL8wluvIPaktAdtB4xr7l9VNLKRYRUG73jEWU0gcUNryHIL
BkQJ52pUS1PkClyAsWbBBHl1I/CvzVPd21VW0YELmLR4OywKy1c+eKw2bcYgjrb4
59HZSv6S6EoKaQC+2qvVNpePil7cdfg5V2ubH/ki9HoYVyoxDptEWHnvf0NNatIH
Y7mNcOPvhOksJmnKSyHbDjtRur7WoHIlC9D7UjEFkSBWsKPjxJHoidN4SnCMRtjQ
WKQU0seoaKj04b68Bs/Qm9NozVgnsPFIUDZeLMikLFX2Jt7YSPu+Jmi2s4re6WLh
TmJQ3Ly9t3o3/weHSzmn
=Ox0s
-----END PGP SIGNATURE-----
Merge tag 'libnvdimm-for-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm updates from Dan Williams:
"The bulk of this has been in multiple -next releases. There were a few
late breaking fixes and small features that got added in the last
couple days, but the whole set has received a build success
notification from the kbuild robot.
Change summary:
- Region media error reporting: A libnvdimm region device is the
parent to one or more namespaces. To date, media errors have been
reported via the "badblocks" attribute attached to pmem block
devices for namespaces in "raw" or "memory" mode. Given that
namespaces can be in "device-dax" or "btt-sector" mode this new
interface reports media errors generically, i.e. independent of
namespace modes or state.
This subsequently allows userspace tooling to craft "ACPI 6.1
Section 9.20.7.6 Function Index 4 - Clear Uncorrectable Error"
requests and submit them via the ioctl path for NVDIMM root bus
devices.
- Introduce 'struct dax_device' and 'struct dax_operations': Prompted
by a request from Linus and feedback from Christoph this allows for
dax capable drivers to publish their own custom dax operations.
This fixes the broken assumption that all dax operations are
related to a persistent memory device, and makes it easier for
other architectures and platforms to add customized persistent
memory support.
- 'libnvdimm' core updates: A new "deep_flush" sysfs attribute is
available for storage appliance applications to manually trigger
memory controllers to drain write-pending buffers that would
otherwise be flushed automatically by the platform ADR
(asynchronous-DRAM-refresh) mechanism at a power loss event.
Support for "locked" DIMMs is included to prevent namespaces from
surfacing when the namespace label data area is locked. Finally,
fixes for various reported deadlocks and crashes, also tagged for
-stable.
- ACPI / nfit driver updates: General updates of the nfit driver to
add DSM command overrides, ACPI 6.1 health state flags support, DSM
payload debug available by default, and various fixes.
Acknowledgements that came after the branch was pushed:
- commmit 565851c972 "device-dax: fix sysfs attribute deadlock":
Tested-by: Yi Zhang <yizhan@redhat.com>
- commit 23f4984483 "libnvdimm: rework region badblocks clearing"
Tested-by: Toshi Kani <toshi.kani@hpe.com>"
* tag 'libnvdimm-for-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (52 commits)
libnvdimm, pfn: fix 'npfns' vs section alignment
libnvdimm: handle locked label storage areas
libnvdimm: convert NDD_ flags to use bitops, introduce NDD_LOCKED
brd: fix uninitialized use of brd->dax_dev
block, dax: use correct format string in bdev_dax_supported
device-dax: fix sysfs attribute deadlock
libnvdimm: restore "libnvdimm: band aid btt vs clear poison locking"
libnvdimm: fix nvdimm_bus_lock() vs device_lock() ordering
libnvdimm: rework region badblocks clearing
acpi, nfit: kill ACPI_NFIT_DEBUG
libnvdimm: fix clear length of nvdimm_forget_poison()
libnvdimm, pmem: fix a NULL pointer BUG in nd_pmem_notify
libnvdimm, region: sysfs trigger for nvdimm_flush()
libnvdimm: fix phys_addr for nvdimm_clear_poison
x86, dax, pmem: remove indirection around memcpy_from_pmem()
block: remove block_device_operations ->direct_access()
block, dax: convert bdev_dax_supported() to dax_direct_access()
filesystem-dax: convert to dax_direct_access()
Revert "block: use DAX for partition table reads"
ext2, ext4, xfs: retrieve dax_device for iomap operations
...
1. Andreas Gruenbacher wrote a patch to replace the deprecated
call to rhashtable_walk_init with rhashtable_walk_enter.
2. Andreas also wrote a patch to eliminate redundant code in
two of our debugfs sequence files.
3. Andreas also cleaned up the rhashtable key ugliness Linus
pointed out during this cycle, following Linus's suggestions.
4. Andreas also wrote a patch to take advantage of his new
function rhashtable_lookup_get_insert_fast. This makes glock
lookup faster and more bullet-proof.
5. Andreas also wrote a patch to revert a patch in the evict
path that caused occasional deadlocks, and is no longer
needed.
6. Andrew Price wrote a patch to re-enable fallocate for the
rindex system file to enable gfs2_grow to grow properly on
secondary file system grow operations.
7. I wrote a patch to initialize an inode number field to make
certain kernel trace points more understandable.
8. I also wrote a patch that makes GFS2 file system "withdraw"
work more like it should by ignoring operations after a
withdraw that would formerly cause a BUG() and kernel panic.
9. I also reworked the entire truncate/delete algorithm,
scrapping the old recursive algorithm in favor of a new
non-recursive algorithm. This was done for performance:
This way, GFS2 no longer needs to lock multiple resource
groups while doing truncates and deletes of files that cross
multiple resource group boundaries, allowing for better
parallelism. It also solves a problem whereby deleting large
files would request a large chunk of kernel memory, which
resulted in a get_page_from_freelist warning.
10. Due to a regression found during testing, I added a new
patch to correct "GFS2: Prevent BUG from occurring when
normal Withdraws occur".
-----BEGIN PGP SIGNATURE-----
iQEcBAABAgAGBQJZDNnaAAoJENeLYdPf93o7B7kIAJzwz7vVDVg2TpWVhMmXIWhf
rZx3Gth5F0h+ZHddW7HzTLg+64XQ5//GyDD3UDtCpkhl5SJH+nt3juHyPJlRwioT
0ua4SjyKLQSoJJVAEgAwu42QjORTXab7NjYn5LEhvRc0Gg/El9WGU+ZgmP2/aAvf
KE2u/IEYNDkoJNS3Oqc7shajAyLYda6wCAASs/1ZGt9u48m/o/I23Zd7wr7EOkzw
rd3gB0x80cJqDAB5IcymGOm111Tg4g34LwsRuyMnWE3H1jOgV+J515FVHEIvZuPq
Wl9X7V8CzktI7nyLKVnZhpuv5JzyMq/vOPiD01tTFx8Oy1JCRezjmATXFjW/zIo=
=MX3c
-----END PGP SIGNATURE-----
Merge tag 'gfs2-4.12.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2
Pull GFS2 updates from Bob Peterson:
"We've got ten GFS2 patches for this merge window.
- Andreas Gruenbacher wrote a patch to replace the deprecated call to
rhashtable_walk_init with rhashtable_walk_enter.
- Andreas also wrote a patch to eliminate redundant code in two of
our debugfs sequence files.
- Andreas also cleaned up the rhashtable key ugliness Linus pointed
out during this cycle, following Linus's suggestions.
- Andreas also wrote a patch to take advantage of his new function
rhashtable_lookup_get_insert_fast. This makes glock lookup faster
and more bullet-proof.
- Andreas also wrote a patch to revert a patch in the evict path that
caused occasional deadlocks, and is no longer needed.
- Andrew Price wrote a patch to re-enable fallocate for the rindex
system file to enable gfs2_grow to grow properly on secondary file
system grow operations.
- I wrote a patch to initialize an inode number field to make certain
kernel trace points more understandable.
- I also wrote a patch that makes GFS2 file system "withdraw" work
more like it should by ignoring operations after a withdraw that
would formerly cause a BUG() and kernel panic.
- I also reworked the entire truncate/delete algorithm, scrapping the
old recursive algorithm in favor of a new non-recursive algorithm.
This was done for performance: This way, GFS2 no longer needs to
lock multiple resource groups while doing truncates and deletes of
files that cross multiple resource group boundaries, allowing for
better parallelism. It also solves a problem whereby deleting large
files would request a large chunk of kernel memory, which resulted
in a get_page_from_freelist warning.
- Due to a regression found during testing, I added a new patch to
correct 'GFS2: Prevent BUG from occurring when normal Withdraws
occur'."
* tag 'gfs2-4.12.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
GFS2: Allow glocks to be unlocked after withdraw
GFS2: Non-recursive delete
gfs2: Re-enable fallocate for the rindex
Revert "GFS2: Wait for iopen glock dequeues"
gfs2: Switch to rhashtable_lookup_get_insert_fast
GFS2: Temporarily zero i_no_addr when creating a dinode
gfs2: Don't pack struct lm_lockname
gfs2: Deduplicate gfs2_{glocks,glstats}_open
gfs2: Replace rhashtable_walk_init with rhashtable_walk_enter
GFS2: Prevent BUG from occurring when normal Withdraws occur
remove unused get_fsid_from_ino
fix bounds check for listxattr
clean up oversize xattr validation
do not set getattr_time on orangefs_lookup
return from orangefs_devreq_read quickly if possible
do not wait for timeout if umounting
handle zero size write in debugfs
Bug fixes:
do not check possibly stale size on truncate
ensure the userspace component is unmounted if mount fails
total reimplementation of dir.c
New feature:
implement statx
The new implementation of dir.c is kind of a big deal, all new
code. It has been posted to fs-devel during the previous rc period,
we didn't get much review or feedback from there, but it has been reviewed
very heavily here, so much so that we have two entire versions of the
reimplementation. Not only does the new implementation fix some
xfstests, but it passes all the new tests we made here that involve
seeking and rewinding and giant directories and long file names.
The new dir code has three patches itself:
skip forward to the next directory entry if seek is short
invalidate stored directory on seek
count directory pieces correctly
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJZDLasAAoJEM9EDqnrzg2+yR4P/jNsryNfQush5V/6EO+wpQ7p
O0epuLG42QMN67wdsQDVOOzcRQq2IAoYrgupZfEvCVsoBiYxdCTTwhN/55UctMBA
xnakv8BarrLd6pqSJOlQviP7ByXdEvy7dtYYuAEtdRnPtTZEmjDH0k9ME759+DVm
pPQ6fanPzSZuG8fdjI4QrKiFfpE5slMeyMV9SmzIq81S1i+t2b9sDYKTiP3Jt14y
KTweGdJXTRT+Piy27d80HN9ExlFXlcyru9GDWNhZi4EHlax7bq76Qwu1XKyaOg0h
MN40+18k+Zqrpj1/tq4aj3YM0P3HjpRhtb5TqOC+QhZDIL1gJ8bv8rv61snWTak+
6cXtwvIh7r4aEU+gkMLP29HXCVlGg3V4up+DdbHJVbIEXV8C5csJBP+sQUlU7A5D
WoPmheV7CJ8nicwkxYm31dhdnW7mOwW/J4uUlM9w/yU/dVfoz1SK8AtKjy0xX87c
Jpo7nuJEDprI+9neT0y5U+RHVqH08+cA5DCrdk0x8JaJIrjOZpvTROIPrtzlS7QL
aTu+W/ISXtFwnM+ERmw8TKPD7TTUXypydYhzXe8V6itDpiNp1kQFGmLGzLhAMElH
iGQkFatR6LSKh+DxUD3PREQGNyQCKpgPiqLoGYprzQ829tqLpThumfZic9lX1C/+
we5VEpRbiz6BjN110DBJ
=NGTt
-----END PGP SIGNATURE-----
Merge tag 'for-linus-4.12-ofs-1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux
Pull orangefs updates from Mike Marshall:
"Orangefs cleanups, fixes and statx support.
Some cleanups:
- remove unused get_fsid_from_ino
- fix bounds check for listxattr
- clean up oversize xattr validation
- do not set getattr_time on orangefs_lookup
- return from orangefs_devreq_read quickly if possible
- do not wait for timeout if umounting
- handle zero size write in debugfs
Bug fixes:
- do not check possibly stale size on truncate
- ensure the userspace component is unmounted if mount fails
- total reimplementation of dir.c
New feature:
- implement statx
The new implementation of dir.c is kind of a big deal, all new code.
It has been posted to fs-devel during the previous rc period, we
didn't get much review or feedback from there, but it has been
reviewed very heavily here, so much so that we have two entire
versions of the reimplementation.
Not only does the new implementation fix some xfstests, but it passes
all the new tests we made here that involve seeking and rewinding and
giant directories and long file names. The new dir code has three
patches itself:
- skip forward to the next directory entry if seek is short
- invalidate stored directory on seek
- count directory pieces correctly"
* tag 'for-linus-4.12-ofs-1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux:
orangefs: count directory pieces correctly
orangefs: invalidate stored directory on seek
orangefs: skip forward to the next directory entry if seek is short
orangefs: handle zero size write in debugfs
orangefs: do not wait for timeout if umounting
orangefs: return from orangefs_devreq_read quickly if possible
orangefs: ensure the userspace component is unmounted if mount fails
orangefs: do not check possibly stale size on truncate
orangefs: implement statx
orangefs: remove ORANGEFS_READDIR macros
orangefs: support very large directories
orangefs: support llseek on directories
orangefs: rewrite readdir to fix several bugs
orangefs: do not set getattr_time on orangefs_lookup
orangefs: clean up oversize xattr validation
orangefs: fix bounds check for listxattr
orangefs: remove unused get_fsid_from_ino
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJZDFZyAAoJEGu/nxmHO1GNpz0IAINPEyXe9zAc/K74u5mIUPKT
MqK/ifAYdOmGDu9kB68tXFQ5o3GNmAjWI4P8/T6oGlK9IudChrwTBY9Gss7iaawc
+sNu71NmnyxbWHb7w71kIdhwNiHWolgZva1Ex9yaQYqRAy/JapCke9gs5TiruM4j
zObaZnw48RwVyvU/Xixoz0hOLDGkPltOdy3tkWmy9v8sg/jSf+HF1FpAIfyO4pm+
Kf2YR9IEkHhHwhoVEbHeSOjH/Tgb8gO8Suh4OnPRAP3gnVLWhb5Deh7Pjlgoj8Gn
am2KFSkpShwvNG+yXufEwS4p7ERNd4u3uk/IWhJTuw6sE08L+dFU4Rj+DdxR2eY=
=sENx
-----END PGP SIGNATURE-----
Merge tag 'befs-v4.12-rc1' of git://github.com/luisbg/linux-befs
Pull befs fix from Luis de Bethencourt:
"One fix from Fabian Frederick making the nfs client still work after a
cache drop"
* tag 'befs-v4.12-rc1' of git://github.com/luisbg/linux-befs:
befs: make export work with cold dcache
Process RENAME_EXCHANGE in affs_rename2() adding static
affs_xrename() based on affs_rename().
We remove headers from respective directories then
affect bh to other inode directory entries for swapping.
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Currently AFFS only supports RENAME_NOREPLACE.
This patch isolates that method to a static function to
prepare RENAME_EXCHANGE addition.
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This bug fixes a regression introduced by patch 0d1c7ae9d8.
The intent of the patch was to stop promoting glocks after a
file system is withdrawn due to a variety of errors, because doing
so results in a BUG(). (You should be able to unmount after a
withdraw rather than having the kernel panic.)
Unfortunately, it also stopped demotions, so glocks could not be
unlocked after withdraw, which means the unmount would hang.
This patch allows function do_xmote to demote locks to an
unlocked state after a withdraw, but not promote them.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Pull namespace updates from Eric Biederman:
"This is a set of small fixes that were mostly stumbled over during
more significant development. This proc fix and the fix to
posix-timers are the most significant of the lot.
There is a lot of good development going on but unfortunately it
didn't quite make the merge window"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
proc: Fix unbalanced hard link numbers
signal: Make kill_proc_info static
rlimit: Properly call security_task_setrlimit
signal: Remove unused definition of sig_user_definied
ia64: Remove unused IA64_TASK_SIGHAND_OFFSET and IA64_SIGHAND_SIGLOCK_OFFSET
ipc: Remove unused declaration of recompute_msgmni
posix-timers: Correct sanity check in posix_cpu_nsleep
sysctl: Remove dead register_sysctl_root
This patch removes useless nfs_readdir_get_array() and
nfs_readdir_release_array() as suggested by Trond Myklebust
nfs_readdir() calls nfs_revalidate_mapping() before
readdir_search_pagecache() , nfs_do_filldir(), uncached_readdir()
so mapping should be correct.
While kmap() can't fail, all subsequent error checks were removed
as well as unused labels.
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
When using 'ls -l' to display a large directory, if noac option is used,
in function nfs_getattr() nfs_need_revalidate_inode() will always be true
for NFSv3 and the nfs_entry cache of the directory will be flushed. The
flush will lead to a fully reread of the directory entries from server.
To prevent the unnecessary RPCs, we need to check whether or not the
noac option is used, and always report the invocation of nfs_getattr()
as cache hit instead cache miss when it's on.
Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
If memory allocation fails for the callback data, we need to put the nfs_client
or we end up with an elevated refcount.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
If the server returns NFS4ERR_CONN_NOT_BOUND_TO_SESSION because we
are trunking, then RECLAIM_COMPLETE must handle that by calling
nfs4_schedule_session_recovery() and then retrying.
Reported-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Tested-by: Chuck Lever <chuck.lever@oracle.com>
SFM is mapping doublequote to 0xF020
Without this patch creating files with doublequote fails to Windows/Mac
Signed-off-by: Bjoern Jacke <bjacke@samba.org>
Signed-off-by: Steve French <smfrench@gmail.com>
CC: stable <stable@vger.kernel.org>
based on commit b3b42c0dea
("fs/affs: make export work with cold dcache")
This adds get_parent function so that nfs client can still work after
cache drop (Tested on NFS v4 with echo 3 > /proc/sys/vm/drop_caches)
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com>
An upper type non directory dentry that is a copy up target
should have a reference to its lower copy up origin.
There are three ways for an upper type dentry to be instantiated:
1. A lower type dentry that is being copied up
2. An entry that is found in upper dir by ovl_lookup()
3. A negative dentry is hardlinked to an upper type dentry
In the first case, the lower reference is set before copy up.
In the second case, the lower reference is found by ovl_lookup().
In the last case of hardlinked upper dentry, it is not easy to
update the lower reference of the negative dentry. Instead,
drop the newly hardlinked negative dentry from dcache and let
the next access call ovl_lookup() to find its lower reference.
This makes sure that the inode number reported by stat(2) after
the hardlink is created is the same inode number that will be
reported by stat(2) after mount cycle, which is the inode number
of the lower copy up origin of the hardlink source.
NOTE that this does not fix breaking of lower hardlinks on copy
up, but only fixes the case of lower nlink == 1, whose upper copy
up inode is hardlinked in upper dir.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
When all layers are on the same underlying filesystem, let stat(2) return
st_dev/st_ino values of the copy up origin inode if it is known.
This results in constant st_ino/st_dev representation of files in an
overlay mount before and after copy up.
When the underlying filesystem support NFS exportfs, the result is also
persistent st_ino/st_dev representation before and after mount cycle.
Lower hardlinks are broken on copy up to different upper files, so we
cannot use the lower origin st_ino for those different files, even for the
same fs case.
When all overlay layers are on the same fs, use overlay st_dev for non-dirs
to get the correct result from du -x.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
stat(2) on overlay directories reports the overlay temp inode
number, which is constant across copy up, but is not persistent.
When all layers are on the same fs, report the copy up origin inode
number for directories.
This inode number is persistent, unique across the overlay mount and
constant across copy up.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
For directory entries, non zero oe->numlower implies OVL_TYPE_MERGE.
Define a new type flag OVL_TYPE_ORIGIN to indicate that an entry holds a
reference to its lower copy up origin.
For directory entries ORIGIN := MERGE && UPPER. For non-dir entries ORIGIN
means that a lower type dentry has been recently copied up or that we were
able to find the copy up origin from overlay.origin xattr.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
If overlay.origin xattr is found on a non-dir upper inode try to get lower
dentry by calling exportfs_decode_fh().
On failure to lookup by file handle to lower layer, do not lookup the copy
up origin by name, because the lower found by name could be another file in
case the upper file was renamed.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Sometimes it is interesting to know if an upper file is pure upper or a
copy up target, and if it is a copy up target, it may be interesting to
find the copy up origin.
This will be used to preserve lower inode numbers across copy up.
Store the lower inode file handle in upper inode extended attribute
overlay.origin on copy up to use it later for these cases. Store the lower
filesystem uuid along side the file handle, so we can validate that we are
looking for the origin file in the original fs.
If lower fs does not support NFS export ops store a zero sized xattr so we
can always use the overlay.origin xattr to distinguish between a copy up
and a pure upper inode.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Some features can only work when all layers are on the same fs. Test this
condition during mount time, so features can check them later.
Add helper ovl_same_sb() to return the common super block in case all
layers are on the same fs.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Here is the big set of new char/misc driver drivers and features for
4.12-rc1.
There's lots of new drivers added this time around, new firmware drivers
from Google, more auxdisplay drivers, extcon drivers, fpga drivers, and
a bunch of other driver updates. Nothing major, except if you happen to
have the hardware for these drivers, and then you will be happy :)
All of these have been in linux-next for a while with no reported
issues.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCWQvAgg8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+yknsACgzkAeyz16Z97J3UTaeejbR7nKUCAAoKY4WEHY
8O9f9pr9gj8GMBwxeZQa
=OIfB
-----END PGP SIGNATURE-----
Merge tag 'char-misc-4.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc
Pull char/misc driver updates from Greg KH:
"Here is the big set of new char/misc driver drivers and features for
4.12-rc1.
There's lots of new drivers added this time around, new firmware
drivers from Google, more auxdisplay drivers, extcon drivers, fpga
drivers, and a bunch of other driver updates. Nothing major, except if
you happen to have the hardware for these drivers, and then you will
be happy :)
All of these have been in linux-next for a while with no reported
issues"
* tag 'char-misc-4.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (136 commits)
firmware: google memconsole: Fix return value check in platform_memconsole_init()
firmware: Google VPD: Fix return value check in vpd_platform_init()
goldfish_pipe: fix build warning about using too much stack.
goldfish_pipe: An implementation of more parallel pipe
fpga fr br: update supported version numbers
fpga: region: release FPGA region reference in error path
fpga altera-hps2fpga: disable/unprepare clock on error in alt_fpga_bridge_probe()
mei: drop the TODO from samples
firmware: Google VPD sysfs driver
firmware: Google VPD: import lib_vpd source files
misc: lkdtm: Add volatile to intentional NULL pointer reference
eeprom: idt_89hpesx: Add OF device ID table
misc: ds1682: Add OF device ID table
misc: tsl2550: Add OF device ID table
w1: Remove unneeded use of assert() and remove w1_log.h
w1: Use kernel common min() implementation
uio_mf624: Align memory regions to page size and set correct offsets
uio_mf624: Refactor memory info initialization
uio: Allow handling of non page-aligned memory regions
hangcheck-timer: Fix typo in comment
...
Commits cc8385b59e and 7ef70b4d99 added preallocation for the
reada radix trees and also switched them over to GFP_KERNEL for the
default gfp mask.
Since we're doing radix tree insertions under spinlocks, we need
to make sure the mask doesn't allow sleeping. This fix keeps
the radix preallocation but switches back to the original gfp_mask.
Reported-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
A large directory full of differently sized file names triggered this.
Most directories, even very large directories with shorter names, would
be lucky enough to fit in one server response.
Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
If an application seeks to a position before the point which has been
read, it must want updates which have been made to the directory. So
delete the copy stored in the kernel so it will be fetched again.
Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
If userspace seeks to a position in the stream which is not correct, it
would have returned EIO because the data in the buffer at that offset
would be incorrect. This and the userspace daemon returning a corrupt
directory are indistinguishable.
Now if the data does not look right, skip forward to the next chunk and
try again. The motivation is that if the directory changes, an
application may seek to a position that was valid and no longer is valid.
It is not yet possible for a directory to change.
Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
When ext4 encryption was originally merged, we were encrypting the
user-specified filename in ext4_match(), introducing a lot of additional
complexity into ext4_match() and its callers. This has since been
changed to encrypt the filename earlier, so we can remove the gunk
that's no longer needed. This more or less reverts ext4_search_dir()
and ext4_find_dest_de() to the way they were in the v4.0 kernel.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Switch f2fs directory searches to use the fscrypt_match_name() helper
function. There should be no functional change.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Acked-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Switch ext4 directory searches to use the fscrypt_match_name() helper
function. There should be no functional change.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Introduce a helper function fscrypt_match_name() which tests whether a
fscrypt_name matches a directory entry. Also clean up the magic numbers
and document things properly.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
When accessing an encrypted directory without the key, userspace must
operate on filenames derived from the ciphertext names, which contain
arbitrary bytes. Since we must support filenames as long as NAME_MAX,
we can't always just base64-encode the ciphertext, since that may make
it too long. Currently, this is solved by presenting long names in an
abbreviated form containing any needed filesystem-specific hashes (e.g.
to identify a directory block), then the last 16 bytes of ciphertext.
This needs to be sufficient to identify the actual name on lookup.
However, there is a bug. It seems to have been assumed that due to the
use of a CBC (ciphertext block chaining)-based encryption mode, the last
16 bytes (i.e. the AES block size) of ciphertext would depend on the
full plaintext, preventing collisions. However, we actually use CBC
with ciphertext stealing (CTS), which handles the last two blocks
specially, causing them to appear "flipped". Thus, it's actually the
second-to-last block which depends on the full plaintext.
This caused long filenames that differ only near the end of their
plaintexts to, when observed without the key, point to the wrong inode
and be undeletable. For example, with ext4:
# echo pass | e4crypt add_key -p 16 edir/
# seq -f "edir/abcdefghijklmnopqrstuvwxyz012345%.0f" 100000 | xargs touch
# find edir/ -type f | xargs stat -c %i | sort | uniq | wc -l
100000
# sync
# echo 3 > /proc/sys/vm/drop_caches
# keyctl new_session
# find edir/ -type f | xargs stat -c %i | sort | uniq | wc -l
2004
# rm -rf edir/
rm: cannot remove 'edir/_A7nNFi3rhkEQlJ6P,hdzluhODKOeWx5V': Structure needs cleaning
...
To fix this, when presenting long encrypted filenames, encode the
second-to-last block of ciphertext rather than the last 16 bytes.
Although it would be nice to solve this without depending on a specific
encryption mode, that would mean doing a cryptographic hash like SHA-256
which would be much less efficient. This way is sufficient for now, and
it's still compatible with encryption modes like HEH which are strong
pseudorandom permutations. Also, changing the presented names is still
allowed at any time because they are only provided to allow applications
to do things like delete encrypted directories. They're not designed to
be used to persistently identify files --- which would be hard to do
anyway, given that they're encrypted after all.
For ease of backports, this patch only makes the minimal fix to both
ext4 and f2fs. It leaves ubifs as-is, since ubifs doesn't compare the
ciphertext block yet. Follow-on patches will clean things up properly
and make the filesystems use a shared helper function.
Fixes: 5de0b4d0cd ("ext4 crypto: simplify and speed up filename encryption")
Reported-by: Gwendal Grignou <gwendal@chromium.org>
Cc: stable@vger.kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
If user has no key under an encrypted dir, fscrypt gives digested dentries.
Previously, when looking up a dentry, f2fs only checks its hash value with
first 4 bytes of the digested dentry, which didn't handle hash collisions fully.
This patch enhances to check entire dentry bytes likewise ext4.
Eric reported how to reproduce this issue by:
# seq -f "edir/abcdefghijklmnopqrstuvwxyz012345%.0f" 100000 | xargs touch
# find edir -type f | xargs stat -c %i | sort | uniq | wc -l
100000
# sync
# echo 3 > /proc/sys/vm/drop_caches
# keyctl new_session
# find edir -type f | xargs stat -c %i | sort | uniq | wc -l
99999
Cc: <stable@vger.kernel.org>
Reported-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
(fixed f2fs_dentry_hash() to work even when the hash is 0)
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
As ext4 and f2fs do, ubifs should check for consistent encryption
contexts during ->lookup() in an encrypted directory. This protects
certain users of filesystem encryption against certain types of offline
attacks.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
As for ext4, now that fscrypt_has_permitted_context() correctly handles
the case where we have the key for the parent directory but not the
child, f2fs_lookup() no longer has to work around it. Also add the same
warning message that ext4 uses.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Now that fscrypt_has_permitted_context() correctly handles the case
where we have the key for the parent directory but not the child, we
don't need to try to work around this in ext4_lookup().
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
To mitigate some types of offline attacks, filesystem encryption is
designed to enforce that all files in an encrypted directory tree use
the same encryption policy (i.e. the same encryption context excluding
the nonce). However, the fscrypt_has_permitted_context() function which
enforces this relies on comparing struct fscrypt_info's, which are only
available when we have the encryption keys. This can cause two
incorrect behaviors:
1. If we have the parent directory's key but not the child's key, or
vice versa, then fscrypt_has_permitted_context() returned false,
causing applications to see EPERM or ENOKEY. This is incorrect if
the encryption contexts are in fact consistent. Although we'd
normally have either both keys or neither key in that case since the
master_key_descriptors would be the same, this is not guaranteed
because keys can be added or removed from keyrings at any time.
2. If we have neither the parent's key nor the child's key, then
fscrypt_has_permitted_context() returned true, causing applications
to see no error (or else an error for some other reason). This is
incorrect if the encryption contexts are in fact inconsistent, since
in that case we should deny access.
To fix this, retrieve and compare the fscrypt_contexts if we are unable
to set up both fscrypt_infos.
While this slightly hurts performance when accessing an encrypted
directory tree without the key, this isn't a case we really need to be
optimizing for; access *with* the key is much more important.
Furthermore, the performance hit is barely noticeable given that we are
already retrieving the fscrypt_context and doing two keyring searches in
fscrypt_get_encryption_info(). If we ever actually wanted to optimize
this case we might start by caching the fscrypt_contexts.
Cc: stable@vger.kernel.org # 4.0+
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Currently jbd2_write_superblock() silently adds REQ_SYNC to flags with
which journal superblock is written. Make this explicit by making flags
passed down to jbd2_write_superblock() contain REQ_SYNC.
CC: linux-ext4@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Commit b685d3d65a "block: treat REQ_FUA and REQ_PREFLUSH as
synchronous" removed REQ_SYNC flag from WRITE_FUA implementation.
generic_make_request_checks() however strips REQ_FUA flag from a bio
when the storage doesn't report volatile write cache and thus write
effectively becomes asynchronous which can lead to performance
regressions. This affects superblock writes for ext4. Fix the problem
by marking superblock writes always as synchronous.
Fixes: b685d3d65a
CC: linux-ext4@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
In commit 0d3b12584972 "nfs: Convert to separately allocated bdi" I have
wrongly cloned bdi reference in nfs_clone_super(). Further inspection
has shown that originally the code was actually allocating a new bdi (in
->clone_server callback) which was later registered in
nfs_fs_mount_common() and used for sb->s_bdi in nfs_initialise_sb().
This could later result in bdi for the original superblock not getting
unregistered when that superblock got shutdown (as the cloned sb still
held bdi reference) and later when a new superblock was created under
the same anonymous device number, a clash in sysfs has happened on bdi
registration:
------------[ cut here ]------------
WARNING: CPU: 1 PID: 10284 at /linux-next/fs/sysfs/dir.c:31 sysfs_warn_dup+0x64/0x74
sysfs: cannot create duplicate filename '/devices/virtual/bdi/0:32'
Modules linked in: axp20x_usb_power gpio_axp209 nvmem_sunxi_sid sun4i_dma sun4i_ss virt_dma
CPU: 1 PID: 10284 Comm: mount.nfs Not tainted 4.11.0-rc4+ #14
Hardware name: Allwinner sun7i (A20) Family
[<c010f19c>] (unwind_backtrace) from [<c010bc74>] (show_stack+0x10/0x14)
[<c010bc74>] (show_stack) from [<c03c6e24>] (dump_stack+0x78/0x8c)
[<c03c6e24>] (dump_stack) from [<c0122200>] (__warn+0xe8/0x100)
[<c0122200>] (__warn) from [<c0122250>] (warn_slowpath_fmt+0x38/0x48)
[<c0122250>] (warn_slowpath_fmt) from [<c02ac178>] (sysfs_warn_dup+0x64/0x74)
[<c02ac178>] (sysfs_warn_dup) from [<c02ac254>] (sysfs_create_dir_ns+0x84/0x94)
[<c02ac254>] (sysfs_create_dir_ns) from [<c03c8b8c>] (kobject_add_internal+0x9c/0x2ec)
[<c03c8b8c>] (kobject_add_internal) from [<c03c8e24>] (kobject_add+0x48/0x98)
[<c03c8e24>] (kobject_add) from [<c048d75c>] (device_add+0xe4/0x5a0)
[<c048d75c>] (device_add) from [<c048ddb4>] (device_create_groups_vargs+0xac/0xbc)
[<c048ddb4>] (device_create_groups_vargs) from [<c048dde4>] (device_create_vargs+0x20/0x28)
[<c048dde4>] (device_create_vargs) from [<c02075c8>] (bdi_register_va+0x44/0xfc)
[<c02075c8>] (bdi_register_va) from [<c023d378>] (super_setup_bdi_name+0x48/0xa4)
[<c023d378>] (super_setup_bdi_name) from [<c0312ef4>] (nfs_fill_super+0x1a4/0x204)
[<c0312ef4>] (nfs_fill_super) from [<c03133f0>] (nfs_fs_mount_common+0x140/0x1e8)
[<c03133f0>] (nfs_fs_mount_common) from [<c03335cc>] (nfs4_remote_mount+0x50/0x58)
[<c03335cc>] (nfs4_remote_mount) from [<c023ef98>] (mount_fs+0x14/0xa4)
[<c023ef98>] (mount_fs) from [<c025cba0>] (vfs_kern_mount+0x54/0x128)
[<c025cba0>] (vfs_kern_mount) from [<c033352c>] (nfs_do_root_mount+0x80/0xa0)
[<c033352c>] (nfs_do_root_mount) from [<c0333818>] (nfs4_try_mount+0x28/0x3c)
[<c0333818>] (nfs4_try_mount) from [<c0313874>] (nfs_fs_mount+0x2cc/0x8c4)
[<c0313874>] (nfs_fs_mount) from [<c023ef98>] (mount_fs+0x14/0xa4)
[<c023ef98>] (mount_fs) from [<c025cba0>] (vfs_kern_mount+0x54/0x128)
[<c025cba0>] (vfs_kern_mount) from [<c02600f0>] (do_mount+0x158/0xc7c)
[<c02600f0>] (do_mount) from [<c0260f98>] (SyS_mount+0x8c/0xb4)
[<c0260f98>] (SyS_mount) from [<c0107840>] (ret_fast_syscall+0x0/0x3c)
Fix the problem by always creating new bdi for a superblock as we used
to do.
Reported-and-tested-by: Corentin Labbe <clabbe.montjoie@gmail.com>
Fixes: 0d3b12584972ce5781179ad3f15cca3cdb5cae05
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
The file open flags (O_foo) are platform specific and should never go
out to an interface that is not local to the system.
Unfortunately these flags have leaked out onto the wire in the cephfs
implementation. That lead to bogus flags getting transmitted on ppc64.
This patch converts the kernel view of flags to the ceph view of file
open flags.
Fixes: 124e68e74 ("ceph: file operations")
Signed-off-by: Alexander Graf <agraf@suse.de>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
The dirfragtree is lazily updated, it's not always accurate. Infinite
loops happens in following circumstance.
- client send request to read frag A
- frag A has been fragmented into frag B and C. So mds fills the reply
with contents of frag B
- client wants to read next frag C. ceph_choose_frag(frag value of C)
return frag A.
The fix is using previous readdir reply to calculate next readdir frag
when possible.
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Currently, we don't have a real feedback mechanism in place for when we
start seeing buffered writeback errors. If writeback is failing, there
is nothing that prevents an application from continuing to dirty pages
that aren't being cleaned.
In the event that we're seeing write errors of any sort occur on an
inode, have the callback set a flag to force further writes to be
synchronous. When the next write succeeds, clear the flag to allow
buffered writeback to continue.
Since this is just a hint to the write submission mechanism, we only
take the i_ceph_lock when a lockless check shows that the flag needs to
be changed.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: "Yan, Zheng” <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
This reverts commit b109eec6f4.
If I'm filling up a filesystem with this sort of command:
$ dd if=/dev/urandom of=/mnt/cephfs/fillfile bs=2M oflag=sync
...then I'll eventually get back EIO on a write. Further calls
will give us ENOSPC.
I'm not sure what prompted this change, but I don't think it's what we
want to do. If writepages failed, we will have already set the mapping
error appropriately, and that's what gets reported by fsync() or
close().
__filemap_fdatawait_range however, does this:
wait_on_page_writeback(page);
if (TestClearPageError(page))
ret = -EIO;
...and that -EIO ends up trumping the mapping's error if one exists.
When writepages fails, we only want to set the error in the mapping,
and not flag the individual pages.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: "Yan, Zheng” <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Have the client store and update the osdc epoch_barrier when a cap
message comes in with one.
When sending cap messages, send the epoch barrier as well. This allows
clients to inform servers that their released caps may not be used until
a particular OSD map epoch.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: "Yan, Zheng” <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Usually, when the osd map is flagged as full or the pool is at quota,
write requests just hang. This is not what we want for cephfs, where
it would be better to simply report -ENOSPC back to userland instead
of stalling.
If the caller knows that it will want an immediate error return instead
of blocking on a full or at-quota error condition then allow it to set a
flag to request that behavior.
Set that flag in ceph_osdc_new_request (since ceph.ko is the only caller),
and on any other write request from ceph.ko.
A later patch will deal with requests that were submitted before the new
map showing the full condition came in.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Current cephfs client uses string to indicate start position of
readdir. The string is last entry of previous readdir reply.
This approach does not work for seeky readdir because we can
not easily convert the new postion to a string. For seeky readdir,
mds needs to return dentries from the beginning. Client keeps
retrying if the reply does not contain the dentry it wants.
In current version of ceph, mds sorts CDentry in its cache in
hash order. Client also uses dentry hash to compose dir postion.
For seeky readdir, if client passes the hash part of dir postion
to mds. mds can avoid replying useless dentries.
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
If a mds has stopped, close its session and clean up its session
requests/caps. The process is similar to handling SESSION_CLOSE
initiated by mds.
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
__unregister_session() free the session if it drops the last
reference. We should grab an extra reference if we want to use
session after __unregister_session().
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
mdsmap::m_max_mds is the expected count of active mds. It's not the
max rank of active mds. User can decrease mdsmap::m_max_mds, but does
not stop mds whose rank >= mdsmap::m_max_mds.
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
No reason to hide CephFS-specific features in the rbd case. Recent
feature bits mix RADOS and CephFS-specific stuff together anyway.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Macs send the maximum buffer size in response on ioctl to validate
negotiate security information, which causes us to fail the mount
as the response buffer is larger than the expected response.
Changed ioctl response processing to allow for padding of validate
negotiate ioctl response and limit the maximum response size to
maximum buffer size.
Signed-off-by: Steve French <steve.french@primarydata.com>
CC: Stable <stable@vger.kernel.org>
-write_checkpoint
-do_checkpoint
-next_free_nid <--- something wrong with next free nid
-f2fs_fill_super
-build_node_manager
-build_free_nids
-get_current_nat_page
-__get_meta_page <--- attempt to access beyond end of device
Signed-off-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Merge misc updates from Andrew Morton:
- a few misc things
- most of MM
- KASAN updates
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (102 commits)
kasan: separate report parts by empty lines
kasan: improve double-free report format
kasan: print page description after stacks
kasan: improve slab object description
kasan: change report header
kasan: simplify address description logic
kasan: change allocation and freeing stack traces headers
kasan: unify report headers
kasan: introduce helper functions for determining bug type
mm: hwpoison: call shake_page() after try_to_unmap() for mlocked page
mm: hwpoison: call shake_page() unconditionally
mm/swapfile.c: fix swap space leak in error path of swap_free_entries()
mm/gup.c: fix access_ok() argument type
mm/truncate: avoid pointless cleancache_invalidate_inode() calls.
mm/truncate: bail out early from invalidate_inode_pages2_range() if mapping is empty
fs/block_dev: always invalidate cleancache in invalidate_bdev()
fs: fix data invalidation in the cleancache during direct IO
zram: reduce load operation in page_same_filled
zram: use zram_free_page instead of open-coded
zram: introduce zram data accessor
...
An open directory may have a NULL private_data pointer prior to readdir.
Fixes: 0de1f4c6f6 ("Add way to query server fs info for smb3")
Cc: stable@vger.kernel.org
Signed-off-by: David Disseldorp <ddiss@suse.de>
Signed-off-by: Steve French <smfrench@gmail.com>
- trailing space maps to 0xF028
- trailing period maps to 0xF029
This fix corrects the mapping of file names which have a trailing character
that would otherwise be illegal (period or space) but is allowed by POSIX.
Signed-off-by: Bjoern Jacke <bjacke@samba.org>
CC: Stable <stable@vger.kernel.org>
Signed-off-by: Steve French <smfrench@gmail.com>
invalidate_bdev() calls cleancache_invalidate_inode() iff ->nrpages != 0
which doen't make any sense.
Make sure that invalidate_bdev() always calls cleancache_invalidate_inode()
regardless of mapping->nrpages value.
Fixes: c515e1fd36 ("mm/fs: add hooks to support cleancache")
Link: http://lkml.kernel.org/r/20170424164135.22350-3-aryabinin@virtuozzo.com
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Alexey Kuznetsov <kuznet@virtuozzo.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Nikolay Borisov <n.borisov.lkml@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "Properly invalidate data in the cleancache", v2.
We've noticed that after direct IO write, buffered read sometimes gets
stale data which is coming from the cleancache. The reason for this is
that some direct write hooks call call invalidate_inode_pages2[_range]()
conditionally iff mapping->nrpages is not zero, so we may not invalidate
data in the cleancache.
Another odd thing is that we check only for ->nrpages and don't check
for ->nrexceptional, but invalidate_inode_pages2[_range] also
invalidates exceptional entries as well. So we invalidate exceptional
entries only if ->nrpages != 0? This doesn't feel right.
- Patch 1 fixes direct IO writes by removing ->nrpages check.
- Patch 2 fixes similar case in invalidate_bdev().
Note: I only fixed conditional cleancache_invalidate_inode() here.
Do we also need to add ->nrexceptional check in into invalidate_bdev()?
- Patches 3-4: some optimizations.
This patch (of 4):
Some direct IO write fs hooks call invalidate_inode_pages2[_range]()
conditionally iff mapping->nrpages is not zero. This can't be right,
because invalidate_inode_pages2[_range]() also invalidate data in the
cleancache via cleancache_invalidate_inode() call. So if page cache is
empty but there is some data in the cleancache, buffered read after
direct IO write would get stale data from the cleancache.
Also it doesn't feel right to check only for ->nrpages because
invalidate_inode_pages2[_range] invalidates exceptional entries as well.
Fix this by calling invalidate_inode_pages2[_range]() regardless of
nrpages state.
Note: nfs,cifs,9p doesn't need similar fix because the never call
cleancache_get_page() (nor directly, nor via mpage_readpage[s]()), so
they are not affected by this bug.
Fixes: c515e1fd36 ("mm/fs: add hooks to support cleancache")
Link: http://lkml.kernel.org/r/20170424164135.22350-2-aryabinin@virtuozzo.com
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Alexey Kuznetsov <kuznet@virtuozzo.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Nikolay Borisov <n.borisov.lkml@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
kjournald2 is central to the transaction commit processing. As such any
potential allocation from this kernel thread has to be GFP_NOFS. Make
sure to mark the whole kernel thread GFP_NOFS by the memalloc_nofs_save.
[akpm@linux-foundation.org: coding-style fixes]
Link: http://lkml.kernel.org/r/20170306131408.9828-8-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Suggested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Chris Mason <clm@fb.com>
Cc: David Sterba <dsterba@suse.cz>
Cc: Brian Foster <bfoster@redhat.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Nikolay Borisov <nborisov@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
now that we have memalloc_nofs_{save,restore} api we can mark the whole
transaction context as implicitly GFP_NOFS. All allocations will
automatically inherit GFP_NOFS this way. This means that we do not have
to mark any of those requests with GFP_NOFS and moreover all the
ext4_kv[mz]alloc(GFP_NOFS) are also safe now because even the hardcoded
GFP_KERNEL allocations deep inside the vmalloc will be NOFS now.
[akpm@linux-foundation.org: tweak comments]
Link: http://lkml.kernel.org/r/20170306131408.9828-7-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Chris Mason <clm@fb.com>
Cc: David Sterba <dsterba@suse.cz>
Cc: Brian Foster <bfoster@redhat.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Nikolay Borisov <nborisov@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
kmem_zalloc_large and _xfs_buf_map_pages use memalloc_noio_{save,restore}
API to prevent from reclaim recursion into the fs because vmalloc can
invoke unconditional GFP_KERNEL allocations and these functions might be
called from the NOFS contexts. The memalloc_noio_save will enforce
GFP_NOIO context which is even weaker than GFP_NOFS and that seems to be
unnecessary. Let's use memalloc_nofs_{save,restore} instead as it
should provide exactly what we need here - implicit GFP_NOFS context.
Link: http://lkml.kernel.org/r/20170306131408.9828-6-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Chris Mason <clm@fb.com>
Cc: David Sterba <dsterba@suse.cz>
Cc: Jan Kara <jack@suse.cz>
Cc: Nikolay Borisov <nborisov@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
GFP_NOFS context is used for the following 5 reasons currently:
- to prevent from deadlocks when the lock held by the allocation
context would be needed during the memory reclaim
- to prevent from stack overflows during the reclaim because the
allocation is performed from a deep context already
- to prevent lockups when the allocation context depends on other
reclaimers to make a forward progress indirectly
- just in case because this would be safe from the fs POV
- silence lockdep false positives
Unfortunately overuse of this allocation context brings some problems to
the MM. Memory reclaim is much weaker (especially during heavy FS
metadata workloads), OOM killer cannot be invoked because the MM layer
doesn't have enough information about how much memory is freeable by the
FS layer.
In many cases it is far from clear why the weaker context is even used
and so it might be used unnecessarily. We would like to get rid of
those as much as possible. One way to do that is to use the flag in
scopes rather than isolated cases. Such a scope is declared when really
necessary, tracked per task and all the allocation requests from within
the context will simply inherit the GFP_NOFS semantic.
Not only this is easier to understand and maintain because there are
much less problematic contexts than specific allocation requests, this
also helps code paths where FS layer interacts with other layers (e.g.
crypto, security modules, MM etc...) and there is no easy way to convey
the allocation context between the layers.
Introduce memalloc_nofs_{save,restore} API to control the scope of
GFP_NOFS allocation context. This is basically copying
memalloc_noio_{save,restore} API we have for other restricted allocation
context GFP_NOIO. The PF_MEMALLOC_NOFS flag already exists and it is
just an alias for PF_FSTRANS which has been xfs specific until recently.
There are no more PF_FSTRANS users anymore so let's just drop it.
PF_MEMALLOC_NOFS is now checked in the MM layer and drops __GFP_FS
implicitly same as PF_MEMALLOC_NOIO drops __GFP_IO. memalloc_noio_flags
is renamed to current_gfp_context because it now cares about both
PF_MEMALLOC_NOFS and PF_MEMALLOC_NOIO contexts. Xfs code paths preserve
their semantic. kmem_flags_convert() doesn't need to evaluate the flag
anymore.
This patch shouldn't introduce any functional changes.
Let's hope that filesystems will drop direct GFP_NOFS (resp. ~__GFP_FS)
usage as much as possible and only use a properly documented
memalloc_nofs_{save,restore} checkpoints where they are appropriate.
[akpm@linux-foundation.org: fix comment typo, reflow comment]
Link: http://lkml.kernel.org/r/20170306131408.9828-5-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Chris Mason <clm@fb.com>
Cc: David Sterba <dsterba@suse.cz>
Cc: Jan Kara <jack@suse.cz>
Cc: Brian Foster <bfoster@redhat.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Nikolay Borisov <nborisov@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
xfs has defined PF_FSTRANS to declare a scope GFP_NOFS semantic quite
some time ago. We would like to make this concept more generic and use
it for other filesystems as well. Let's start by giving the flag a more
generic name PF_MEMALLOC_NOFS which is in line with an exiting
PF_MEMALLOC_NOIO already used for the same purpose for GFP_NOIO
contexts. Replace all PF_FSTRANS usage from the xfs code in the first
step before we introduce a full API for it as xfs uses the flag directly
anyway.
This patch doesn't introduce any functional change.
Link: http://lkml.kernel.org/r/20170306131408.9828-4-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Chris Mason <clm@fb.com>
Cc: David Sterba <dsterba@suse.cz>
Cc: Jan Kara <jack@suse.cz>
Cc: Nikolay Borisov <nborisov@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Show MADV_FREE pages info of each vma in smaps. The interface is for
diganose or monitoring purpose, userspace could use it to understand
what happens in the application. Since userspace could dirty MADV_FREE
pages without notice from kernel, this interface is the only place we
can get accurate accounting info about MADV_FREE pages.
[mhocko@kernel.org: update Documentation/filesystems/proc.txt]
Link: http://lkml.kernel.org/r/89efde633559de1ec07444f2ef0f4963a97a2ce8.1487965799.git.shli@fb.com
Signed-off-by: Shaohua Li <shli@fb.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Minchan Kim <minchan@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Configfs is the interface for ocfs2-tools to set configure to kernel and
$configfs_dir/cluster/$clustername/heartbeat/dead_threshold is the one
used to configure heartbeat dead threshold. Kernel has a default value
of it but user can set O2CB_HEARTBEAT_THRESHOLD in /etc/sysconfig/o2cb
to override it.
Commit 45b997737a ("ocfs2/cluster: use per-attribute show and store
methods") changed heartbeat dead threshold name while ocfs2-tools did
not, so ocfs2-tools won't set this configurable and the default value is
always used. So revert it.
Fixes: 45b997737a ("ocfs2/cluster: use per-attribute show and store methods")
Link: http://lkml.kernel.org/r/1490665245-15374-1-git-send-email-junxiao.bi@oracle.com
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Acked-by: Joseph Qi <jiangqi903@gmail.com>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch expands cover region of inode->i_rwsem to keep setting flag
atomically.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Commit b685d3d65a "block: treat REQ_FUA and REQ_PREFLUSH as
synchronous" removed REQ_SYNC flag from WRITE_{FUA|PREFLUSH|...}
definitions. generic_make_request_checks() however strips REQ_FUA and
REQ_PREFLUSH flags from a bio when the storage doesn't report volatile
write cache and thus write effectively becomes asynchronous which can
lead to performance regressions.
Fix the problem by making sure all bios which are synchronous are
properly marked with REQ_SYNC.
Fixes: b685d3d65a
Cc: stable@vger.kernel.org # 4.9+
CC: Jaegeuk Kim <jaegeuk@kernel.org>
CC: linux-f2fs-devel@lists.sourceforge.net
Signed-off-by: Jan Kara <jack@suse.cz>
Acked-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In xfs_reflink_end_cow, we erroneously reserve only enough blocks to
handle adding 1 extent. This is problematic if we fragment free space,
have to do CoW, and then have to perform multiple bmap btree expansions.
Furthermore, the BUI recovery routine doesn't reserve /any/ blocks to
handle btree splits, so log recovery fails after our first error causes
the filesystem to go down.
Therefore, refactor the transaction block reservation macros until we
have a macro that works for our deferred (re)mapping activities, and fix
both problems by using that macro.
With 1k blocks we can hit this fairly often in g/187 if the scratch fs
is big enough.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Pull quota, reiserfs, udf and ext2 updates from Jan Kara:
"The branch contains changes to quota code so that it does not modify
persistent flags in inode->i_flags (it was the only place in kernel
doing that) and handle it inside filesystem's quotaon/off handlers
instead.
The branch also contains two UDF cleanups, a couple of reiserfs fixes
and one fix for ext2 quota locking"
* 'generic' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
ext4: Improve comments in ext4_quota_{on|off}()
udf: use kmap_atomic for memcpy copying
udf: use octal for permissions
quota: Remove dquot_quotactl_ops
reiserfs: Remove i_attrs_to_sd_attrs()
reiserfs: Remove useless setting of i_flags
jfs: Remove jfs_get_inode_flags()
ext2: Remove ext2_get_inode_flags()
ext4: Remove ext4_get_inode_flags()
quota: Stop setting IMMUTABLE and NOATIME flags on quota files
jfs: Set flags on quota files directly
ext2: Set flags on quota files directly
reiserfs: Set flags on quota files directly
ext4: Set flags on quota files directly
reiserfs: Protect dquot_writeback_dquots() by s_umount semaphore
reiserfs: Make cancel_old_flush() reliable
ext2: Call dquot_writeback_dquots() with s_umount held
reiserfs: avoid a -Wmaybe-uninitialized warning
Pull fsnotify updates from Jan Kara:
"The branch contains mainly a rework of fsnotify infrastructure fixing
a shortcoming that we have waited for response to fanotify permission
events with SRCU read lock held and when the process consuming events
was slow to respond the kernel has stalled.
It also contains several cleanups of unnecessary indirections in
fsnotify framework and a bugfix from Amir fixing leakage of kernel
internal errno to userspace"
* 'fsnotify' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: (37 commits)
fanotify: don't expose EOPENSTALE to userspace
fsnotify: remove a stray unlock
fsnotify: Move ->free_mark callback to fsnotify_ops
fsnotify: Add group pointer in fsnotify_init_mark()
fsnotify: Drop inode_mark.c
fsnotify: Remove fsnotify_find_{inode|vfsmount}_mark()
fsnotify: Remove fsnotify_detach_group_marks()
fsnotify: Rename fsnotify_clear_marks_by_group_flags()
fsnotify: Inline fsnotify_clear_{inode|vfsmount}_mark_group()
fsnotify: Remove fsnotify_recalc_{inode|vfsmount}_mask()
fsnotify: Remove fsnotify_set_mark_{,ignored_}mask_locked()
fanotify: Release SRCU lock when waiting for userspace response
fsnotify: Pass fsnotify_iter_info into handle_event handler
fsnotify: Provide framework for dropping SRCU lock in ->handle_event
fsnotify: Remove special handling of mark destruction on group shutdown
fsnotify: Detach mark from object list when last reference is dropped
fsnotify: Move queueing of mark for destruction into fsnotify_put_mark()
inotify: Do not drop mark reference under idr_lock
fsnotify: Free fsnotify_mark_connector when there is no mark attached
fsnotify: Lock object list with connector lock
...
This patch flushes dirty nats in order to acquire available nids by writing
checkpoint. Otherwise, we can have no chance to get freed nids.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Introduce CP_TRIMMED_FLAG to indicate all invalid block were trimmed
before umount, so once we do mount with image which contain the flag,
we don't record invalid blocks as undiscard one, when fstrim is being
triggered, we can avoid issuing redundant discard commands.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Change to use different bits of cpc->reason to indicate different status,
so cpc->reason can indicate more than one reason.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
We don't need to rewrite the page under cp_rwsem and dnode locks.
Signed-off-by: Hou Pengyang <houpengyang@huawei.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Pull security subsystem updates from James Morris:
"Highlights:
IMA:
- provide ">" and "<" operators for fowner/uid/euid rules
KEYS:
- add a system blacklist keyring
- add KEYCTL_RESTRICT_KEYRING, exposes keyring link restriction
functionality to userland via keyctl()
LSM:
- harden LSM API with __ro_after_init
- add prlmit security hook, implement for SELinux
- revive security_task_alloc hook
TPM:
- implement contextual TPM command 'spaces'"
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (98 commits)
tpm: Fix reference count to main device
tpm_tis: convert to using locality callbacks
tpm: fix handling of the TPM 2.0 event logs
tpm_crb: remove a cruft constant
keys: select CONFIG_CRYPTO when selecting DH / KDF
apparmor: Make path_max parameter readonly
apparmor: fix parameters so that the permission test is bypassed at boot
apparmor: fix invalid reference to index variable of iterator line 836
apparmor: use SHASH_DESC_ON_STACK
security/apparmor/lsm.c: set debug messages
apparmor: fix boolreturn.cocci warnings
Smack: Use GFP_KERNEL for smk_netlbl_mls().
smack: fix double free in smack_parse_opts_str()
KEYS: add SP800-56A KDF support for DH
KEYS: Keyring asymmetric key restrict method with chaining
KEYS: Restrict asymmetric key linkage using a specific keychain
KEYS: Add a lookup_restriction function for the asymmetric key type
KEYS: Add KEYCTL_RESTRICT_KEYRING
KEYS: Consistent ordering for __key_link_begin and restrict check
KEYS: Add an optional lookup_restriction hook to key_type
...
By default we set DCACHE_REFERENCED and I_REFERENCED on any dentry or
inode we create. This is problematic as this means that it takes two
trips through the LRU for any of these objects to be reclaimed,
regardless of their actual lifetime. With enough pressure from these
caches we can easily evict our working set from page cache with single
use objects. So instead only set *REFERENCED if we've already been
added to the LRU list. This means that we've been touched since the
first time we were accessed, and so more likely to need to hang out in
cache.
To illustrate this issue I wrote the following scripts
https://github.com/josefbacik/debug-scripts/tree/master/cache-pressure
on my test box. It is a single socket 4 core CPU with 16gib of RAM and
I tested on an Intel 2tib NVME drive. The cache-pressure.sh script
creates a new file system and creates 2 6.5gib files in order to take up
13gib of the 16gib of ram with pagecache. Then it runs a test program
that reads these 2 files in a loop, and keeps track of how often it has
to read bytes for each loop. On an ideal system with no pressure we
should have to read 0 bytes indefinitely. The second thing this script
does is start a fs_mark job that creates a ton of 0 length files,
putting pressure on the system with slab only allocations. On exit the
script prints out how many bytes were read by the read-file program.
The results are as follows
Without patch:
/mnt/btrfs-test/reads/file1: total read during loops 27262988288
/mnt/btrfs-test/reads/file2: total read during loops 27262976000
With patch:
/mnt/btrfs-test/reads/file2: total read during loops 18640457728
/mnt/btrfs-test/reads/file1: total read during loops 9565376512
This patch results in a 50% reduction of the amount of pages evicted
from our working set.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
When the final cifsFileInfo_put() is called from cifsiod and an oplock
break work is queued, lockdep complains loudly:
=============================================
[ INFO: possible recursive locking detected ]
4.11.0+ #21 Not tainted
---------------------------------------------
kworker/0:2/78 is trying to acquire lock:
("cifsiod"){++++.+}, at: flush_work+0x215/0x350
but task is already holding lock:
("cifsiod"){++++.+}, at: process_one_work+0x255/0x8e0
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock("cifsiod");
lock("cifsiod");
*** DEADLOCK ***
May be due to missing lock nesting notation
2 locks held by kworker/0:2/78:
#0: ("cifsiod"){++++.+}, at: process_one_work+0x255/0x8e0
#1: ((&wdata->work)){+.+...}, at: process_one_work+0x255/0x8e0
stack backtrace:
CPU: 0 PID: 78 Comm: kworker/0:2 Not tainted 4.11.0+ #21
Workqueue: cifsiod cifs_writev_complete
Call Trace:
dump_stack+0x85/0xc2
__lock_acquire+0x17dd/0x2260
? match_held_lock+0x20/0x2b0
? trace_hardirqs_off_caller+0x86/0x130
? mark_lock+0xa6/0x920
lock_acquire+0xcc/0x260
? lock_acquire+0xcc/0x260
? flush_work+0x215/0x350
flush_work+0x236/0x350
? flush_work+0x215/0x350
? destroy_worker+0x170/0x170
__cancel_work_timer+0x17d/0x210
? ___preempt_schedule+0x16/0x18
cancel_work_sync+0x10/0x20
cifsFileInfo_put+0x338/0x7f0
cifs_writedata_release+0x2a/0x40
? cifs_writedata_release+0x2a/0x40
cifs_writev_complete+0x29d/0x850
? preempt_count_sub+0x18/0xd0
process_one_work+0x304/0x8e0
worker_thread+0x9b/0x6a0
kthread+0x1b2/0x200
? process_one_work+0x8e0/0x8e0
? kthread_create_on_node+0x40/0x40
ret_from_fork+0x31/0x40
This is a real warning. Since the oplock is queued on the same
workqueue this can deadlock if there is only one worker thread active
for the workqueue (which will be the case during memory pressure when
the rescuer thread is handling it).
Furthermore, there is at least one other kind of hang possible due to
the oplock break handling if there is only worker. (This can be
reproduced without introducing memory pressure by having passing 1 for
the max_active parameter of cifsiod.) cifs_oplock_break() can wait
indefintely in the filemap_fdatawait() while the cifs_writev_complete()
work is blocked:
sysrq: SysRq : Show Blocked State
task PC stack pid father
kworker/0:1 D 0 16 2 0x00000000
Workqueue: cifsiod cifs_oplock_break
Call Trace:
__schedule+0x562/0xf40
? mark_held_locks+0x4a/0xb0
schedule+0x57/0xe0
io_schedule+0x21/0x50
wait_on_page_bit+0x143/0x190
? add_to_page_cache_lru+0x150/0x150
__filemap_fdatawait_range+0x134/0x190
? do_writepages+0x51/0x70
filemap_fdatawait_range+0x14/0x30
filemap_fdatawait+0x3b/0x40
cifs_oplock_break+0x651/0x710
? preempt_count_sub+0x18/0xd0
process_one_work+0x304/0x8e0
worker_thread+0x9b/0x6a0
kthread+0x1b2/0x200
? process_one_work+0x8e0/0x8e0
? kthread_create_on_node+0x40/0x40
ret_from_fork+0x31/0x40
dd D 0 683 171 0x00000000
Call Trace:
__schedule+0x562/0xf40
? mark_held_locks+0x29/0xb0
schedule+0x57/0xe0
io_schedule+0x21/0x50
wait_on_page_bit+0x143/0x190
? add_to_page_cache_lru+0x150/0x150
__filemap_fdatawait_range+0x134/0x190
? do_writepages+0x51/0x70
filemap_fdatawait_range+0x14/0x30
filemap_fdatawait+0x3b/0x40
filemap_write_and_wait+0x4e/0x70
cifs_flush+0x6a/0xb0
filp_close+0x52/0xa0
__close_fd+0xdc/0x150
SyS_close+0x33/0x60
entry_SYSCALL_64_fastpath+0x1f/0xbe
Showing all locks held in the system:
2 locks held by kworker/0:1/16:
#0: ("cifsiod"){.+.+.+}, at: process_one_work+0x255/0x8e0
#1: ((&cfile->oplock_break)){+.+.+.}, at: process_one_work+0x255/0x8e0
Showing busy workqueues and worker pools:
workqueue cifsiod: flags=0xc
pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/1
in-flight: 16:cifs_oplock_break
delayed: cifs_writev_complete, cifs_echo_request
pool 0: cpus=0 node=0 flags=0x0 nice=0 hung=0s workers=3 idle: 750 3
Fix these problems by creating a a new workqueue (with a rescuer) for
the oplock break work.
Signed-off-by: Rabin Vincent <rabinv@axis.com>
Signed-off-by: Steve French <smfrench@gmail.com>
CC: Stable <stable@vger.kernel.org>
As with 618763958b, an open directory may have a NULL private_data
pointer prior to readdir. CIFS_ENUMERATE_SNAPSHOTS must check for this
before dereference.
Fixes: 834170c859 ("Enable previous version support")
Signed-off-by: David Disseldorp <ddiss@suse.de>
CC: Stable <stable@vger.kernel.org>
Signed-off-by: Steve French <smfrench@gmail.com>
The server may respond with success, and an output buffer less than
sizeof(struct smb_snapshot_array) in length. Do not leak the output
buffer in this case.
Fixes: 834170c859 ("Enable previous version support")
Signed-off-by: David Disseldorp <ddiss@suse.de>
CC: Stable <stable@vger.kernel.org>
Signed-off-by: Steve French <smfrench@gmail.com>
In order to shrink size of struct discard_cmd, change variable type of
@state in struct discard_cmd from int to unsigned char.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Previously, with protection of cmd_lock, we will wait for end io of
discard command which potentially may lead long latency, making worse
concurrency.
So, in this patch, we try to add reference into discard entry to prevent
the entry being released by other thread, then we can avoid holding
global cmd_lock during waiting discard to finish.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
F2FS uses 4 bytes to represent block address. As a result, supported
size of disk is 16 TB and it equals to 16 * 1024 * 1024 / 2 segments.
Signed-off-by: Jin Qian <jinqian@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If a page is cold, NOT atomit written and need_ipu now, there is
a high probability that IPU should be adapted. For IPU, we try to
check extent tree to get the block index first, instead of reading
the dnode page, where may lead to an useless dnode IO, since no need to
update the dnode index for IPU.
Signed-off-by: Hou Pengyang <houpengyang@huawei.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch introduces encrypt_one_page which encrypts one data page before
submit_bio, and change the use of need_inplace_update.
Signed-off-by: Hou Pengyang <houpengyang@huawei.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Pull livepatch updates from Jiri Kosina:
- a per-task consistency model is being added for architectures that
support reliable stack dumping (extending this, currently rather
trivial set, is currently in the works).
This extends the nature of the types of patches that can be applied
by live patching infrastructure. The code stems from the design
proposal made [1] back in November 2014. It's a hybrid of SUSE's
kGraft and RH's kpatch, combining advantages of both: it uses
kGraft's per-task consistency and syscall barrier switching combined
with kpatch's stack trace switching. There are also a number of
fallback options which make it quite flexible.
Most of the heavy lifting done by Josh Poimboeuf with help from
Miroslav Benes and Petr Mladek
[1] https://lkml.kernel.org/r/20141107140458.GA21774@suse.cz
- module load time patch optimization from Zhou Chengming
- a few assorted small fixes
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/livepatching:
livepatch: add missing printk newlines
livepatch: Cancel transition a safe way for immediate patches
livepatch: Reduce the time of finding module symbols
livepatch: make klp_mutex proper part of API
livepatch: allow removal of a disabled patch
livepatch: add /proc/<pid>/patch_state
livepatch: change to a per-task consistency model
livepatch: store function sizes
livepatch: use kstrtobool() in enabled_store()
livepatch: move patching functions into patch.c
livepatch: remove unnecessary object loaded check
livepatch: separate enabled and patched states
livepatch/s390: add TIF_PATCH_PENDING thread flag
livepatch/s390: reorganize TIF thread flag bits
livepatch/powerpc: add TIF_PATCH_PENDING thread flag
livepatch/x86: add TIF_PATCH_PENDING thread flag
livepatch: create temporary klp_update_patch_state() stub
x86/entry: define _TIF_ALLWORK_MASK flags explicitly
stacktrace/x86: add function for detecting reliable stack traces
Pull networking updates from David Millar:
"Here are some highlights from the 2065 networking commits that
happened this development cycle:
1) XDP support for IXGBE (John Fastabend) and thunderx (Sunil Kowuri)
2) Add a generic XDP driver, so that anyone can test XDP even if they
lack a networking device whose driver has explicit XDP support
(me).
3) Sparc64 now has an eBPF JIT too (me)
4) Add a BPF program testing framework via BPF_PROG_TEST_RUN (Alexei
Starovoitov)
5) Make netfitler network namespace teardown less expensive (Florian
Westphal)
6) Add symmetric hashing support to nft_hash (Laura Garcia Liebana)
7) Implement NAPI and GRO in netvsc driver (Stephen Hemminger)
8) Support TC flower offload statistics in mlxsw (Arkadi Sharshevsky)
9) Multiqueue support in stmmac driver (Joao Pinto)
10) Remove TCP timewait recycling, it never really could possibly work
well in the real world and timestamp randomization really zaps any
hint of usability this feature had (Soheil Hassas Yeganeh)
11) Support level3 vs level4 ECMP route hashing in ipv4 (Nikolay
Aleksandrov)
12) Add socket busy poll support to epoll (Sridhar Samudrala)
13) Netlink extended ACK support (Johannes Berg, Pablo Neira Ayuso,
and several others)
14) IPSEC hw offload infrastructure (Steffen Klassert)"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (2065 commits)
tipc: refactor function tipc_sk_recv_stream()
tipc: refactor function tipc_sk_recvmsg()
net: thunderx: Optimize page recycling for XDP
net: thunderx: Support for XDP header adjustment
net: thunderx: Add support for XDP_TX
net: thunderx: Add support for XDP_DROP
net: thunderx: Add basic XDP support
net: thunderx: Cleanup receive buffer allocation
net: thunderx: Optimize CQE_TX handling
net: thunderx: Optimize RBDR descriptor handling
net: thunderx: Support for page recycling
ipx: call ipxitf_put() in ioctl error path
net: sched: add helpers to handle extended actions
qed*: Fix issues in the ptp filter config implementation.
qede: Fix concurrency issue in PTP Tx path processing.
stmmac: Add support for SIMATIC IOT2000 platform
net: hns: fix ethtool_get_strings overflow in hns driver
tcp: fix wraparound issue in tcp_lp
bpf, arm64: fix jit branch offset related to ldimm64
bpf, arm64: implement jiting of BPF_XADD
...
Mac requires the unicode flag to be set for cifs, even for the smb
echo request (which doesn't have strings).
Without this Mac rejects the periodic echo requests (when mounting
with cifs) that we use to check if server is down
Signed-off-by: Steve French <smfrench@gmail.com>
CC: Stable <stable@vger.kernel.org>
This patch adds support to process write calls passed by io_submit()
asynchronously. It based on the previously introduced async context
that allows to process i/o responses in a separate thread and
return the caller immediately for asynchronous calls.
This improves writing performance of single threaded applications
with increasing of i/o queue depth size.
Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <smfrench@gmail.com>
This patch adds support to process read calls passed by io_submit()
asynchronously. It based on the previously introduced async context
that allows to process i/o responses in a separate thread and
return the caller immediately for asynchronous calls.
This improves reading performance of single threaded applications
with increasing of i/o queue depth size.
Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Currently the code doesn't recognize asynchronous calls passed
by io_submit() and processes all calls synchronously. This is not
what kernel AIO expects. This patch introduces a new async context
that keeps track of all issued i/o requests and moves a response
collecting procedure to a separate thread. This allows to return
to a caller immediately for async calls and call iocb->ki_complete()
once all requests are completed. For sync calls the current thread
simply waits until all requests are completed.
Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <smfrench@gmail.com>
When the IP address is gotten from the UNC, use only the address part
of the UNC. Else all after the percent sign in an IPv6 link local
address is interpreted as a scope id. This includes the slash and
share name. A scope id is expected to be an integer and any trailing
characters makes the conversion to integer fail.
Example of mount command that fails:
mount -i -t cifs //fe80::6a05:caff:fe3e:8ffc%2/test /mnt/t -o sec=none
Signed-off-by: Daniel N Pettersson <danielnp@axis.com>
Signed-off-by: Steve French <smfrench@gmail.com>
January is month 1. There is no zero-th month. If someone passes a
zero month then it means we read from one space before the start of the
total_days_of_prev_months[] array.
We may as well also be strict about days as well.
Fixes: 1bd5bbcb65 ("[CIFS] Legacy time handling for Win9x and OS/2 part 1")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Pull fs/compat.c cleanups from Al Viro:
"More moving of compat syscalls from fs/compat.c to fs/*.c where the
native counterparts live.
And death to compat_sys_getdents64() - the only architecture that used
to need it was ia64, and _that_ has lost biarch support quite a few
years ago"
* 'work.compat' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fs/compat.c: trim unused includes
move compat_rw_copy_check_uvector() over to fs/read_write.c
fhandle: move compat syscalls from compat.c
open: move compat syscalls from compat.c
stat: move compat syscalls from compat.c
fcntl: move compat syscalls from compat.c
readdir: move compat syscalls from compat.c
statfs: move compat syscalls from compat.c
utimes: move compat syscalls from compat.c
move compat select-related syscalls to fs/select.c
Remove compat_sys_getdents64()
Pull splice updates from Al Viro:
"These actually missed the last cycle; the branch itself is from last
December"
* 'work.splice' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
make nr_pages calculation in default_file_splice_read() a bit less ugly
splice/tee/vmsplice: validate flags
splice_pipe_desc: kill ->flags
remove spd_release_page()
Pull iov_iter updates from Al Viro:
"Cleanups that sat in -next + -stable fodder that has just missed 4.11.
There's more iov_iter work in my local tree, but I'd prefer to push
the stuff that had been in -next first"
* 'work.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
iov_iter: don't revert iov buffer if csum error
generic_file_read_iter(): make use of iov_iter_revert()
generic_file_direct_write(): make use of iov_iter_revert()
orangefs: use iov_iter_revert()
sctp: switch to copy_from_iter_full()
net/9p: switch to copy_from_iter_full()
switch memcpy_from_msg() to copy_from_iter_full()
rds: make use of iov_iter_revert()
Pull CIFS fixes from Steve French:
"Three cifs/smb3 fixes - including two for stable"
* 'for-next' of git://git.samba.org/sfrench/cifs-2.6:
cifs: don't check for failure from mempool_alloc()
Do not return number of bytes written for ioctl CIFS_IOC_COPYCHUNK_FILE
Fix match_prepath()
- restore powerpc dumping; Ankit Kumar
- fix more bugs in the rarely exercises module unloading logic
- reorganize filesystem locking to fix problems noticed by lockdep
- refactor internal pstore APIs to make development and review easier:
- improve error reporting
- add kernel-doc structure and function comments
- avoid insane argument passing by using a common record structure
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
Comment: Kees Cook <kees@outflux.net>
iQIcBAABCgAGBQJZB3wzAAoJEIly9N/cbcAmVQAP/3+GzoxUcL43ypsDa1CmCsFN
l2roQjzWLNGfHgq5qkS/mNtrUdEvBMUBd2oyhHcaiqM0DsuuO3rKTp6dZ8oczYjN
6GoTmU8ZwIPze3VEadNPjCIdpsfTMNKtZvVJCTrWnsgXxTawDS89qqr7SCs3qhBS
Dm1E8oX77YyhOKoGA6O3CJxpdm/Ge4+KpPR6Uwj90eVro04vYoiwnBjyLUzE7w1l
JcXEGEh1t5NjUxHeMwW7HZwJYZfA3DQ6I3MOOzGhf9tsKp6J0LQTTV8PMSEo1mif
mLZhDBy8BBlunL+b2Tp3+c4QItGSHkBCWASI2RLa2TM7xvL67oC+qm/WaUyoRovy
hllEG96rsCs3Zx7fFFsfQCwURcTWfJQMrD+0d/fM+P2ylWvgp+KU6PeLTS9IHu6M
3n6i5i6A6OY/QvmZr1tN/06kUBjtQmo8EgQ0jxoxAlWyNcJqi93hmJyaRW28KxjS
tjFTNLZMrslj0UDmjiD6fIuaT6gsGDB+3wAMPVAf+iV/k/2GUlj3ZILe4RaABAe9
8xaUu11tZ5sTniayZ+10bA+6+K5n7uTlgU8RfFgaUZoRAzHgtyijOmdo6N+HILfK
klv59B1Fmf6JpDlq7L9vurOqE82FAWFn4DruFM2bAaky2meFUNbYFiNfwK4l6lPI
pmAgpdgRRvNMBCEmbVfv
=S14G
-----END PGP SIGNATURE-----
Merge tag 'pstore-v4.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull pstore updates from Kees Cook:
"This has a large internal refactoring along with several smaller
fixes.
- constify compression structures; Bhumika Goyal
- restore powerpc dumping; Ankit Kumar
- fix more bugs in the rarely exercises module unloading logic
- reorganize filesystem locking to fix problems noticed by lockdep
- refactor internal pstore APIs to make development and review
easier:
- improve error reporting
- add kernel-doc structure and function comments
- avoid insane argument passing by using a common record
structure"
* tag 'pstore-v4.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (23 commits)
pstore: Solve lockdep warning by moving inode locks
pstore: Fix flags to enable dumps on powerpc
pstore: Remove unused vmalloc.h in pmsg
pstore: simplify write_user_compat()
pstore: Remove write_buf() callback
pstore: Replace arguments for write_buf_user() API
pstore: Replace arguments for write_buf() API
pstore: Replace arguments for erase() API
pstore: Do not duplicate record metadata
pstore: Allocate records on heap instead of stack
pstore: Pass record contents instead of copying
pstore: Always allocate buffer for decompression
pstore: Replace arguments for write() API
pstore: Replace arguments for read() API
pstore: Switch pstore_mkfile to pass record
pstore: Move record decompression to function
pstore: Extract common arguments into structure
pstore: Add kernel-doc for struct pstore_info
pstore: Improve register_pstore() error reporting
pstore: Avoid race in module unloading
...
Consider the following deadlock:
Process P1 Process P2 Process P3
========== ========== ==========
lock_page(page)
lseg = pnfs_update_layout(inode)
lo = NFS_I(inode)->layout
pnfs_error_mark_layout_for_return(lo)
lock_page(page)
lseg = pnfs_update_layout(inode)
In this scenario,
- P1 has declared the layout to be in error, but P2 holds a reference to
a layout segment on that inode, so the layoutreturn is deferred.
- P2 is waiting for a page lock held by P3.
- P3 is asking for a new layout segment, but is blocked waiting
for the layoutreturn.
The fix is to ensure that pnfs_error_mark_layout_for_return() does
not set the NFS_LAYOUT_RETURN flag, which blocks P3. Instead, we allow
the latter to call LAYOUTGET so that it can make progress and unblock
P2.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
In pnfs_clear_layoutreturn_info, ensure that we don't clear the layout
return info if there are new segments queued for return due to, for
instance, a race between a LAYOUTRETURN and a failed I/O attempt.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
When using both encryption and SELinux (or another feature that requires
an xattr per file) on a filesystem with 256-byte inodes, each file's
xattrs usually spill into an external xattr block. Currently, the
xattrs are inherited in the order ACL, security, then encryption.
Therefore, if spillage occurs, the encryption xattr will always end up
in the external block. This is not ideal because the encryption xattrs
contain a nonce, so they will always be unique and will prevent the
external xattr blocks from being deduplicated.
To improve the situation, change the inheritance order to encryption,
ACL, then security. This gives the encryption xattr a better chance to
be stored in-inode, allowing the other xattr(s) to be deduplicated.
Note that it may be better for userspace to format the filesystem with
512-byte inodes in this case. However, it's not the default.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Pul x86/process updates from Ingo Molnar:
"The main change in this cycle was to add the ARCH_[GET|SET]_CPUID
prctl() ABI extension to control the availability of the CPUID
instruction, analogously to the existing PR_GET|SET_TSC ABI that
controls RDTSC.
Motivation: the 'rr' user-space record-and-replay execution debugger
would like to trap and emulate the CPUID instruction - which
instruction is normally unprivileged.
Trapping CPUID is possible on IvyBridge and later Intel CPUs - expose
this hardware capability"
* 'x86-process-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/syscalls/32: Ignore arch_prctl for other architectures
um/arch_prctl: Fix fallout from x86 arch_prctl() rework
x86/arch_prctl: Add ARCH_[GET|SET]_CPUID
x86/cpufeature: Detect CPUID faulting support
x86/syscalls/32: Wire up arch_prctl on x86-32
x86/arch_prctl: Add do_arch_prctl_common()
x86/arch_prctl/64: Rename do_arch_prctl() to do_arch_prctl_64()
x86/arch_prctl/64: Use SYSCALL_DEFINE2 to define sys_arch_prctl()
x86/arch_prctl: Rename 'code' argument to 'option'
x86/msr: Rename MISC_FEATURE_ENABLES to MISC_FEATURES_ENABLES
x86/process: Optimize TIF_NOTSC switch
x86/process: Correct and optimize TIF_BLOCKSTEP switch
x86/process: Optimize TIF checks in __switch_to_xtra()
Pull uaccess unification updates from Al Viro:
"This is the uaccess unification pile. It's _not_ the end of uaccess
work, but the next batch of that will go into the next cycle. This one
mostly takes copy_from_user() and friends out of arch/* and gets the
zero-padding behaviour in sync for all architectures.
Dealing with the nocache/writethrough mess is for the next cycle;
fortunately, that's x86-only. Same for cleanups in iov_iter.c (I am
sold on access_ok() in there, BTW; just not in this pile), same for
reducing __copy_... callsites, strn*... stuff, etc. - there will be a
pile about as large as this one in the next merge window.
This one sat in -next for weeks. -3KLoC"
* 'work.uaccess' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (96 commits)
HAVE_ARCH_HARDENED_USERCOPY is unconditional now
CONFIG_ARCH_HAS_RAW_COPY_USER is unconditional now
m32r: switch to RAW_COPY_USER
hexagon: switch to RAW_COPY_USER
microblaze: switch to RAW_COPY_USER
get rid of padding, switch to RAW_COPY_USER
ia64: get rid of copy_in_user()
ia64: sanitize __access_ok()
ia64: get rid of 'segment' argument of __do_{get,put}_user()
ia64: get rid of 'segment' argument of __{get,put}_user_check()
ia64: add extable.h
powerpc: get rid of zeroing, switch to RAW_COPY_USER
esas2r: don't open-code memdup_user()
alpha: fix stack smashing in old_adjtimex(2)
don't open-code kernel_setsockopt()
mips: switch to RAW_COPY_USER
mips: get rid of tail-zeroing in primitives
mips: make copy_from_user() zero tail explicitly
mips: clean and reorder the forest of macros...
mips: consolidate __invoke_... wrappers
...
The new message has an incorrect format string, causing a warning in some
configurations:
fs/block_dev.c: In function 'bdev_dax_supported':
fs/block_dev.c:779:5: error: format '%d' expects argument of type 'int', but argument 2 has type 'long int' [-Werror=format=]
"error: dax access failed (%d)", len);
This changes it to use the correct %ld instead of %d.
Fixes: 2093f2e9df ("block, dax: convert bdev_dax_supported() to dax_direct_access()")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Pull block layer updates from Jens Axboe:
- Add BFQ IO scheduler under the new blk-mq scheduling framework. BFQ
was initially a fork of CFQ, but subsequently changed to implement
fairness based on B-WF2Q+, a modified variant of WF2Q. BFQ is meant
to be used on desktop type single drives, providing good fairness.
From Paolo.
- Add Kyber IO scheduler. This is a full multiqueue aware scheduler,
using a scalable token based algorithm that throttles IO based on
live completion IO stats, similary to blk-wbt. From Omar.
- A series from Jan, moving users to separately allocated backing
devices. This continues the work of separating backing device life
times, solving various problems with hot removal.
- A series of updates for lightnvm, mostly from Javier. Includes a
'pblk' target that exposes an open channel SSD as a physical block
device.
- A series of fixes and improvements for nbd from Josef.
- A series from Omar, removing queue sharing between devices on mostly
legacy drivers. This helps us clean up other bits, if we know that a
queue only has a single device backing. This has been overdue for
more than a decade.
- Fixes for the blk-stats, and improvements to unify the stats and user
windows. This both improves blk-wbt, and enables other users to
register a need to receive IO stats for a device. From Omar.
- blk-throttle improvements from Shaohua. This provides a scalable
framework for implementing scalable priotization - particularly for
blk-mq, but applicable to any type of block device. The interface is
marked experimental for now.
- Bucketized IO stats for IO polling from Stephen Bates. This improves
efficiency of polled workloads in the presence of mixed block size
IO.
- A few fixes for opal, from Scott.
- A few pulls for NVMe, including a lot of fixes for NVMe-over-fabrics.
From a variety of folks, mostly Sagi and James Smart.
- A series from Bart, improving our exposed info and capabilities from
the blk-mq debugfs support.
- A series from Christoph, cleaning up how handle WRITE_ZEROES.
- A series from Christoph, cleaning up the block layer handling of how
we track errors in a request. On top of being a nice cleanup, it also
shrinks the size of struct request a bit.
- Removal of mg_disk and hd (sorry Linus) by Christoph. The former was
never used by platforms, and the latter has outlived it's usefulness.
- Various little bug fixes and cleanups from a wide variety of folks.
* 'for-4.12/block' of git://git.kernel.dk/linux-block: (329 commits)
block: hide badblocks attribute by default
blk-mq: unify hctx delay_work and run_work
block: add kblock_mod_delayed_work_on()
blk-mq: unify hctx delayed_run_work and run_work
nbd: fix use after free on module unload
MAINTAINERS: bfq: Add Paolo as maintainer for the BFQ I/O scheduler
blk-mq-sched: alloate reserved tags out of normal pool
mtip32xx: use runtime tag to initialize command header
scsi: Implement blk_mq_ops.show_rq()
blk-mq: Add blk_mq_ops.show_rq()
blk-mq: Show operation, cmd_flags and rq_flags names
blk-mq: Make blk_flags_show() callers append a newline character
blk-mq: Move the "state" debugfs attribute one level down
blk-mq: Unregister debugfs attributes earlier
blk-mq: Only unregister hctxs for which registration succeeded
blk-mq-debugfs: Rename functions for registering and unregistering the mq directory
blk-mq: Let blk_mq_debugfs_register() look up the queue name
blk-mq: Register <dev>/queue/mq after having registered <dev>/queue
ide-pm: always pass 0 error to ide_complete_rq in ide_do_devset
ide-pm: always pass 0 error to __blk_end_request_all
..
Currently ext4_writepages() submits all pages with transaction started.
When no page needs block allocation or extent conversion we can submit
all dirty pages in the inode while holding a single transaction handle
and when device is congested this can take significant amount of time.
Thus ext4_writepages() can block transaction commits for extended
periods of time.
Take for example a simple benchmark simulating PostgreSQL database
(pgioperf in mmtest). The benchmark runs 16 processes doing random reads
from a huge file, one process doing random writes to the huge file, and
one process doing sequential writes to a small files and frequently
running fsync. With unpatched kernel transaction commits take on average
~18s with standard deviation of ~41s, top 5 commit times are:
274.466639s, 126.467347s, 86.992429s, 34.351563s, 31.517653s.
After this patch transaction commits take on average 0.1s with standard
deviation of 0.15s, top 5 commit times are:
0.563792s, 0.519980s, 0.509841s, 0.471700s, 0.469899s
[ Modified so we use an explicit do_map flag instead of relying on
io_end not being allocated, the since io_end->inode is needed for I/O
error handling. -- tytso ]
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
This commit exposes the necessary constants and structures for a
userspace program to pass filesystem encryption keys into the keyring.
The fscrypt_key structure was already part of the kernel ABI, this
change just makes it so programs no longer have to redeclare these
structures (like e4crypt in e2fsprogs currently does).
Note that we do not expose the other FS_*_KEY_SIZE constants as they are
not necessary. Only XTS is supported for contents_encryption_mode, so
currently FS_MAX_KEY_SIZE bytes of key material must always be passed to
the kernel.
This commit also removes __packed from fscrypt_key as it does not
contain any implicit padding and does not refer to an on-disk structure.
Signed-off-by: Joe Richey <joerichey@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
The functions in fs/crypto/*.c are only called by filesystems configured
with encryption support. Since the ->get_context(), ->set_context(),
and ->empty_dir() operations are always provided in that case (and must
be, otherwise there would be no way to get/set encryption policies, or
in the case of ->get_context() even access encrypted files at all),
there is no need to check for these operations being NULL and we can
remove these unneeded checks.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Richard Weinberger <richard@nod.at>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
With enabled meta_bg option block group descriptors
reading IO is not sequential and requires optimization.
Signed-off-by: Andrew Perepechko <andrew.perepechko@seagate.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Make the ext4_shutdown() function static, as suggested by running sparse
('make C=2 fs/ext4/'). This was the only such warning in fs/ext4/.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Support the GETFSMAP ioctls so that we can use the xfs free space
management tools to probe ext4 as well. Note that this is a partial
implementation -- we only report fixed-location metadata and free space;
everything else is reported as "unknown".
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Currently the case of writing via mmap to a file with inline data is not
handled. This is maybe a rare case since it requires a writable memory
map of a very small file, but it is trivial to trigger with on
inline_data filesystem, and it causes the
'BUG_ON(ext4_test_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA));' in
ext4_writepages() to be hit:
mkfs.ext4 -O inline_data /dev/vdb
mount /dev/vdb /mnt
xfs_io -f /mnt/file \
-c 'pwrite 0 1' \
-c 'mmap -w 0 1m' \
-c 'mwrite 0 1' \
-c 'fsync'
kernel BUG at fs/ext4/inode.c:2723!
invalid opcode: 0000 [#1] SMP
CPU: 1 PID: 2532 Comm: xfs_io Not tainted 4.11.0-rc1-xfstests-00301-g071d9acf3d1f #633
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-20170228_101828-anatol 04/01/2014
task: ffff88003d3a8040 task.stack: ffffc90000300000
RIP: 0010:ext4_writepages+0xc89/0xf8a
RSP: 0018:ffffc90000303ca0 EFLAGS: 00010283
RAX: 0000028410000000 RBX: ffff8800383fa3b0 RCX: ffffffff812afcdc
RDX: 00000a9d00000246 RSI: ffffffff81e660e0 RDI: 0000000000000246
RBP: ffffc90000303dc0 R08: 0000000000000002 R09: 869618e8f99b4fa5
R10: 00000000852287a2 R11: 00000000a03b49f4 R12: ffff88003808e698
R13: 0000000000000000 R14: 7fffffffffffffff R15: 7fffffffffffffff
FS: 00007fd3e53094c0(0000) GS:ffff88003e400000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fd3e4c51000 CR3: 000000003d554000 CR4: 00000000003406e0
Call Trace:
? _raw_spin_unlock+0x27/0x2a
? kvm_clock_read+0x1e/0x20
do_writepages+0x23/0x2c
? do_writepages+0x23/0x2c
__filemap_fdatawrite_range+0x80/0x87
filemap_write_and_wait_range+0x67/0x8c
ext4_sync_file+0x20e/0x472
vfs_fsync_range+0x8e/0x9f
? syscall_trace_enter+0x25b/0x2d0
vfs_fsync+0x1c/0x1e
do_fsync+0x31/0x4a
SyS_fsync+0x10/0x14
do_syscall_64+0x69/0x131
entry_SYSCALL64_slow_path+0x25/0x25
We could try to be smart and keep the inline data in this case, or at
least support delayed allocation when allocating the block, but these
solutions would be more complicated and don't seem worthwhile given how
rare this case seems to be. So just fix the bug by calling
ext4_convert_inline_data() when we're asked to make a page writable, so
that any inline data gets evicted, with the block allocated immediately.
Reported-by: Nick Alcock <nick.alcock@oracle.com>
Cc: stable@vger.kernel.org
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
ext4_xattr_check_entry() was redundant with validation of the full xattr
entries list in ext4_xattr_check_entries(), which all callers also did.
ext4_xattr_check_entry() also didn't actually do correct validation;
specifically, it never checked that the value doesn't overlap the xattr
names, nor did it account for padding when checking whether the xattr
value overflows the available space. So remove it to eliminate any
potential confusion.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
ext4_xattr_check_names() actually validates both the xattr names and
values, not just the names. So rename it to ext4_xattr_check_entries()
to avoid confusion.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
There's no difference between ext4_xattr_list() and ext4_listxattr(), so
merge them together and just have ext4_listxattr(). Some years ago they
took different arguments, but that's no longer the case.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Constify static data in ext4 that is never (intentionally) modified so
that it is placed in .rodata and benefits from memory protection.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
In the initial implementation of ext4 encryption, the filename was
encrypted in ext4_insert_dentry(), which could fail and also required
access to the 'dir' inode. Since then ext4 filename encryption has been
changed to encrypt the filename earlier, so we can revert the additions
to ext4_insert_dentry().
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Commit b685d3d65a "block: treat REQ_FUA and REQ_PREFLUSH as
synchronous" removed REQ_SYNC flag from WRITE_FUA implementation. Since
JBD2 strips REQ_FUA and REQ_FLUSH flags from submitted IO when the
filesystem is mounted with nobarrier mount option, journal superblock
writes ended up being async writes after this patch and that caused
heavy performance regression for dbench4 benchmark with high number of
processes. In my test setup with HP RAID array with non-volatile write
cache and 32 GB ram, dbench4 runs with 8 processes regressed by ~25%.
Fix the problem by making sure journal superblock writes are always
treated as synchronous since they generally block progress of the
journalling machinery and thus the whole filesystem.
Fixes: b685d3d65a
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
I've hit a lockdep splat with generic/270 test complaining that:
3216.fsstress.b/3533 is trying to acquire lock:
(jbd2_handle){++++..}, at: [<ffffffff813152e0>] jbd2_log_wait_commit+0x0/0x150
but task is already holding lock:
(jbd2_handle){++++..}, at: [<ffffffff8130bd3b>] start_this_handle+0x35b/0x850
The underlying problem is that jbd2_journal_force_commit_nested()
(called from ext4_should_retry_alloc()) may get called while a
transaction handle is started. In such case it takes care to not wait
for commit of the running transaction (which would deadlock) but only
for a commit of a transaction that is already committing (which is safe
as that doesn't wait for any filesystem locks).
In fact there are also other callers of jbd2_log_wait_commit() that take
care to pass tid of a transaction that is already committing and for
those cases, the lockdep instrumentation is too restrictive and leading
to false positive reports. Fix the problem by calling
jbd2_might_wait_for_commit() from jbd2_log_wait_commit() only if the
transaction isn't already committing.
Fixes: 1eaa566d36
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
cmd in COMPATIBLE_IOCTL is always a u32, so cast it so there isn't a
warning about an overflow in XFORM.
From: Mark Charlebois <charlebm@gmail.com>
Signed-off-by: Mark Charlebois <charlebm@gmail.com>
Signed-off-by: Behan Webster <behanw@converseincode.com>
Signed-off-by: Matthias Kaehlcke <mka@chromium.org>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
If the layout is being invalidated on the server, then we must
invoke nfs_commit_inode() to ensure any commits to the DS get
cleared out.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
If the layout was invalidated, then assume we should requeue all the
pending writes for the DS in question.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
If the attempt to write through pNFS fails, we need to use the same
failure semantics as for the read path: If the FF_FLAGS_NO_IO_THRU_MDS
flag is set or we have sufficient valid DSes, then we must retry through
pNFS
Fixes: d67ae825a5 ("pnfs/flexfiles: Add the FlexFile Layout Driver")
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
proc_create_mount_point() forgot to increase the parent's nlink, and
it resulted in unbalanced hard link numbers, e.g. /proc/fs shows one
less than expected.
Fixes: eb6d38d542 ("proc: Allow creating permanently empty directories...")
Cc: stable@vger.kernel.org
Reported-by: Tristan Ye <tristan.ye@suse.com>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Pull btrfs fix from Chris Mason:
"We have one more fix for btrfs.
This gets rid of a new WARN_ON from rc1 that ended up making more
noise than we really want. The larger fix for the underflow got
delayed a bit and it's better for now to put it under
CONFIG_BTRFS_DEBUG"
* 'for-linus-4.11' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
btrfs: qgroup: move noisy underflow warning to debugging build
pnfs_error_mark_layout_for_return needs to check that the layout is
valid before calling pnfs_set_plh_return_info().
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
If the RPC slot was interrupted and server replied to the next
operation on the "reused" slot with ERR_DELAY, don't clear out
the "interrupted" flag until we properly recover.
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Function xdr_inline_decode() will return a NULL pointer if the input
buffer does not have long enough buffer to decode nbytes of data.
However, in function decode_op_map(), the return value of
xdr_inline_decode() is not validated before it is used. This patch adds
a check to the return value of xdr_inline_decode().
Signed-off-by: Pan Bian <bianpan2016@163.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
The quotaoff operation has a race with inode allocation that results
in a livelock. An inode allocation that occurs before the quota
status flags are updated acquires the appropriate dquots for the
inode via xfs_qm_vop_dqalloc(). It then inserts the XFS_INEW inode
into the perag radix tree, sometime later attaches the dquots to the
inode and finally clears the XFS_INEW flag. Quotaoff expects to
release the dquots from all inodes in the filesystem via
xfs_qm_dqrele_all_inodes(). This invokes the AG inode iterator,
which skips inodes in the XFS_INEW state because they are not fully
constructed. If the scan occurs after dquots have been attached to
an inode, but before XFS_INEW is cleared, the newly allocated inode
will continue to hold a reference to the applicable dquots. When
quotaoff invokes xfs_qm_dqpurge_all(), the reference count of those
dquot(s) remain elevated and the dqpurge scan spins indefinitely.
To address this problem, update the xfs_qm_dqrele_all_inodes() scan
to wait on inodes marked on the XFS_INEW state. We wait on the
inodes explicitly rather than skip and retry to avoid continuous
retry loops due to a parallel inode allocation workload. Since
quotaoff updates the quota state flags and uses a synchronous
transaction before the dqrele scan, and dquots are attached to
inodes after radix tree insertion iff quota is enabled, one INEW
waiting pass through the AG guarantees that the scan has processed
all inodes that could possibly hold dquot references.
Reported-by: Eryu Guan <eguan@redhat.com>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The AG inode iterator currently skips new inodes as such inodes are
inserted into the inode radix tree before they are fully
constructed. Certain contexts require the ability to wait on the
construction of new inodes, however. The fs-wide dquot release from
the quotaoff sequence is an example of this.
Update the AG inode iterator to support the ability to wait on
inodes flagged with XFS_INEW upon request. Create a new
xfs_inode_ag_iterator_flags() interface and support a set of
iteration flags to modify the iteration behavior. When the
XFS_AGITER_INEW_WAIT flag is set, include XFS_INEW flags in the
radix tree inode lookup and wait on them before the callback is
executed.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Inodes that are inserted into the perag tree but still under
construction are flagged with the XFS_INEW bit. Most contexts either
skip such inodes when they are encountered or have the ability to
handle them.
The runtime quotaoff sequence introduces a context that must wait
for construction of such inodes to correctly ensure that all dquots
in the fs are released. In anticipation of this, support the ability
to wait on new inodes. Wake the appropriate bit when XFS_INEW is
cleared.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Copy the uuid of the filesystem to struct super_block s_uuid field,
as several other filesystems already do. Copy regardless of the nouuid
mount option, because other filesystems also do not guaranty uniqueness
of the s_uuid field in super_block struct.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
mempool_alloc() cannot fail if the gfp flags allow it to
sleep, and both GFP_FS allows for sleeping.
So these tests of the return value from mempool_alloc()
cannot be needed.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Steve French <smfrench@gmail.com>
commit 620d8745b3 ("Introduce cifs_copy_file_range()") changes the
behaviour of the cifs ioctl call CIFS_IOC_COPYCHUNK_FILE. In case of
successful writes, it now returns the number of bytes written. This
return value is treated as an error by the xfstest cifs/001. Depending
on the errno set at that time, this may or may not result in the test
failing.
The patch fixes this by setting the return value to 0 in case of
successful writes.
Fixes: commit 620d8745b3 ("Introduce cifs_copy_file_range()")
Reported-by: Eryu Guan <eguan@redhat.com>
Signed-off-by: Sachin Prabhu <sprabhu@redhat.com>
Acked-by: Pavel Shilovsky <pshilov@microsoft.com>
Cc: stable@vger.kernel.org
Signed-off-by: Steve French <smfrench@gmail.com>
Incorrect return value for shares not using the prefix path means that
we will never match superblocks for these shares.
Fixes: commit c1d8b24d18 ("Compare prepaths when comparing superblocks")
Cc: stable@vger.kernel.org
Signed-off-by: Sachin Prabhu <sprabhu@redhat.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Lockdep complains about a possible deadlock between mount and unlink
(which is technically impossible), but fixing this improves possible
future multiple-backend support, and keeps locking in the right order.
The lockdep warning could be triggered by unlinking a file in the
pstore filesystem:
-> #1 (&sb->s_type->i_mutex_key#14){++++++}:
lock_acquire+0xc9/0x220
down_write+0x3f/0x70
pstore_mkfile+0x1f4/0x460
pstore_get_records+0x17a/0x320
pstore_fill_super+0xa4/0xc0
mount_single+0x89/0xb0
pstore_mount+0x13/0x20
mount_fs+0xf/0x90
vfs_kern_mount+0x66/0x170
do_mount+0x190/0xd50
SyS_mount+0x90/0xd0
entry_SYSCALL_64_fastpath+0x1c/0xb1
-> #0 (&psinfo->read_mutex){+.+.+.}:
__lock_acquire+0x1ac0/0x1bb0
lock_acquire+0xc9/0x220
__mutex_lock+0x6e/0x990
mutex_lock_nested+0x16/0x20
pstore_unlink+0x3f/0xa0
vfs_unlink+0xb5/0x190
do_unlinkat+0x24c/0x2a0
SyS_unlinkat+0x16/0x30
entry_SYSCALL_64_fastpath+0x1c/0xb1
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&sb->s_type->i_mutex_key#14);
lock(&psinfo->read_mutex);
lock(&sb->s_type->i_mutex_key#14);
lock(&psinfo->read_mutex);
Reported-by: Marta Lofstedt <marta.lofstedt@intel.com>
Reported-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Namhyung Kim <namhyung@kernel.org>
We want to use kthread_stop() in order to ensure the threads are
shut down before we tear down the nfs_callback_info in nfs_callback_down.
Tested-and-reviewed-by: Kinglong Mee <kinglongmee@gmail.com>
Reported-by: Kinglong Mee <kinglongmee@gmail.com>
Fixes: bb6aeba736 ("NFSv4.x: Switch to using svc_set_num_threads()...")
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Since the vmalloc code has been removed from write_pmsg() in the commit
"5bf6d1b pstore/pmsg: drop bounce buffer", remove the unused header
vmalloc.h.
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
our NFSv2/v3 xdr code that could crash the server or leak memory.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJZAlLrAAoJECebzXlCjuG+lb8P/idTu9rLGaU1VYPInrdoXru0
iPY+p5inmGSYW2MfCGlS7disaCACgzPBKVqKjeNB1hfHn2JZCrfeBd0XvBYlc7TH
JYmlHKSBjN/ZfnYMl0WrlVUCZstt4JVmxWjO3sZTCL3nImEbGA7d13yBDagxISWh
gM9wOOiJwR5lzT/W3MezNCYj4n27/vVhODMP+Qhy0rpK08dBORIH0Bi7hwtVgdD9
cMVIXMbujmJrCz0Uhbo/DhoItcePRBrCLTXdY6WEeMmHxnXlyaA2XuA0EjjZHuMs
+BsbAOsNy1BxY1c8Z2ignYgRymUvUBHeiJeIGZLbyKKM2OJMtE4BoqlSWjx08P3Y
hTwTuaw8u7uxfAsvxepukZoonWj/uVY5tP2Hyq+K6CIKKJpB7vj+c3QGYD3dNnUu
zDl/LG3ayZgpPXqfjKRnSzZE+St1/IwDnvaM2WN2B1mkuerVr8qDGq6xd4kB0QKX
BcEKiwcb2ewfPaLlVnSXz6Wbuh2pB42BObJjC3qbOgMvQ7SBUM0UcZmFpJJ20uCR
BX20aFzB/GHcd6fTRHpDrAxB4XGdVG/8Da5Ki2WhRnmmaeSXushhiKWImujY9B6i
s4mZHu4gGGJdLzOT5u2HZl93STevriL70SA9nZPhPLQycnJMUAO0buU9UcNI7wXR
GVu2F3IHd+anxDtDwOv4
=Glhl
-----END PGP SIGNATURE-----
Merge tag 'nfsd-4.11-3' of git://linux-nfs.org/~bfields/linux
Pull nfsd fixes from Bruce Fields:
"Thanks to Ari Kauppi and Tuomas Haanpää at Synopsis for spotting bugs
in our NFSv2/v3 xdr code that could crash the server or leak memory"
* tag 'nfsd-4.11-3' of git://linux-nfs.org/~bfields/linux:
nfsd: stricter decoding of write-like NFSv2/v3 ops
nfsd4: minor NFSv2/v3 write decoding cleanup
nfsd: check for oversized NFSv2/v3 arguments
stable.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQEcBAABCAAGBQJZAhJLAAoJEEp/3jgCEfOLw4gH/ia+bMzmsnkYtjMfxQfCh0ia
MHi7JS/YcAej/o71c/tvWlTU7mRbmvUCVSAcishRNytEBNGL8YzkP12vMOp/5Vdx
kKk6yDWn9z0mR5/YdBKaE8ziM5Umdy+zLqeL4yuxyhtbxKFGUPG4txJKS5WD80yU
Ld/toF2fL3y/JEs+s1pd5G+DPhEhEm2hFf56/VI6N7y08CHJgTqHB3GJ3ZnuUbnU
UhSvNR9skdVirObI8jt3oWIix8uAGq5+6MjVeTqXo75Qng5sdBGZ8S2agxXbM3j7
Hu8h/1bhKyPCUzAXnOyGcZeR+5DQolKmlKLhogbT4I9X4YC2ie4Djg0bmFHscWI=
=8aUa
-----END PGP SIGNATURE-----
Merge tag 'ceph-for-4.11-rc9' of git://github.com/ceph/ceph-client
Pull ceph fix from Ilya Dryomov:
"A fix for a kernel stack overflow bug in ceph setattr code, marked for
stable"
* tag 'ceph-for-4.11-rc9' of git://github.com/ceph/ceph-client:
ceph: fix recursion between ceph_set_acl() and __ceph_setattr()
Pull vfs fixes from Al Viro:
- fix orangefs handling of faults on write() - I'd missed that one back
when orangefs was going through review.
- readdir counterpart of "9p: cope with bogus responses from server in
p9_client_{read,write}" - server might be lying or broken, and we'd
better not overrun the kmalloc'ed buffer we are copying the results
into.
- NFS O_DIRECT read/write can leave iov_iter advanced by too much;
that's what had been causing iov_iter_pipe() warnings davej had been
seeing.
- statx_timestamp.tv_nsec type fix (s32 -> u32). That one really should
go in before 4.11.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
uapi: change the type of struct statx_timestamp.tv_nsec to unsigned
fix nfs O_DIRECT advancing iov_iter too much
p9_client_readdir() fix
orangefs_bufmap_copy_from_iovec(): fix EFAULT handling
fstrim can take really long time on big, slow device or on file system
with a lots of allocation groups. Currently there is no way for the user
to cancell the operation. This patch makes it possible for the user to
kill fstrim pocess by adding the check for fatal_signal_pending() in
xfs_trim_extents().
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Reported-by: Zdenek Kabelac <zkabelac@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The change in commit 1e2f82d1e9 ("statx: Kill fd-with-NULL-path
support in favour of AT_EMPTY_PATH") to error on a NULL pathname to
statx() is inconsistent.
It results in the error EINVAL for a NULL pathname. Other system calls
with similar APIs (fchownat(), fstatat(), linkat()), return EFAULT.
The solution is simply to remove the EINVAL check. As I already pointed
out in [1], user_path_at*() and filename_lookup() will handle the NULL
pathname as per the other APIs, to correctly produce the error EFAULT.
[1] https://lkml.org/lkml/2017/4/26/561
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently we just stash anything we got into file->f_flags, and the
report it in fcntl(F_GETFD). This patch just clears out all unknown
flags so that we don't pass them to the fs or report them.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Add a central define for all valid open flags, and use it in the uniqueness
check.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
_submit_bh() allowed submitting a buffer_head for I/O using custom
bio_flags. It used to be used by jbd to set BIO_SNAP_STABLE, introduced
by commit 7136851117 ("mm: make snapshotting pages for stable writes a
per-bio operation"). However, the code and flag has since been removed
and no _submit_bh() users remain.
These days, bio_flags are mostly used internally by the block layer to
track the state of bio's. As such, it doesn't really make sense for
filesystems to use them instead of op_flags when wanting special
behavior for block requests.
Therefore, remove _submit_bh() and trim the bio_flags argument from
submit_bh_wbc().
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
simple_fill_super() is passed an array of tree_descr structures which
describe the files to create in the filesystem's root directory. Since
these arrays are never modified intentionally, they should be 'const' so
that they are placed in .rodata and benefit from memory protection.
This patch updates the function signature and all users, and also
constifies tree_descr.name.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Previous AFFS patch fixed OFS write operations but unveiled
another bug: files greater than 4KB are being created with a wrong
size resulting in errors like the following:
dd if=/dev/zero of=file bs=4097 count=1
cp file /mnt/affs/
cp: error writing '/mnt/affs/file': Bad address
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
We called unconditionally affs_bread_ino() with create 0 resulting in
"error (device ...): get_block(): strange block request 0"
when trying to write on AFFS OFS format.
This patch adds create parameter to that function.
0 for affs_readpage_ofs()
1 for affs_write_begin_ofs()
Bug was found here:
https://bugzilla.kernel.org/show_bug.cgi?id=114961
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
node generation has to be stored on disk.
AFAICS we won't be able to manage it on AFFS.
This patch removes relevant check in affs_nfs_get_inode()
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Have that file in global include/linux is not needed.
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
AFFS symbolic links were broken since kernel 2.6.29
Problem was bisected to the following
commit ebd09abbd9 ("vfs: ensure page symlinks are NUL-terminated")
commit 035146851c ("vfs: introduce helper function to safely
NUL-terminate symlinks")
AFFS wasn't setting inode size when reading symbolic link from disk or
creating a new one. Result was zero allocation in pagecache.
ln -s file symlink
ls -lrt
file
symlink ->
This patch adds inode isize information on inode get and symbolic link
addition.
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
With the new statx() syscall, the following both allow the attributes of
the file attached to a file descriptor to be retrieved:
statx(dfd, NULL, 0, ...);
and:
statx(dfd, "", AT_EMPTY_PATH, ...);
Change the code to reject the first option, though this means copying
the path and engaging pathwalk for the fstat() equivalent. dfd can be a
non-directory provided path is "".
[ The timing of this isn't wonderful, but applying this now before we
have statx() in any released kernel, before anybody starts using the
NULL special case. - Linus ]
Fixes: a528d35e8b ("statx: Add a system call to make enhanced file info available")
Reported-by: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Eric Sandeen <sandeen@sandeen.net>
cc: fstests@vger.kernel.org
cc: linux-api@vger.kernel.org
cc: linux-man@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If we write zero bytes to this debugfs file, then it will cause an
underflow when we do copy_from_user(buf, ubuf, count - 1). Debugfs can
normally only be written to by root so the impact of this is low.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
When the computer is turned off, all the processes are killed and then
all the filesystems are umounted. OrangeFS should not wait for the
userspace daemon to come back in that case.
This only works for plain umount(2). To actually take advantage of this
interactively, `umount -f' is needed; otherwise umount will issue a
statfs first, which will wait for the userspace daemon to come back.
Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
It is not necessary to take the lock and search through the request list
if the list is empty.
Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
If the mount is aborted after userspace has been asked to mount,
userspace must be told to unmount.
Ordinarily orangefs_kill_sb does the unmount. However it cannot be
called if the superblock has not been set up. This is a very narrow
window.
The NULL fs_id is not unmounted.
Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
Let the server figure this out because our size might be out of date or
not present.
The bug was that
xfs_io -f -t -c "pread -v 0 100" /mnt/foo
echo "Test" > /mnt/foo
xfs_io -f -t -c "pread -v 0 100" /mnt/foo
fails because the second truncate did not happen if nothing had
requested the size after the write in echo. Thus i_size was zero (not
present) and the orangefs_setattr though i_size was zero and there was
nothing to do.
Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Cc: stable@vger.kernel.org
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
Fortunately OrangeFS has had a getattr request mask for a long time.
The server basically has two difficulty levels for attributes. Fetching
any attribute except size requires communicating with the metadata
server for that handle. Since all the attributes are right there, it
makes sense to return them all. Fetching the size requires
communicating with every I/O server (that the file is distributed
across). Therefore if asked for anything except size, get everything
except size, and if asked for size, get everything.
Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
They are clones of the ORANGEFS_ITERATE macros in use elsewhere. Delete
ORANGEFS_ITERATE_NEXT which is a hack previously used by readdir.
Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
This works by maintaining a linked list of pages which the directory
has been read into rather than one giant fixed-size buffer.
This replaces code which limits the total directory size to the total
amount that could be returned in one server request. Since filenames
are usually considerably shorter than the maximum, the old code could
usually handle several server requests before running out of space.
Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
This and the previous commit fix xfstests generic/257.
Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
In the past, readdir assumed that the user buffer will be large enough
that all entries from the server will fit. If this was not true,
entries would be skipped.
Since it works now, request 512 entries rather than 96 per server
operation.
Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
Since orangefs_lookup calls orangefs_iget which calls
orangefs_inode_getattr, getattr_time will get set.
Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Cc: stable@vger.kernel.org
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
Also don't check flags as this has been validated by the VFS already.
Fix an off-by-one error in the max size checking.
Stop logging just because userspace wants to write attributes which do
not fit.
This and the previous commit fix xfstests generic/020.
Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Cc: stable@vger.kernel.org
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
If the client receives a fatal server error from nfs_pageio_add_request(),
then we should always truncate the page on which the error occurred.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
EACCES, EDQUOT, EFBIG and ESTALE are all fatal errors as far as NFS
I/O is concerned. They need to be reported back to the application.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Currently when there are buffered writes that were not yet flushed and
they fall within allocated ranges of the file (that is, not in holes or
beyond eof assuming there are no prealloc extents beyond eof), btrfs
simply reports an incorrect number of used blocks through the stat(2)
system call (or any of its variants), regardless of mount options or
inode flags (compress, compress-force, nodatacow). This is because the
number of blocks used that is reported is based on the current number
of bytes in the vfs inode plus the number of dealloc bytes in the btrfs
inode. The later covers bytes that both fall within allocated regions
of the file and holes.
Example scenarios where the number of reported blocks is wrong while the
buffered writes are not flushed:
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt/sdc
$ xfs_io -f -c "pwrite -S 0xaa 0 64K" /mnt/sdc/foo1
wrote 65536/65536 bytes at offset 0
64 KiB, 16 ops; 0.0000 sec (259.336 MiB/sec and 66390.0415 ops/sec)
$ sync
$ xfs_io -c "pwrite -S 0xbb 0 64K" /mnt/sdc/foo1
wrote 65536/65536 bytes at offset 0
64 KiB, 16 ops; 0.0000 sec (192.308 MiB/sec and 49230.7692 ops/sec)
# The following should have reported 64K...
$ du -h /mnt/sdc/foo1
128K /mnt/sdc/foo1
$ sync
# After flushing the buffered write, it now reports the correct value.
$ du -h /mnt/sdc/foo1
64K /mnt/sdc/foo1
$ xfs_io -f -c "falloc -k 0 128K" -c "pwrite -S 0xaa 0 64K" /mnt/sdc/foo2
wrote 65536/65536 bytes at offset 0
64 KiB, 16 ops; 0.0000 sec (520.833 MiB/sec and 133333.3333 ops/sec)
$ sync
$ xfs_io -c "pwrite -S 0xbb 64K 64K" /mnt/sdc/foo2
wrote 65536/65536 bytes at offset 65536
64 KiB, 16 ops; 0.0000 sec (260.417 MiB/sec and 66666.6667 ops/sec)
# The following should have reported 128K...
$ du -h /mnt/sdc/foo2
192K /mnt/sdc/foo2
$ sync
# After flushing the buffered write, it now reports the correct value.
$ du -h /mnt/sdc/foo2
128K /mnt/sdc/foo2
So the number of used file blocks is simply incorrect, unlike in other
filesystems such as ext4 and xfs for example, but only while the buffered
writes are not flushed.
Fix this by tracking the number of delalloc bytes that fall within holes
and beyond eof of a file, and use instead this new counter when reporting
the number of used blocks for an inode.
Another different problem that exists is that the delalloc bytes counter
is reset when writeback starts (by clearing the EXTENT_DEALLOC flag from
the respective range in the inode's iotree) and the vfs inode's bytes
counter is only incremented when writeback finishes (through
insert_reserved_file_extent()). Therefore while writeback is ongoing we
simply report a wrong number of blocks used by an inode if the write
operation covers a range previously unallocated. While this change does
not fix this problem, it does minimizes it a lot by shortening that time
window, as the new dealloc bytes counter (new_delalloc_bytes) is only
decremented when writeback finishes right before updating the vfs inode's
bytes counter. Fully fixing this second problem is not trivial and will
be addressed later by a different patch.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Normally we don't have inline extents followed by regular extents, but
there's currently at least one harmless case where this happens. For
example, when the page size is 4Kb and compression is enabled:
$ mkfs.btrfs -f /dev/sdb
$ mount -o compress /dev/sdb /mnt
$ xfs_io -f -c "pwrite -S 0xaa 0 4K" -c "fsync" /mnt/foobar
$ xfs_io -c "pwrite -S 0xbb 8K 4K" -c "fsync" /mnt/foobar
In this case we get a compressed inline extent, representing 4Kb of
data, followed by a hole extent and then a regular data extent. The
inline extent was not expanded/converted to a regular extent exactly
because it represents 4Kb of data. This does not cause any apparent
problem (such as the issue solved by commit e1699d2d7b
("btrfs: add missing memset while reading compressed inline extents"))
except trigger an unexpected case in the incremental send code path
that makes us issue an operation to write a hole when it's not needed,
resulting in more writes at the receiver and wasting space at the
receiver.
So teach the incremental send code to deal with this particular case.
The issue can be currently triggered by running fstests btrfs/137 with
compression enabled (MOUNT_OPTIONS="-o compress" ./check btrfs/137).
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
If the call to btrfs_qgroup_reserve_data() failed, we were leaking an
extent map structure. The failure can happen either due to an -ENOMEM
condition or, when quotas are enabled, due to -EDQUOT for example.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[BUG]
If run_delalloc_range() returns error and there is already some ordered
extents created, btrfs will be hanged with the following backtrace:
Call Trace:
__schedule+0x2d4/0xae0
schedule+0x3d/0x90
btrfs_start_ordered_extent+0x160/0x200 [btrfs]
? wake_atomic_t_function+0x60/0x60
btrfs_run_ordered_extent_work+0x25/0x40 [btrfs]
btrfs_scrubparity_helper+0x1c1/0x620 [btrfs]
btrfs_flush_delalloc_helper+0xe/0x10 [btrfs]
process_one_work+0x2af/0x720
? process_one_work+0x22b/0x720
worker_thread+0x4b/0x4f0
kthread+0x10f/0x150
? process_one_work+0x720/0x720
? kthread_create_on_node+0x40/0x40
ret_from_fork+0x2e/0x40
[CAUSE]
|<------------------ delalloc range --------------------------->|
| OE 1 | OE 2 | ... | OE n |
|<>| |<---------- cleanup range --------->|
||
\_=> First page handled by end_extent_writepage() in __extent_writepage()
The problem is caused by error handler of run_delalloc_range(), which
doesn't handle any created ordered extents, leaving them waiting on
btrfs_finish_ordered_io() to finish.
However after run_delalloc_range() returns error, __extent_writepage()
won't submit bio, so btrfs_writepage_end_io_hook() won't be triggered
except the first page, and btrfs_finish_ordered_io() won't be triggered
for created ordered extents either.
So OE 2~n will hang forever, and if OE 1 is larger than one page, it
will also hang.
[FIX]
Introduce btrfs_cleanup_ordered_extents() function to cleanup created
ordered extents and finish them manually.
The function is based on existing
btrfs_endio_direct_write_update_ordered() function, and modify it to
act just like btrfs_writepage_endio_hook() but handles specified range
other than one page.
After fix, delalloc error will be handled like:
|<------------------ delalloc range --------------------------->|
| OE 1 | OE 2 | ... | OE n |
|<>|<-------- ----------->|<------ old error handler --------->|
|| ||
|| \_=> Cleaned up by cleanup_ordered_extents()
\_=> First page handled by end_extent_writepage() in __extent_writepage()
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
[BUG]
When btrfs_reloc_clone_csum() reports error, it can underflow metadata
and leads to kernel assertion on outstanding extents in
run_delalloc_nocow() and cow_file_range().
BTRFS info (device vdb5): relocating block group 12582912 flags data
BTRFS info (device vdb5): found 1 extents
assertion failed: inode->outstanding_extents >= num_extents, file: fs/btrfs//extent-tree.c, line: 5858
Currently, due to another bug blocking ordered extents, the bug is only
reproducible under certain block group layout and using error injection.
a) Create one data block group with one 4K extent in it.
To avoid the bug that hangs btrfs due to ordered extent which never
finishes
b) Make btrfs_reloc_clone_csum() always fail
c) Relocate that block group
[CAUSE]
run_delalloc_nocow() and cow_file_range() handles error from
btrfs_reloc_clone_csum() wrongly:
(The ascii chart shows a more generic case of this bug other than the
bug mentioned above)
|<------------------ delalloc range --------------------------->|
| OE 1 | OE 2 | ... | OE n |
|<----------- cleanup range --------------->|
|<----------- ----------->|
\/
btrfs_finish_ordered_io() range
So error handler, which calls extent_clear_unlock_delalloc() with
EXTENT_DELALLOC and EXTENT_DO_ACCOUNT bits, and btrfs_finish_ordered_io()
will both cover OE n, and free its metadata, causing metadata under flow.
[Fix]
The fix is to ensure after calling btrfs_add_ordered_extent(), we only
call error handler after increasing the iteration offset, so that
cleanup range won't cover any created ordered extent.
|<------------------ delalloc range --------------------------->|
| OE 1 | OE 2 | ... | OE n |
|<----------- ----------->|<---------- cleanup range --------->|
\/
btrfs_finish_ordered_io() range
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
The optimization for opaque dir create was wrongly being applied
also to non-dir create.
Fixes: 97c684cc91 ("ovl: create directories inside merged parent opaque")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: <stable@vger.kernel.org> # v4.10
A null check followed by a return is being performed already, so block
is always non-null at the second check on block, hence we can remove
this redundant null-check (Detected by PVS-Studio). Also re-work
comment to clean up a check-patch warning.
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
consider the sequence of commands:
mkdir -p /import/nfs /import/bind /import/etc
mount --bind / /import/bind
mount --make-private /import/bind
mount --bind /import/etc /import/bind/etc
exportfs -o rw,no_root_squash,crossmnt,async,no_subtree_check localhost:/
mount -o vers=4 localhost:/ /import/nfs
ls -l /import/nfs/etc
You would not expect this to report a stale file handle.
Yet it does.
The manipulations under /import/bind cause the dentry for
/etc to get the DCACHE_MOUNTED flag set, even though nothing
is mounted on /etc. This causes nfsd to call
nfsd_cross_mnt() even though there is no mountpoint. So an
upcall to mountd for "/etc" is performed.
The 'crossmnt' flag on the export of / causes mountd to
report that /etc is exported as it is a descendant of /. It
assumes the kernel wouldn't ask about something that wasn't
a mountpoint. The filehandle returned identifies the
filesystem and the inode number of /etc.
When this filehandle is presented to rpc.mountd, via
"nfsd.fh", the inode cannot be found associated with any
name in /etc/exports, or with any mountpoint listed by
getmntent(). So rpc.mountd says the filehandle doesn't
exist. Hence ESTALE.
This is fixed by teaching nfsd not to trust DCACHE_MOUNTED
too much. It is just a hint, not a guarantee.
Change nfsd_mountpoint() to return '1' for a certain mountpoint,
'2' for a possible mountpoint, and 0 otherwise.
Then change nfsd_crossmnt() to check if follow_down()
actually found a mountpount and, if not, to avoid performing
a lookup if the location is not known to certainly require
an export-point.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
kstrdup() already checks for NULL.
(Brought to our attention by Jason Yann noticing (from sparse output)
that it should have been declared static.)
Signed-off-by: NeilBrown <neilb@suse.com>
Reported-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
A client can append random data to the end of an NFSv2 or NFSv3 RPC call
without our complaining; we'll just stop parsing at the end of the
expected data and ignore the rest.
Encoded arguments and replies are stored together in an array of pages,
and if a call is too large it could leave inadequate space for the
reply. This is normally OK because NFS RPC's typically have either
short arguments and long replies (like READ) or long arguments and short
replies (like WRITE). But a client that sends an incorrectly long reply
can violate those assumptions. This was observed to cause crashes.
So, insist that the argument not be any longer than we expect.
Also, several operations increment rq_next_page in the decode routine
before checking the argument size, which can leave rq_next_page pointing
well past the end of the page array, causing trouble later in
svc_free_pages.
As followup we may also want to rewrite the encoding routines to check
more carefully that they aren't running off the end of the page array.
Reported-by: Tuomas Haanpää <thaan@synopsys.com>
Reported-by: Ari Kauppi <ari@synopsys.com>
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
This patch start to enable 4K granularity small discard by default
when realtime discard is on, so, in seriously fragmented space,
small size discard can be issued in time to avoid useless storage
space occupying of invalid filesystem's data, then performance of
flash storage can be recovered.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
It's better to delay awaking discard thread while queuing discard commands
in checkpoint, it will help to give more chances for merging big and small
discard.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch seperate nat page read io from nat_tree_lock.
-lock_page
-get_node_info()
-current_nat_addr
...... -> write_checkpoint
-get_meta_page
Because we lock node page, we can make sure no other threads
modify this nid concurrently. So we just obtain current_nat_addr
under nat_tree_lock, node info is always same in both nat pack.
Signed-off-by: Yunlei He <heyunlei@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Commit 88c5c13a50 (f2fs: fix multiple f2fs_add_link() calls having
same name) does not cover the scenario where inline dentry is enabled.
In that case, F2FS_I(dir)->task will be NULL, and __f2fs_add_link will
lookup dentries one more time.
This patch fixes it by moving the assigment of current task to a upper
level to cover both normal and inline dentry.
Cc: <stable@vger.kernel.org>
Fixes: 88c5c13a50 (f2fs: fix multiple f2fs_add_link() calls having same name)
Signed-off-by: Sheng Yong <shengyong1@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The NFSv2/v3 code does not systematically check whether we decode past
the end of the buffer. This generally appears to be harmless, but there
are a few places where we do arithmetic on the pointers involved and
don't account for the possibility that a length could be negative. Add
checks to catch these.
Reported-by: Tuomas Haanpää <thaan@synopsys.com>
Reported-by: Ari Kauppi <ari@synopsys.com>
Reviewed-by: NeilBrown <neilb@suse.com>
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
A client can append random data to the end of an NFSv2 or NFSv3 RPC call
without our complaining; we'll just stop parsing at the end of the
expected data and ignore the rest.
Encoded arguments and replies are stored together in an array of pages,
and if a call is too large it could leave inadequate space for the
reply. This is normally OK because NFS RPC's typically have either
short arguments and long replies (like READ) or long arguments and short
replies (like WRITE). But a client that sends an incorrectly long reply
can violate those assumptions. This was observed to cause crashes.
Also, several operations increment rq_next_page in the decode routine
before checking the argument size, which can leave rq_next_page pointing
well past the end of the page array, causing trouble later in
svc_free_pages.
So, following a suggestion from Neil Brown, add a central check to
enforce our expectation that no NFSv2/v3 call has both a large call and
a large reply.
As followup we may also want to rewrite the encoding routines to check
more carefully that they aren't running off the end of the page array.
We may also consider rejecting calls that have any extra garbage
appended. That would be safer, and within our rights by spec, but given
the age of our server and the NFS protocol, and the fact that we've
never enforced this before, we may need to balance that against the
possibility of breaking some oddball client.
Reported-by: Tuomas Haanpää <thaan@synopsys.com>
Reported-by: Ari Kauppi <ari@synopsys.com>
Cc: stable@vger.kernel.org
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Now that all the producers and consumers of dax interfaces have been
converted to using dax_operations on a dax_device, remove the block
device direct_access enabling.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Now that a dax_device is plumbed through all dax-capable drivers we can
switch from block_device_operations to dax_operations for invoking
->direct_access.
This also lets us kill off some usages of struct blk_dax_ctl on the way
to its eventual removal.
Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
commit d1a5f2b4d8 ("block: use DAX for partition table reads") was
part of a stalled effort to allow dax mappings of block devices. Since
then the device-dax mechanism has filled the role of dax-mapping static
device ranges.
Now that we are moving ->direct_access() from a block_device operation
to a dax_inode operation we would need block devices to map and carry
their own dax_inode reference.
Unless / until we decide to revive dax mapping of raw block devices
through the dax_inode scheme, there is no need to carry
read_dax_sector(). Its removal in turn allows for the removal of
bdev_direct_access() and should have been included in commit
2237570168 ("block_dev: remove DAX leftovers").
Cc: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
In preparation for converting fs/dax.c to use dax_direct_access()
instead of bdev_direct_access(), add the plumbing to retrieve the
dax_device associated with a given block_device.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
If the server has already returned a fatal write error that the user
has not yet received on this file, then don't write back the other pages.
Instead, act as if they have been sent, and have returned with the same
error.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
The assumption should be that if the caller returns PNFS_ATTEMPTED, then hdr
has been consumed, and so we should not be testing hdr->task.tk_status.
If the caller returns PNFS_TRY_AGAIN, then we need to recoalesce and
free hdr.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
ceph_set_acl() calls __ceph_setattr() if the setacl operation needs
to modify inode's i_mode. __ceph_setattr() updates inode's i_mode,
then calls posix_acl_chmod().
The problem is that __ceph_setattr() calls posix_acl_chmod() before
sending the setattr request. The get_acl() call in posix_acl_chmod()
can trigger a getxattr request. The reply of the getxattr request
can restore inode's i_mode to its old value. The set_acl() call in
posix_acl_chmod() sees old value of inode's i_mode, so it calls
__ceph_setattr() again.
Cc: stable@vger.kernel.org # needs backporting for < 4.9
Link: http://tracker.ceph.com/issues/19688
Reported-by: Jerry Lee <leisurelysw24@gmail.com>
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Tested-by: Luis Henriques <lhenriques@suse.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Use ASSERTs on the log intent item refcounts so that we fail noisily if
anyone tries to double-free the item.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
The quotacheck error handling of the delwri buffer list assumes the
resident buffers are locked and doesn't clear the _XBF_DELWRI_Q flag
on the buffers that are dequeued. This can lead to assert failures
on buffer release and possibly other locking problems.
Move this code to a delwri queue cancel helper function to
encapsulate the logic required to properly release buffers from a
delwri queue. Update the helper to clear the delwri queue flag and
call it from quotacheck.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
xfs_iflush_done uses an on-stack variable length array to pass the log
items to be deleted to xfs_trans_ail_delete_bulk. On-stack VLAs are a
nasty gcc extension that can lead to unbounded stack allocations, but
fortunately we can easily avoid them by simply open coding
xfs_trans_ail_delete_bulk in xfs_iflush_done, which is the only caller
of it except for the single-item xfs_trans_ail_delete.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
At the end of a getfsmap call, we will set FMR_OF_LAST in the last
struct fsmap that was handed in by userspace if we've truly run out of
space mapping record (as opposed to simply running out of space in the
user array). Unfortunately, fmh_entries is the wrong check for whether
or not we've filled out anything in the user array because the ioctl
provides that fmh_count==0 sets fmh_entries without filling out the user
array. Therefore we end up writing things into user memory areas that we
weren't given, and kaboom.
Since Christoph amended the getfsmap structure to track the number of
fsmap entries we've actually filled out, use that as part of deciding if
we have to set the OF_LAST flag.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
By passing the whole fsmap_head structure and an index we can get the
user point annotations right for the embedded variable sized array
in struct fsmap_head.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
[darrick: change idx to unsigned int]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
At least if we want to be able to recognize the pattern. Add a missing
byte swap to the corruption injection case in xlog_sync.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Found by sparse.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Found by sparse.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
XFS only supports the unwritten extent bit in the data fork, and only if
the file system has a version 5 superblock or the unwritten extent
feature bit.
We currently have two routines that validate the invariant:
xfs_check_nostate_extents which return -EFSCORRUPTED when it's not met,
and xfs_validate_extent that triggers and assert in debug build.
Both of them iterate over all extents of an inode fork when called,
which isn't very efficient.
This patch instead adds a new helper that verifies the invariant one
extent at a time, and calls it from the places where we iterate over
all extents to converted them from or two the in-memory format. The
callers then return -EFSCORRUPTED when reading invalid extents from
disk, or trigger an assert when writing them to disk.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
We only ever use the normal and unwritten states. And the actual
ondisk format (this enum isn't despite being in xfs_format.h) only
has space for the unwritten bit anyway.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
On some architectures do_div does the pointer compare
trick to make sure that we've sent it an unsigned 64-bit
number. (Why unsigned? I don't know.)
Fix up the few places that squawk about this; in
xfs_bmap_wants_extents() we just used a bare int64_t so change
that to unsigned.
In xfs_adjust_extent_unmap_boundaries() all we wanted was the
mod, and we have an xfs-specific function to handle that w/o
side effects, which includes proper casting for do_div.
In xfs_daddr_to_ag[b]no, we were using the wrong type anyway;
XFS_BB_TO_FSBT returns a block in the filesystem, so use
xfs_rfsblock_t not xfs_daddr_t, and gain the unsignedness
from that type as a bonus.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The kbuild test robot caught this; in debug code we have another
caller of do_div with a 32-bit dividend (j) which is caught now
that we are using the kernel-supplied do_div.
None of the values used here are 64-bit; just use simple division.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The trailing newlines wil lead to extra newlines in the trace file
which looks like the following output, so remove them.
>kworker/4:1H-1508 [004] .... 47879.101608: xfs_discard_extent: dev 8:0
>
>kworker/u16:2-238 [004] .... 47879.101725: xfs_extent_busy_clear: dev 8:0
Signed-off-by: Hou Tao <houtao1@huawei.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
[darrick: fix the getfsmap tracepoints too]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Directory block readahead uses a complex iteration mechanism to map
between high-level directory blocks and underlying physical extents.
This mechanism attempts to traverse the higher-level dir blocks in a
manner that handles multi-fsb directory blocks and simultaneously
maintains a reference to the corresponding physical blocks.
This logic doesn't handle certain (discontiguous) physical extent
layouts correctly with multi-fsb directory blocks. For example,
consider the case of a 4k FSB filesystem with a 2 FSB (8k) directory
block size and a directory with the following extent layout:
EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL
0: [0..7]: 88..95 0 (88..95) 8
1: [8..15]: 80..87 0 (80..87) 8
2: [16..39]: 168..191 0 (168..191) 24
3: [40..63]: 5242952..5242975 1 (72..95) 24
Directory block 0 spans physical extents 0 and 1, dirblk 1 lies
entirely within extent 2 and dirblk 2 spans extents 2 and 3. Because
extent 2 is larger than the directory block size, the readahead code
erroneously assumes the block is contiguous and issues a readahead
based on the physical mapping of the first fsb of the dirblk. This
results in read verifier failure and a spurious corruption or crc
failure, depending on the filesystem format.
Further, the subsequent readahead code responsible for walking
through the physical table doesn't correctly advance the physical
block reference for dirblk 2. Instead of advancing two physical
filesystem blocks, the first iteration of the loop advances 1 block
(correctly), but the subsequent iteration advances 2 more physical
blocks because the next physical extent (extent 3, above) happens to
cover more than dirblk 2. At this point, the higher-level directory
block walking is completely off the rails of the actual physical
layout of the directory for the respective mapping table.
Update the contiguous dirblock logic to consider the current offset
in the physical extent to avoid issuing directory readahead to
unrelated blocks. Also, update the mapping table advancing code to
consider the current offset within the current dirblock to avoid
advancing the mapping reference too far beyond the dirblock.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Carlos had a case where "find" seemed to start spinning
forever and never return.
This was on a filesystem with non-default multi-fsb (8k)
directory blocks, and a fragmented directory with extents
like this:
0:[0,133646,2,0]
1:[2,195888,1,0]
2:[3,195890,1,0]
3:[4,195892,1,0]
4:[5,195894,1,0]
5:[6,195896,1,0]
6:[7,195898,1,0]
7:[8,195900,1,0]
8:[9,195902,1,0]
9:[10,195908,1,0]
10:[11,195910,1,0]
11:[12,195912,1,0]
12:[13,195914,1,0]
...
i.e. the first extent is a contiguous 2-fsb dir block, but
after that it is fragmented into 1 block extents.
At the top of the readdir path, we allocate a mapping array
which (for this filesystem geometry) can hold 10 extents; see
the assignment to map_info->map_size. During readdir, we are
therefore able to map extents 0 through 9 above into the array
for readahead purposes. If we count by 2, we see that the last
mapped index (9) is the first block of a 2-fsb directory block.
At the end of xfs_dir2_leaf_readbuf() we have 2 loops to fill
more readahead; the outer loop assumes one full dir block is
processed each loop iteration, and an inner loop that ensures
that this is so by advancing to the next extent until a full
directory block is mapped.
The problem is that this inner loop may step past the last
extent in the mapping array as it tries to reach the end of
the directory block. This will read garbage for the extent
length, and as a result the loop control variable 'j' may
become corrupted and never fail the loop conditional.
The number of valid mappings we have in our array is stored
in map->map_valid, so stop this inner loop based on that limit.
There is an ASSERT at the top of the outer loop for this
same condition, but we never made it out of the inner loop,
so the ASSERT never fired.
Huge appreciation for Carlos for debugging and isolating
the problem.
Debugged-and-analyzed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Tested-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
On a ppc64 machine executing overlayfs/019 with xfs as the lower and
upper filesystem causes the following call trace,
WARNING: CPU: 2 PID: 8034 at /root/repos/linux/fs/iomap.c:765 .iomap_dio_actor+0xcc/0x420
Modules linked in:
CPU: 2 PID: 8034 Comm: fsstress Tainted: G L 4.11.0-rc5-next-20170405 #100
task: c000000631314880 task.stack: c0000003915d4000
NIP: c00000000035a72c LR: c00000000035a6f4 CTR: c00000000035a660
REGS: c0000003915d7570 TRAP: 0700 Tainted: G L (4.11.0-rc5-next-20170405)
MSR: 800000000282b032 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI>
CR: 24004284 XER: 00000000
CFAR: c0000000006f7190 SOFTE: 1
GPR00: c00000000035a6f4 c0000003915d77f0 c0000000015a3f00 000000007c22f600
GPR04: 000000000022d000 0000000000002600 c0000003b2d56360 c0000003915d7960
GPR08: c0000003915d7cd0 0000000000000002 0000000000002600 c000000000521cc0
GPR12: 0000000024004284 c00000000fd80a00 000000004b04ae64 ffffffffffffffff
GPR16: 000000001000ca70 0000000000000000 c0000003b2d56380 c00000000153d2b8
GPR20: 0000000000000010 c0000003bc87bac8 0000000000223000 000000000022f5ff
GPR24: c0000003b2d56360 000000000000000c 0000000000002600 000000000022d000
GPR28: 0000000000000000 c0000003915d7960 c0000003b2d56360 00000000000001ff
NIP [c00000000035a72c] .iomap_dio_actor+0xcc/0x420
LR [c00000000035a6f4] .iomap_dio_actor+0x94/0x420
Call Trace:
[c0000003915d77f0] [c00000000035a6f4] .iomap_dio_actor+0x94/0x420 (unreliable)
[c0000003915d78f0] [c00000000035b9f4] .iomap_apply+0xf4/0x1f0
[c0000003915d79d0] [c00000000035c320] .iomap_dio_rw+0x230/0x420
[c0000003915d7ae0] [c000000000512a14] .xfs_file_dio_aio_read+0x84/0x160
[c0000003915d7b80] [c000000000512d24] .xfs_file_read_iter+0x104/0x130
[c0000003915d7c10] [c0000000002d6234] .__vfs_read+0x114/0x1a0
[c0000003915d7cf0] [c0000000002d7a8c] .vfs_read+0xac/0x1a0
[c0000003915d7d90] [c0000000002d96b8] .SyS_read+0x58/0x100
[c0000003915d7e30] [c00000000000b8e0] system_call+0x38/0xfc
Instruction dump:
78630020 7f831b78 7ffc07b4 7c7ce039 40820360 a13d0018 2f890003 419e0288
2f890004 419e00a0 2f890001 419e02a8 <0fe00000> 3b80fffb 38210100 7f83e378
The above problem can also be recreated on a regular xfs filesystem
using the command,
$ fsstress -d /mnt -l 1000 -n 1000 -p 1000
The reason for the call trace is,
1. When 'reserving' blocks for delayed allocation , XFS reserves more
blocks (i.e. past file's current EOF) than required. This is done
because XFS assumes that userspace might write more data and hence
'reserving' more blocks might lead to the file's new data being
stored contiguously on disk.
2. The in-memory 'struct xfs_bmbt_irec' mapping the file's last extent would
then cover the prealloc-ed EOF blocks in addition to the regular blocks.
3. When flushing the dirty blocks to disk, we only flush data till the
file's EOF. But before writing out the dirty data, we allocate blocks
on the disk for holding the file's new data. This allocation includes
the blocks that are part of the 'prealloc EOF blocks'.
4. Later, when the last reference to the inode is being closed, XFS frees the
unused 'prealloc EOF blocks' in xfs_inactive().
In step 3 above, When allocating space on disk for the delayed allocation
range, the space allocator might sometimes allocate less blocks than
required. If such an allocation ends right at the current EOF of the
file, We will not be able to clear the "delayed allocation" flag for the
'prealloc EOF blocks', since we won't have dirty buffer heads associated
with that range of the file.
In such a situation if a Direct I/O read operation is performed on file
range [X, Y] (where X < EOF and Y > EOF), we flush dirty data in the
range [X, Y] and invalidate page cache for that range (Refer to
iomap_dio_rw()). Later for performing the Direct I/O read, XFS obtains
the extent items (which are still cached in memory) for the file
range. When doing so we are not supposed to get an extent item with
IOMAP_DELALLOC flag set, since the previous "flush" operation should
have converted any delayed allocation data in the range [X, Y]. Hence we
end up hitting a WARN_ON_ONCE(1) statement in iomap_dio_actor().
This commit fixes the bug by preventing the read operation from going
beyond iomap_dio->i_size.
Reported-by: Santhosh G <santhog4@linux.vnet.ibm.com>
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Now that reflink operations don't set the firstblock value we don't
need the workarounds for non-NULL firstblock values without a prior
allocation.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The main thing that xfs_bmap_remap_alloc does is fixing the AGFL, similar
to what we do in the space allocator. But the reflink code doesn't touch
the allocation btree unlike the normal space allocator, so we couldn't
care less about the state of the AGFL.
So remove xfs_bmap_remap_alloc and just handle the di_nblocks update in
the caller.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Add a new helper to be used for reflink extent list additions instead of
funneling them through xfs_bmapi_write and overloading the firstblock
member in struct xfs_bmalloca and struct xfs_alloc_args.
With some small changes to xfs_bmap_remap_alloc this also means we do
not need a xfs_bmalloca structure for this case at all.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
For the reflink case we'd much rather pass the required arguments than
faking up a struct xfs_bmalloca.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
We never do COW operations for the attr fork, so don't pretend we handle
them.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
bno should be a xfs_fsblock_t, which is 64-bit wides instead of a
xfs_aglock_t, which truncates the value to 32 bits.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
If we have a layout segment cached in pgio->pg_lseg, we should check it
for validity before reusing it in a new RPC request. Otherwise, if we
recoalesce, we can end up looping forever.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
When delivering an event to userspace for a file on an NFS share,
if the file is deleted on server side before user reads the event,
user will not get the event.
If the event queue contained several events, the stale event is
quietly dropped and read() returns to user with events read so far
in the buffer.
If the event queue contains a single stale event or if the stale
event is a permission event, read() returns to user with the kernel
internal error code 518 (EOPENSTALE), which is not a POSIX error code.
Check the internal return value -EOPENSTALE in fanotify_read(), just
the same as it is checked in path_openat() and drop the event in the
cases that it is not already dropped.
This is a reproducer from Marko Rauhamaa:
Just take the example program listed under "man fanotify" ("fantest")
and follow these steps:
==============================================================
NFS Server NFS Client(1) NFS Client(2)
==============================================================
# echo foo >/nfsshare/bar.txt
# cat /nfsshare/bar.txt
foo
# ./fantest /nfsshare
Press enter key to terminate.
Listening for events.
# rm -f /nfsshare/bar.txt
# cat /nfsshare/bar.txt
read: Unknown error 518
cat: /nfsshare/bar.txt: Operation not permitted
==============================================================
where NFS Client (1) and (2) are two terminal sessions on a single NFS
Client machine.
Reported-by: Marko Rauhamaa <marko.rauhamaa@f-secure.com>
Tested-by: Marko Rauhamaa <marko.rauhamaa@f-secure.com>
Cc: <linux-api@vger.kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Async request may be throttled in block layer, so page for async may keep WRITE_BACK
for a long time.
For encrytped inode, we need wait on page writeback no matter if the device supports
BDI_CAP_STABLE_WRITES. This may result in a higher waiting page writeback time for
async encrypted inode page.
This patch skips IPU for encrypted inode's updating write.
Signed-off-by: Hou Pengyang <houpengyang@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch also reverts d0db7703ac ("f2fs: do SSR in higher priority").
This patch fixes out of free segments caused by many small file creation by
1) mkfs -s 1 2G
2) mount
3) untar
- preoduce 60000 small files burstly
4) sync
- flush node pages
- flush imeta
Here, when we do f2fs_balance_fs, we missed # of imeta blocks, resulting in
skipping to check has_not_enough_free_secs.
Another test is done by
1) mkfs -s 12 2G
2) mount
3) untar
- preoduce 60000 small files burstly
4) sync
- flush node pages
- flush imeta
In this case, this patch also fixes wrong block allocation under large section
size.
Reported-by: William Brana <wbrana@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
With a recent addition of f2fs_lookup_extent_tree(), we get a warning about
the use of empty macros:
fs/f2fs/extent_cache.c: In function 'f2fs_lookup_extent_tree':
fs/f2fs/extent_cache.c:358:32: error: suggest braces around empty body in an 'else' statement [-Werror=empty-body]
stat_inc_rbtree_node_hit(sbi);
A good way to avoid the warning and make the code more robust is to define
all no-op macros as 'do { } while (0)'.
Fixes: 54c2258cd6 ("f2fs: extract rb-tree operation infrastructure")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reivewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch adds an ioctl to flush data in faster device to cold area. User can
give device number and number of segments to move. It doesn't move it if there
is only one device.
The parameter looks like:
struct f2fs_flush_device {
u32 dev_num; /* device number to flush */
u32 segments; /* # of segments to flush */
};
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Improve comments in ext4_quota_{on|off}() to explain that returning
success despite ext4_journal_start() failing is deliberate.
Signed-off-by: Jan Kara <jack@suse.cz>
We recently shifted this code around, so we're no longer holding the
lock on this path.
Fixes: 755b5bc681 ("fsnotify: Remove indirection from mark list addition")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Use temporary mapping for memory copying operations.
To avoid any sleeping problem,
mark_inode_dirty(inode) was moved after kunmap() in
udf_adinicb_readpage()
down_write(&iinfo->i_data_sem) set before kmap_atomic()
in udf_expand_file_adinicb()
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Jan Kara <jack@suse.cz>
According to commit f90774e1fd ("checkpatch: look for symbolic
permissions and suggest octal instead")
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Jan Kara <jack@suse.cz>
- More O_TMPFILE fallout
- RENAME_WHITEOUT regression due to a mis-merge
- Memory leak in ubifs_mknod()
- Power-cut problem in UBI's update volume feature
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABAgAGBQJY/HTyAAoJEEtJtSqsAOnW7bYP/jYNCRKUls/FhWcF02h/tuPz
gYemObWxoEvPqy8laNwyYEkzct4ADsz4C2RCGxygJks2qYCzuXy+IQd/gQlJNIhM
B9sNa1se1ntt71vwdwgRUgVRMGjknJ6rY8ZP2actuuzztdJE9fGNKhEarMVyQQ61
f7+AvmFc0CHTEBi5hcHX/Rtdqvzt4OOJl5pLo+VT2RRegxb3oHgl1UTD+gqkdkMg
WOxvdriSQPi8HHz1clCN36lG2H05sKLd7YEhSWKf6Pt0YYJmF8ckxLERzRclpV4D
XXqUEwhrovJC+oRQbXS+DaJfjsML9y8GB28izC5c+H8/teGKv2n5NXtjsyWNCfnx
gmZT1S24TnTsofbmd1GoemMMD4Vlt9EDq2+QVFCaPgK9eWkDQ/xlhjrfPMsuz+YY
uCWgyzsJyvRewiOlkN1jcTua6f1nHt0SP7alNy0XUQF08jwYMaaf5IJAbEHP8Ink
seIXNObiGgHl+bBsTQ4SU9PUIJIKIar0TSQf8HXV+/Go3XDz4ZTSjc1KVXs5Hg97
irUPllIKmjb5Doiln12A5zePmgT2uaChFG21+6SwEWkFYrD3XIxBORxkEF6E7FJU
asRk2oBzNYezz+qpjn3NeZnIV7Uc9B8bBZpqFrsys2KRAJIZRtdhpAF+jIbrS+7N
JrakPr2I/FKnsNU/0mZh
=ER2q
-----END PGP SIGNATURE-----
Merge tag 'upstream-4.11-rc7' of git://git.infradead.org/linux-ubifs
Pull UBI/UBIFS fixes from Richard Weinberger:
"This contains fixes for issues in both UBI and UBIFS:
- more O_TMPFILE fallout
- RENAME_WHITEOUT regression due to a mis-merge
- memory leak in ubifs_mknod()
- power-cut problem in UBI's update volume feature"
* tag 'upstream-4.11-rc7' of git://git.infradead.org/linux-ubifs:
ubifs: Fix O_TMPFILE corner case in ubifs_link()
ubifs: Fix RENAME_WHITEOUT support
ubifs: Fix debug messages for an invalid filename in ubifs_dump_inode
ubifs: Fix debug messages for an invalid filename in ubifs_dump_node
ubifs: Remove filename from debug messages in ubifs_readdir
ubifs: Fix memory leak in error path in ubifs_mknod
ubi/upd: Always flush after prepared for an update
Both conflict were simple overlapping changes.
In the kaweth case, Eric Dumazet's skb_cow() bug fix overlapped the
conversion of the driver in net-next to use in-netdev stats.
Signed-off-by: David S. Miller <davem@davemloft.net>
unsupported NFSv4 compound op.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJY+mlFAAoJECebzXlCjuG+eCMP/2r/lXdx91Kibg4/D/9OB+LW
YmcyUH83iRq/F/V88yEMz5XMUTZ/UkZ3iSGSDObjhBK7oswAtZhC1KiWOiuI9ry4
tcqZM3se7FQNpbwm2wvhtRmbjd2YOZfS3o32Uorhc/beMZzIzZhUb15pp5o3hOIy
4aoxeUs6//B51fil4Ur0oeLI1EqUd3t+1z0ytyRRzSfORHF/QEiTTXSmmKK+K2XA
t35XXYJE3SCLxQpWW/CHT/IODv6nsaM0rgN/kkmZUxAyaw2GXT7U6+mzkz3tSL8s
AEqNUptCn2M3iAv/e6wgpO3BS49HR+Gdg+JxPilCcs+GniWu4zH3sE0HFjTuqQ23
7u/NB4lkDqFs9pO1x4D5Y/69yXbs6jUHn+NMnhTZS85pTwU5o3RNhzwegoOfIp/3
CY22R92ZLHEqb1/VTc7rpeuj3d9r7Swg3U8X7kDnn7XtAj2HlMqaHZfw2zjTIYBO
wqKnikTDMHuzZ7MtegLld4oayNZO4hN7Vi01Npzgr3nkwvpFpz8Jbvsn8mMHvUq8
ZYWoMfz5rbH20N7cnFUAtB5fm67U9smjb01uzx3m6ylh1SqTWxAoZPWluGJTv0IA
JKEM4g4PSNOizhib6KI2PW0SngLkCWKLy463fDS2Mx/M7DBZf0j4697TDK21R5NB
Pny/pwJT4mjLwePc/qt2
=mSBF
-----END PGP SIGNATURE-----
Merge tag 'nfsd-4.11-2' of git://linux-nfs.org/~bfields/linux
Pull nfsd bugfix from Bruce Fields:
"Fix a 4.11 regression that triggers a BUG() on an attempt to use an
unsupported NFSv4 compound op"
* tag 'nfsd-4.11-2' of git://linux-nfs.org/~bfields/linux:
nfsd: fix oops on unsupported operation
Commit 25520d55cd ("block: Inline blk_integrity in struct gendisk")
introduced blk_integrity_revalidate(), which seems to assume ownership
of the stable pages flag and unilaterally clears it if no blk_integrity
profile is registered:
if (bi->profile)
disk->queue->backing_dev_info->capabilities |=
BDI_CAP_STABLE_WRITES;
else
disk->queue->backing_dev_info->capabilities &=
~BDI_CAP_STABLE_WRITES;
It's called from revalidate_disk() and rescan_partitions(), making it
impossible to enable stable pages for drivers that support partitions
and don't use blk_integrity: while the call in revalidate_disk() can be
trivially worked around (see zram, which doesn't support partitions and
hence gets away with zram_revalidate_disk()), rescan_partitions() can
be triggered from userspace at any time. This breaks rbd, where the
ceph messenger is responsible for generating/verifying CRCs.
Since blk_integrity_{un,}register() "must" be used for (un)registering
the integrity profile with the block layer, move BDI_CAP_STABLE_WRITES
setting there. This way drivers that call blk_integrity_register() and
use integrity infrastructure won't interfere with drivers that don't
but still want stable pages.
Fixes: 25520d55cd ("block: Inline blk_integrity in struct gendisk")
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # 4.4+, needs backporting
Tested-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
new flag: LOOKUP_DOWN. If the starting point is overmounted, cross
into whatever's mounted on top, triggering referrals et.al.
Use that instead of follow_down_one() loop in mntns_install(), handle
errors properly.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
NFS attempts to wait for read and write completion before unlocking in
order to ensure that the data returned was protected by the lock. When
this waiting is interrupted by a signal, the unlock may be skipped, and
messages similar to the following are seen in the kernel ring buffer:
[20.167876] Leaked locks on dev=0x0:0x2b ino=0x8dd4c3:
[20.168286] POSIX: fl_owner=ffff880078b06940 fl_flags=0x1 fl_type=0x0 fl_pid=20183
[20.168727] POSIX: fl_owner=ffff880078b06680 fl_flags=0x1 fl_type=0x0 fl_pid=20185
For NFSv3, the missing unlock will cause the server to refuse conflicting
locks indefinitely. For NFSv4, the leftover lock will be removed by the
server after the lease timeout.
This patch fixes this issue by skipping the usual wait in
nfs_iocounter_wait if the FL_CLOSE flag is set when signaled. Instead, the
wait happens in the unlock RPC task on the NFS UOC rpc_waitqueue.
For NFSv3, use lockd's new nlmclnt_operations along with
nfs_async_iocounter_wait to defer NLM's unlock task until the lock
context's iocounter reaches zero.
For NFSv4, call nfs_async_iocounter_wait() directly from unlock's
current rpc_call_prepare.
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
NFS would enjoy the ability to modify the behavior of the NLM client's
unlock RPC task in order to delay the transmission of the unlock until IO
that was submitted under that lock has completed. This ability can ensure
that the NLM client will always complete the transmission of an unlock even
if the waiting caller has been interrupted with fatal signal.
For this purpose, a pointer to a struct nlmclnt_operations can be assigned
in a nfs_module's nfs_rpc_ops that will install those nlmclnt_operations on
the nlm_host. The struct nlmclnt_operations defines three callback
operations that will be used in a following patch:
nlmclnt_alloc_call - used to call back after a successful allocation of
a struct nlm_rqst in nlmclnt_proc().
nlmclnt_unlock_prepare - used to call back during NLM unlock's
rpc_call_prepare. The NLM client defers calling rpc_call_start()
until this callback returns false.
nlmclnt_release_call - used to call back when the NLM client's struct
nlm_rqst is freed.
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
By sleeping on a new NFS Unlock-On-Close waitqueue, rpc tasks may wait for
a lock context's iocounter to reach zero. The rpc waitqueue is only woken
when the open_context has the NFS_CONTEXT_UNLOCK flag set in order to
mitigate spurious wake-ups for any iocounter reaching zero.
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Set FL_CLOSE in fl_flags as in locks_remove_posix() when clearing locks.
NFS will check for this flag to ensure an unlock is sent in a following
patch.
Fuse handles flock and posix locks differently for FL_CLOSE, and so
requires a fixup to retain the existing behavior for flock.
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
We only need to check lock exclusive/shared types against open mode when
flock() is used on NFS, so move it into the flock-specific path instead of
checking it for all locks.
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
flock64_to_posix_lock() is already doing this check
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Reviewed-by: Jeff Layton <jeff.layton@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
The objlayout code has been in the tree, but it's been unmaintained and
no server product for it actually ever shipped.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Replace bdev_direct_access() with dax_direct_access() that uses
dax_device and dax_operations instead of a block_device and
block_device_operations for dax. Once all consumers of the old api have
been converted bdev_direct_access() will be deleted.
Given that block device partitioning decisions can cause dax page
alignment constraints to be violated this also introduces the
bdev_dax_pgoff() helper. It handles calculating a logical pgoff relative
to the dax_device and also checks for page alignment.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
The check in nfs4_ff_layout_prepare_ds() seems to be missing.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Fixes: a33e4b036d ("pNFS: return status from nfs4_pnfs_ds_connect")
Cc: Weston Andros Adamson <dros@primarydata.com>
Cc: stable@vger.kernel.org # v4.11
This passes on the scsi_cmnd result field to users of passthrough
requests. Currently we abuse req->errors for this purpose, but that
field will go away in its current form.
Note that the old IDE code abuses the errors field in very creative
ways and stores all kinds of different values in it. I didn't dare
to touch this magic, so the abuses are brought forward 1:1.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
The function only returns -EIO if rq->errors is non-zero, which is not
very useful and lets a large number of callers ignore the return value.
Just let the callers figure out their error themselves.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
If the server fails to return the attributes as part of an OPEN
reply, and then reboots, we can end up hanging. The reason is that
the client attempts to send a GETATTR in order to pick up the
missing OPEN call, but fails to release the slot first, causing
reboot recovery to deadlock.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Fixes: 2e80dbe7ac ("NFSv4.1: Close callback races for OPEN, LAYOUTGET...")
Cc: stable@vger.kernel.org # v4.8+
Drop 'parent' argument of bdi_register() and bdi_register_va(). It is
always NULL.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
Now that all bdi structures filesystems use are properly refcounted, we
can remove the SB_I_DYNBDI flag.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
Allocate struct backing_dev_info separately instead of embedding it
inside the superblock. This unifies handling of bdi among users.
CC: Richard Weinberger <richard@nod.at>
CC: Artem Bityutskiy <dedekind1@gmail.com>
CC: Adrian Hunter <adrian.hunter@intel.com>
CC: linux-mtd@lists.infradead.org
Acked-by: Richard Weinberger <richard@nod.at>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
Allocate struct backing_dev_info separately instead of embedding it
inside the superblock. This unifies handling of bdi among users.
CC: Anna Schumaker <anna.schumaker@netapp.com>
CC: linux-nfs@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Acked-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Allocate struct backing_dev_info separately instead of embedding it
inside the superblock. This unifies handling of bdi among users.
CC: Petr Vandrovec <petr@vandrovec.name>
Acked-by: Petr Vandrovec <petr@vandrovec.name>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
Similarly to set_bdev_super() NILFS2 just used block device reference to
bdi. Convert it to properly getting bdi reference. The reference will
get automatically dropped on superblock destruction.
CC: linux-nilfs@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Jens Axboe <axboe@fb.com>
Similarly to set_bdev_super() GFS2 just used block device reference to
bdi. Convert it to properly getting bdi reference. The reference will
get automatically dropped on superblock destruction.
CC: Steven Whitehouse <swhiteho@redhat.com>
CC: Bob Peterson <rpeterso@redhat.com>
CC: cluster-devel@redhat.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
It is not needed anymore since bdi is initialized whenever superblock
exists.
CC: Miklos Szeredi <miklos@szeredi.hu>
CC: linux-fsdevel@vger.kernel.org
Suggested-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
Allocate struct backing_dev_info separately instead of embedding it
inside the superblock. This unifies handling of bdi among users.
CC: Miklos Szeredi <miklos@szeredi.hu>
CC: linux-fsdevel@vger.kernel.org
Acked-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
Allocate struct backing_dev_info separately instead of embedding it
inside the superblock. This unifies handling of bdi among users.
CC: Boaz Harrosh <ooo@electrozaur.com>
CC: Benny Halevy <bhalevy@primarydata.com>
Acked-by: Boaz Harrosh <ooo@electrozaur.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
Allocate struct backing_dev_info separately instead of embedding it
inside the superblock. This unifies handling of bdi among users.
CC: Jan Harkes <jaharkes@cs.cmu.edu>
CC: coda@cs.cmu.edu
CC: codalist@coda.cs.cmu.edu
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
Allocate struct backing_dev_info separately instead of embedding it
inside the superblock. This unifies handling of bdi among users.
CC: David Howells <dhowells@redhat.com>
CC: linux-afs@lists.infradead.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
Allocate struct backing_dev_info separately instead of embedding it
inside the superblock. This unifies handling of bdi among users.
CC: Tyler Hicks <tyhicks@canonical.com>
CC: ecryptfs@vger.kernel.org
Acked-by: Tyler Hicks <tyhicks@canonical.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
Allocate struct backing_dev_info separately instead of embedding it
inside superblock. This unifies handling of bdi among users.
CC: Steve French <sfrench@samba.org>
CC: linux-cifs@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
Allocate struct backing_dev_info separately instead of embedding it
inside client structure. This unifies handling of bdi among users.
CC: Ilya Dryomov <idryomov@gmail.com>
CC: "Yan, Zheng" <zyan@redhat.com>
CC: Sage Weil <sage@redhat.com>
CC: ceph-devel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
Allocate struct backing_dev_info separately instead of embedding it
inside superblock. This unifies handling of bdi among users.
CC: Chris Mason <clm@fb.com>
CC: Josef Bacik <jbacik@fb.com>
CC: David Sterba <dsterba@suse.com>
CC: linux-btrfs@vger.kernel.org
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
Allocate struct backing_dev_info separately instead of embedding it
inside session. This unifies handling of bdi among users.
CC: Eric Van Hensbergen <ericvh@gmail.com>
CC: Ron Minnich <rminnich@sandia.gov>
CC: Latchesar Ionkov <lucho@ionkov.net>
CC: v9fs-developer@lists.sourceforge.net
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
So far we just relied on block device to hold a bdi reference for us
while the filesystem is mounted. While that works perfectly fine, it is
a bit awkward that we have a pointer to a refcounted structure in the
superblock without proper reference. So make s_bdi hold a proper
reference to block device's BDI. No filesystem using mount_bdev()
actually changes s_bdi so this is safe and will make bdev filesystems
work the same way as filesystems needing to set up their private bdi.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
Let's try to have it in a cacheline in nfs4_proc_pgio_rpc_prepare().
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Since commit 00bfa30abe ("NFS: Create a common pgio_alloc and
pgio_release function"), nfs_pgarray_set() has only a single caller. Let's
open code it.
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Prevent a deadlock that can occur if we wait on allocations
that try to write back our pages.
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Fixes: 00bfa30abe ("NFS: Create a common pgio_alloc and pgio_release...")
Cc: stable@vger.kernel.org # 3.16+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Commit a7d42ddb30 ("nfs: add mirroring
support to pgio layer") moved pg_cleanup out of the path when there was
non-sequental I/O that needed to be flushed. The result is that for
layouts that have more than one layout segment per file, the pg_lseg is not
cleared, so we can end up hitting the WARN_ON_ONCE(req_start >= seg_end) in
pnfs_generic_pg_test since the pg_lseg will be pointing to that
previously-flushed layout segment.
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Fixes: a7d42ddb30 ("nfs: add mirroring support to pgio layer")
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
When passed GFP flags that allow sleeping (such as
GFP_NOIO), mempool_alloc() will never return NULL, it will
wait until memory is available.
This means that we don't need to handle failure, but that we
do need to ensure one thread doesn't call mempool_alloc()
twice on the one pool without queuing or freeing the first
allocation. If multiple threads did this during times of
high memory pressure, the pool could be exhausted and a
deadlock could result.
pnfs_generic_alloc_ds_commits() attempts to allocate from
the nfs_commit_mempool while already holding an allocation
from that pool. This is not safe. So change
nfs_commitdata_alloc() to take a flag that indicates whether
failure is acceptable.
In pnfs_generic_alloc_ds_commits(), accept failure and
handle it as we currently do. Else where, do not accept
failure, and do not handle it.
Even when failure is acceptable, we want to succeed if
possible. That means both
- using an entry from the pool if there is one
- waiting for direct reclaim is there isn't.
We call mempool_alloc(GFP_NOWAIT) to achieve the first, then
kmem_cache_alloc(GFP_NOIO|__GFP_NORETRY) to achieve the
second. Each of these can fail, but together they do the
best they can without blocking indefinitely.
The objects returned by kmem_cache_alloc() will still be freed
by mempool_free(). This is safe as mempool_alloc() uses
exactly the same function to allocate objects (since the mempool
was created with mempool_create_slab_pool()). The object returned
by mempool_alloc() and kmem_cache_alloc() are indistinguishable
so mempool_free() will handle both identically, either adding to the
pool or calling kmem_cache_free().
Also, don't test for failure when allocating from
nfs_wdata_mempool.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Returning errors directly even lets us remove the goto
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
If we cut out the dprintk()s, then we can return error codes directly
and cut out the goto.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
This puts all the common code in a single place for the
walk_client_list() functions.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Once again, we can remove the function and compare integer values
directly.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
If we cut out the dprintk()s, then we don't even need this to be a
separate function.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Just remove the function and have the caller use nfs_release_request()
instead.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
We always call nfs_mark_client_ready() even if nfs_create_rpc_client()
returns an error, so we can rearrange nfs_init_client() to mark the
client ready from a single place.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Let's cut out the goto and return any errors immedately
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Additionally, this change lets us cut out the goto by returning errors
immediately.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Removing the dprintk() lets us simplify the function by returning status
codes directly, rather than using a goto.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Removing the dprintk() lets us return the status value directly, rather
than jumping to a label if an error occurs.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
In addition to removing the dprintk(), this patch also initializes "res"
to the default return value instead of doing this through an else
condition.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Removing the dprintk()s lets us simplify the function by removing the
else condition entirely and returning the status of
initiate_{file,bulk}_draining() directly.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Flexfilelayout supports data servers which talk NFS v3 and v4.{0,1,2}.
However, this code path is disabled and v3 only servers are accepted.
This change removes this limitation.
Signed-off-by: Tigran Mkrtchyan <tigran.mkrtchyan@desy.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
NFS has some optimizations for readdir to choose between using READDIR or
READDIRPLUS based on workload, and which NFS operation to use is determined
by subsequent interactions with lookup, d_revalidate, and getattr.
Concurrent use of nfs_readdir() via ->iterate_shared() can cause those
optimizations to repeatedly invalidate the pagecache used to store
directory entries during readdir(), which causes some very bad performance
for directories with many entries (more than about 10000).
There's a couple ways to fix this in NFS, but no fix would be as simple as
going back to ->iterate() to serialize nfs_readdir(), and neither fix I
tested performed as well as going back to ->iterate().
The first required taking the directory's i_lock for each entry, with the
result of terrible contention.
The second way adds another flag to the nfs_inode, and so keeps the
optimizations working for large directories. The difference from using
->iterate() here is that much more memory is consumed for a given workload
without any performance gain.
The workings of nfs_readdir() are such that concurrent users are serialized
within read_cache_page() waiting to retrieve pages of entries from the
server. By serializing this work in iterate_dir() instead, contention for
cache pages is reduced. Waiting processes can have an uncontended pass at
the entirety of the directory's pagecache once previous processes have
completed filling it.
v2 - Keep the bits needed for parallel lookup
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
For overlay file open, check IS_APPEND() on the real upper inode
inside d_real(), because the overlay inode does not have the
S_APPEND flag and IS_APPEND() can only be checked at open time.
Note that because overlayfs does not copy up the chattr inode flags
(i.e. S_APPEND, S_IMMUTABLE), the IS_APPEND() check is only relevant
for upper inodes that were set with chattr +a and not to lower
inodes that had chattr +a before copy up.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
ftruncate an overlayfs inode was checking IS_APPEND() on
overlay inode, but overlay inode does not have the S_APPEND flag.
Check IS_APPEND() on real upper inode instead.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Prepare to mark sensitive kernel structures for randomization by making
sure they're using designated initializers. These were identified during
allyesconfig builds of x86, arm, and arm64, with most initializer fixes
extracted from grsecurity.
For these cases, use { }, which will be zero-filled, instead of
undesignated NULLs.
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
A function in kernel/bpf/syscall.c which got a bug fix in 'net'
was moved to kernel/bpf/verifier.c in 'net-next'.
Signed-off-by: David S. Miller <davem@davemloft.net>
When the kernel is running in secure boot mode, we lock down the kernel to
prevent userspace from modifying the running kernel image. Whilst this
includes prohibiting access to things like /dev/mem, it must also prevent
access by means of configuring driver modules in such a way as to cause a
device to access or modify the kernel image.
To this end, annotate module_param* statements that refer to hardware
configuration and indicate for future reference what type of parameter they
specify. The parameter parser in the core sees this information and can
skip such parameters with an error message if the kernel is locked down.
The module initialisation then runs as normal, but just sees whatever the
default values for those parameters is.
Note that we do still need to do the module initialisation because some
drivers have viable defaults set in case parameters aren't specified and
some drivers support automatic configuration (e.g. PNP or PCI) in addition
to manually coded parameters.
This patch annotates drivers in fs/pstore/.
Suggested-by: Alan Cox <gnomes@lxorguk.ukuu.org.uk>
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Kees Cook <keescook@chromium.org>
cc: Anton Vorontsov <anton@enomsg.org>
cc: Colin Cross <ccross@android.com>
cc: Tony Luck <tony.luck@intel.com>
Pull CIFS fix from Steve French:
"One more cifs fix for stable"
* 'for-next' of git://git.samba.org/sfrench/cifs-2.6:
cifs: Do not send echoes before Negotiate is complete
Andrey reported a use-after-free in __ns_get_path():
spin_lock include/linux/spinlock.h:299 [inline]
lockref_get_not_dead+0x19/0x80 lib/lockref.c:179
__ns_get_path+0x197/0x860 fs/nsfs.c:66
open_related_ns+0xda/0x200 fs/nsfs.c:143
sock_ioctl+0x39d/0x440 net/socket.c:1001
vfs_ioctl fs/ioctl.c:45 [inline]
do_vfs_ioctl+0x1bf/0x1780 fs/ioctl.c:685
SYSC_ioctl fs/ioctl.c:700 [inline]
SyS_ioctl+0x8f/0xc0 fs/ioctl.c:691
We are under rcu read lock protection at that point:
rcu_read_lock();
d = atomic_long_read(&ns->stashed);
if (!d)
goto slow;
dentry = (struct dentry *)d;
if (!lockref_get_not_dead(&dentry->d_lockref))
goto slow;
rcu_read_unlock();
but don't use a proper RCU API on the free path, therefore a parallel
__d_free() could free it at the same time. We need to mark the stashed
dentry with DCACHE_RCUACCESS so that __d_free() will be called after all
readers leave RCU.
Fixes: e149ed2b80 ("take the targets of /proc/*/ns/* symlinks to separate fs")
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
trivial fix to spelling mistake in JFFS2_ERROR message
Signed-off-by: Colin Ian King <colin.king@canonical.com>
[Brian: also fix 'an' -> 'a']
Signed-off-by: Brian Norris <computersforpeace@gmail.com>
This patch introduces an ASYNC IPU policy.
Under senario of large # of async updating(e.g. log writing in Android),
disk would be seriously fragmented, and higher frequent gc would be triggered.
This patch uses IPU to rewrite the async update writting, since async is
NOT sensitive to io latency.
Signed-off-by: Hou Pengyang <houpengyang@huawei.com>
For IPU writes, there won't be any udpates in dnode page since we
will reuse old block address instead of allocating new one, so we
don't need to lock cp_rwsem during IPU IO submitting.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Introduce __check_rb_tree_consistence to check consistence of rb-tree
based discard cache in runtime.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Add an even class f2fs_discard for introducing f2fs_queue_discard, then
use f2fs_{queue,issue}_discard to trace __{queue,submit}_discard_cmd.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Keep issuing big size discard in prior instead of the one with random
size, so that we expect that it will help to:
- be quick to recycle unused large space in flash storage device.
- give a chance for
a) wait to merge small piece discards into bigger one, or
b) avoid issuing discards while they have being reallocated by SSR.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Avoid long variable name in discard_cmd_control structure, no logic
change.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Introduce rb-tree based discard cache infrastructure to speed up lookup and
merge operation of discard entry.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
[Jaegeuk Kim: initialize dc to avoid build warning]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Implement truncate/delete as a non-recursive algorithm. The older
algorithm was implemented with recursion to strip off each layer
at a time (going by height, starting with the maximum height.
This version tries to do the same thing but without recursion,
and without needing to allocate new structures or lists in memory.
For example, say you want to truncate a very large file to 1 byte,
and its end-of-file metapath is: 0.505.463.428. The starting
metapath would be 0.0.0.0. Since it's a truncate to non-zero, it
needs to preserve that byte, and all metadata pointing to it.
So it would start at 0.0.0.0, look up all its metadata buffers,
then free all data blocks pointed to at the highest level.
After that buffer is "swept", it moves on to 0.0.0.1, then
0.0.0.2, etc., reading in buffers and sweeping them clean.
When it gets to the end of the 0.0.0 metadata buffer (for 4K
blocks the last valid one is 0.0.0.508), it backs up to the
previous height and starts working on 0.0.1.0, then 0.0.1.1,
and so forth. After it reaches the end and sweeps 0.0.1.508,
it continues with 0.0.2.0, and so on. When that height is
exhausted, and it reaches 0.0.508.508 it backs up another level,
to 0.1.0.0, then 0.1.0.1, through 0.1.0.508. So it has to keep
marching backwards and forwards through the metadata until it's
all swept clean. Once it has all the data blocks freed, it
lowers the strip height, and begins the process all over again,
but with one less height. This time it sweeps 0.0.0 through
0.505.463. When that's clean, it lowers the strip height again
and works to free 0.505. Eventually it strips the lowest height, 0.
For a delete or truncate to 0, all metadata for all heights of
0.0.0.0 would be freed. For a truncate to 1 byte, 0.0.0.0 would
be preserved.
This isn't much different from normal integer incrementing,
where an integer gets incremented from 0000 (0.0.0.0) to 3021
(3.0.2.1). So 0000 gets increments to 0001, 0002, up to 0009,
then on to 0010, 0011 up to 0099, then 0100 and so forth. It's
just that each "digit" goes from 0 to 508 (for a total of 509
pointers) rather than from 0 to 9.
Note that the dinode will only have 483 pointers due to the
dinode structure itself.
Also note: this is just an example. These numbers (509 and 483)
are based on a standard 4K block size. Smaller block sizes will
yield smaller numbers of indirect pointers accordingly.
The truncation process is accomplished with the help of two
major functions and a few helper functions.
Functions do_strip and recursive_scan are obsolete, so removed.
New function sweep_bh_for_rgrps cleans a buffer_head pointed to
by the given metapath and height. By cleaning, I mean it frees
all blocks starting at the offset passed in metapath. It starts
at the first block in the buffer pointed to by the metapath and
identifies its resource group (rgrp). From there it frees all
subsequent block pointers that lie within that rgrp. If it's
already inside a transaction, it stays within it as long as it
can. In other words, it doesn't close a transaction until it knows
it's freed what it can from the resource group. In this way,
multiple buffers may be cleaned in a single transaction, as long
as those blocks in the buffer all lie within the same rgrp.
If it's not in a transaction, it starts one. If the buffer_head
has references to blocks within multiple rgrps, it frees all the
blocks inside the first rgrp it finds, then closes the
transaction. Then it repeats the cycle: identifies the next
unfreed block, uses it to find its rgrp, then starts a new
transaction for that set. It repeats this process repeatedly
until the buffer_head contains no more references to any blocks
past the given metapath.
Function trunc_dealloc has been reworked into a finite state
automaton. It has basically 3 active states:
DEALLOC_MP_FULL, DEALLOC_MP_LOWER, and DEALLOC_FILL_MP:
The DEALLOC_MP_FULL state implies the metapath has a full set
of buffers out to the "shrink height", and therefore, it can
call function sweep_bh_for_rgrps to free the blocks within the
highest height of the metapath. If it's just swept the lowest
level (or an error has occurred) the state machine is ended.
Otherwise it proceeds to the DEALLOC_MP_LOWER state.
The DEALLOC_MP_LOWER state implies we are finished with a given
buffer_head, which may now be released, and therefore we are
then missing some buffer information from the metapath. So we
need to find more buffers to read in. In most cases, this is
just a matter of releasing the buffer_head and moving to the
next pointer from the previous height, so it may be read in and
swept as well. If it can't find another non-null pointer to
process, it checks whether it's reached the end of a height
and needs to lower the strip height, or whether it still needs
move forward through the previous height's metadata. In this
state, all zero-pointers are skipped. From this state, it can
only loop around (once more backing up another height) or,
once a valid metapath is found (one that has non-zero
pointers), proceed to state DEALLOC_FILL_MP.
The DEALLOC_FILL_MP state implies that we have a metapath
but not all its buffers are read in. So we must proceed to read
in buffer_heads until the metapath has a valid buffer for every
height. If the previous state backed us up 3 heights, we may
need to read in a buffer, increment the height, then repeat the
process until buffers have been read in for all required heights.
If it's successful reading a buffer, and it's at the highest
height we need, it proceeds back to the DEALLOC_MP_FULL state.
If it's unable to fill in a buffer, (encounters a hole, etc.)
it tries to find another non-zero block pointer. If they're all
zero, it lowers the height and returns to the DEALLOC_MP_LOWER
state. If it finds a good non-null pointer, it loops around and
reads it in, while keeping the metapath in lock-step with the
pointers it examines.
The state machine runs until the truncation request is
satisfied. Then any transactions are ended, the quota and
statfs data are updated, and the function is complete.
Helper function metaptr1 was introduced to be an easy way to
determine the start of a buffer_head's indirect pointers.
Helper function lookup_mp_height was introduced to find a
metapath index and read in the buffer that corresponds to it.
In this way, function lookup_metapath becomes a simple loop to
call it for every height.
Helper function fillup_metapath is similar to lookup_metapath
except it can do partial lookups. If the state machine
backed up multiple levels (like 2999 wrapping to 3000) it
needs to find out the next starting point and start issuing
metadata reads at that point.
Helper function hptrs is a shortcut to determine how many
pointers should be expected in a buffer. Height 0 is the dinode
which has fewer pointers than the others.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Now that all places setting inode->i_flags that should be reflected in
on-disk flags are gone, we can remove i_attrs_to_sd_attrs() call.
Signed-off-by: Jan Kara <jack@suse.cz>
reiserfs_new_inode() clears IMMUTABLE and APPEND flags from a symlink
i_flags however a few lines below in sd_attrs_to_i_attrs() we will
happily overwrite i_flags with whatever we inherited from the directory.
Since this behavior is there for ages just remove the useless setting of
i_flags.
Signed-off-by: Jan Kara <jack@suse.cz>
Now that all places setting inode->i_flags that should be reflected in
on-disk flags are gone, we can remove jfs_get_inode_flags() call.
Signed-off-by: Jan Kara <jack@suse.cz>
Now that all places setting inode->i_flags that should be reflected in
on-disk flags are gone, we can remove ext2_get_inode_flags() call.
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Jan Kara <jack@suse.cz>
Now that all places setting inode->i_flags that should be reflected in
on-disk flags are gone, we can remove ext4_get_inode_flags() call.
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Jan Kara <jack@suse.cz>
Currently we set IMMUTABLE and NOATIME flags on quota files to stop
userspace from messing with them. Now that all filesystems set these
flags in their quota_on handlers, we can stop setting the flags in
generic quota code. This will allow filesystems to stop copying i_flags
to their on-disk flags on various occasions.
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Jan Kara <jack@suse.cz>
Currently immutable and noatime flags on quota files are set by quota
code which requires us to copy inode->i_flags to our on disk version
of quota flags in GETFLAGS ioctl and copy_to_dinode(). Move to
setting / clearing these on-disk flags directly to save that copying.
Signed-off-by: Jan Kara <jack@suse.cz>
Currently immutable and noatime flags on quota files are set by quota
code which requires us to copy inode->i_flags to our on disk version of
quota flags in GETFLAGS ioctl and __ext2_write_inode(). Move to setting
/ clearing these on-disk flags directly to save that copying.
Signed-off-by: Jan Kara <jack@suse.cz>
Currently immutable and noatime flags on quota files are set by quota
code which requires us to copy inode->i_flags to our on disk version of
quota flags in GETFLAGS ioctl and when writing stat item. Move to
setting / clearing these on-disk flags directly to save that copying.
Signed-off-by: Jan Kara <jack@suse.cz>
Currently immutable and noatime flags on quota files are set by quota
code which requires us to copy inode->i_flags to our on disk version of
quota flags in GETFLAGS ioctl and ext4_do_update_inode(). Move to
setting / clearing these on-disk flags directly to save that copying.
Signed-off-by: Jan Kara <jack@suse.cz>
The WARN_ON and warning from report_reserved_underflow can become very
noisy and is visible unconditionally although this is namely for
debugging. The patch "btrfs: Add WARN_ON for qgroup reserved underflow"
(18dc22c19b) went to 4.11-rc1 and the plan
was to get the fix as well, but this hasn't happened.
CC: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It is perfectly fine to link a tmpfile back using linkat().
Since tmpfiles are created with a link count of 0 they appear
on the orphan list, upon re-linking the inode has to be removed
from the orphan list again.
Ralph faced a filesystem corruption in combination with overlayfs
due to this bug.
Cc: <stable@vger.kernel.org>
Cc: Ralph Sennhauser <ralph.sennhauser@gmail.com>
Cc: Amir Goldstein <amir73il@gmail.com>
Reported-by: Ralph Sennhauser <ralph.sennhauser@gmail.com>
Tested-by: Ralph Sennhauser <ralph.sennhauser@gmail.com>
Reported-by: Amir Goldstein <amir73il@gmail.com>
Fixes: 474b93704f ("ubifs: Implement O_TMPFILE")
Signed-off-by: Richard Weinberger <richard@nod.at>
A group of Linux kernel hackers reported chasing a bug that resulted
from their assumption that SLAB_DESTROY_BY_RCU provided an existence
guarantee, that is, that no block from such a slab would be reallocated
during an RCU read-side critical section. Of course, that is not the
case. Instead, SLAB_DESTROY_BY_RCU only prevents freeing of an entire
slab of blocks.
However, there is a phrase for this, namely "type safety". This commit
therefore renames SLAB_DESTROY_BY_RCU to SLAB_TYPESAFE_BY_RCU in order
to avoid future instances of this sort of confusion.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: <linux-mm@kvack.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
[ paulmck: Add comments mentioning the old name, as requested by Eric
Dumazet, in order to help people familiar with the old name find
the new one. ]
Acked-by: David Rientjes <rientjes@google.com>
When the userspace process servicing fuse requests is running in
a pid namespace then pids passed via the fuse fd are not being
translated into that process' namespace. Translation is necessary
for the pid to be useful to that process.
Since no use case currently exists for changing namespaces all
translations can be done relative to the pid namespace in use
when fuse_conn_init() is called. For fuse this translates to
mount time, and for cuse this is when /dev/cuse is opened. IO for
this connection from another namespace will return errors.
Requests from processes whose pid cannot be translated into the
target namespace will have a value of 0 for in.h.pid.
File locking changes based on previous work done by Eric
Biederman.
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
The block layer call chain from submit_bio will check if the write cache
is enabled for the given queue before submitting the flush. This will
add a code to fail fast if its not.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ updated changelog to reflect current code stat, blkdev_issue_flush is
not used yet ]
Signed-off-by: David Sterba <dsterba@suse.com>
The last consumer of nobarriers is removed by the commit [1] and sync
won't fail with EOPNOTSUPP anymore. Thus, now when write cache is write
through it just return success without actually transpiring such a
request to the block device/lun.
[1]
commit b25de9d6da
block: remove BIO_EOPNOTSUPP
And, as the device/lun write cache state may change dynamically saving
such as state won't help either. So deleting the member nobarriers.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When scrubbing a RAID5 which has recoverable data corruption (only one
data stripe is corrupted), sometimes scrub will report more csum errors
than expected. Sometimes even unrecoverable error will be reported.
The problem can be easily reproduced by the following steps:
1) Create a btrfs with RAID5 data profile with 3 devs
2) Mount it with nospace_cache or space_cache=v2
To avoid extra data space usage.
3) Create a 128K file and sync the fs, unmount it
Now the 128K file lies at the beginning of the data chunk
4) Locate the physical bytenr of data chunk on dev3
Dev3 is the 1st data stripe.
5) Corrupt the first 64K of the data chunk stripe on dev3
6) Mount the fs and scrub it
The correct csum error number should be 16 (assuming using x86_64).
Larger csum error number can be reported in a 1/3 chance.
And unrecoverable error can also be reported in a 1/10 chance.
The root cause of the problem is RAID5/6 recover code has race
condition, due to the fact that full scrub is initiated per device.
While for other mirror based profiles, each mirror is independent with
each other, so race won't cause any big problem.
For example:
Corrupted | Correct | Correct |
| Scrub dev3 (D1) | Scrub dev2 (D2) | Scrub dev1(P) |
------------------------------------------------------------------------
Read out D1 |Read out D2 |Read full stripe |
Check csum |Check csum |Check parity |
Csum mismatch |Csum match, continue |Parity mismatch |
handle_errored_block | |handle_errored_block |
Read out full stripe | | Read out full stripe|
D1 csum error(err++) | | D1 csum error(err++)|
Recover D1 | | Recover D1 |
So D1's csum error is accounted twice, just because
handle_errored_block() doesn't have enough protection, and race can happen.
On even worse case, for example D1's recovery code is re-writing
D1/D2/P, and P's recovery code is just reading out full stripe, then we
can cause unrecoverable error.
This patch will use previously introduced lock_full_stripe() and
unlock_full_stripe() to protect the whole scrub_handle_errored_block()
function for RAID56 recovery.
So no extra csum error nor unrecoverable error.
Reported-by: Goffredo Baroncelli <kreijack@libero.it>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Unlike mirror based profiles, RAID5/6 recovery needs to read out the
whole full stripe.
And if we don't do proper protection, it can easily cause race condition.
Introduce 2 new functions: lock_full_stripe() and unlock_full_stripe()
for RAID5/6.
Which store a rb_tree of mutexes for full stripes, so scrub callers can
use them to lock a full stripe to avoid race.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minor comment adjustments ]
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_root_item maintains the ctime for root updates. This is not part
of vfs_inode.
Since current_time() uses struct inode* as an argument as Linus
suggested, this cannot be used to update root times unless, we modify
the signature to use inode.
Since btrfs uses nanosecond time granularity, it can also use
ktime_get_real_ts directly to obtain timestamp for the root. It is
necessary to use the timespec time api here because the same
btrfs_set_stack_timespec_*() apis are used for vfs inode times as well.
These can be transitioned to using timespec64 when btrfs internally
changes to use timespec64 as well.
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Acked-by: David Sterba <dsterba@suse.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_get_extent() never returns NULL pointers, so this code introduces
a static checker warning.
The btrfs_get_extent() is a bit complex, but trust me that it doesn't
return NULLs and also if it did we would trigger the BUG_ON(!em) before
the last return statement.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
[ updated subject ]
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
The easist way to reproduce the bug is:
------
# mkfs.btrfs -f $dev -n 16K
# mount $dev $mnt -o inode_cache
# btrfs quota enable $mnt
# btrfs quota rescan -w $mnt
# btrfs qgroup show $mnt
qgroupid rfer excl
-------- ---- ----
0/5 32.00KiB 32.00KiB
^^ Twice the correct value
------
And fstests/btrfs qgroup test group can easily detect them with
inode_cache mount option.
Although some of them are false alerts since old test cases are using
fixed golden output.
While new test cases will use "btrfs check" to detect qgroup mismatch.
[CAUSE]
Inode_cache mount option will make commit_fs_roots() to call
btrfs_save_ino_cache() to update fs/subvol trees, and generate new
delayed refs.
However we call btrfs_qgroup_prepare_account_extents() too early, before
commit_fs_roots().
This makes the "old_roots" for newly generated extents are always NULL.
For freeing extent case, this makes both new_roots and old_roots to be
empty, while correct old_roots should not be empty.
This causing qgroup numbers not decreased correctly.
[FIX]
Modify the timing of calling btrfs_qgroup_prepare_account_extents() to
just before btrfs_qgroup_account_extents(), and add needed delayed_refs
handler.
So qgroup can handle inode_map mount options correctly.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have already assigned q from bdev_get_queue() so use it.
And rearrange the code for better view.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is fixing code pieces where we use div_u64 when passing a u64 divisor.
Cc: David Sterba <dsterba@suse.cz>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit 3d8da67817 ("Btrfs: fix divide error upon chunk's stripe_len")
changed stripe_len in struct map_lookup to u64, but didn't update
stripe_len in struct scrub_parity.
This updates the type and switches to div64_u64_rem to match u64 divisor.
Cc: David Sterba <dsterba@suse.cz>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that scrub can fix data errors with the help of parity for raid56
profile, repair during read is able to as well.
Although the mirror num in raid56 scenario has different meanings, i.e.
0 or 1: read data directly
> 1: do recover with parity,
it could be fit into how we repair bad block during read.
The trick is to use BTRFS_MAP_READ instead of BTRFS_MAP_WRITE to get the
device and position on it.
Cc: David Sterba <dsterba@suse.cz>
Tested-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's a helper to clear whole page, with a arch-specific optimized
code. The replaced cases do not seem to be in performace critical code,
but we still might get some percent gain.
Signed-off-by: David Sterba <dsterba@suse.com>
scrub_setup_recheck_block() calls btrfs_map_sblock() and then accesses
bbio without protection of bio_counter.
This can lead to use-after-free if racing with dev replace cancel.
Fix it by increasing bio_counter before calling btrfs_map_sblock() and
decreasing the bio_counter when corresponding recover is finished.
Cc: Liu Bo <bo.li.liu@oracle.com>
Reported-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When raid56 dev-replace is cancelled by running scrub, we will free
target device without waiting for in-flight bios, causing the following
NULL pointer deference or general protection failure.
BUG: unable to handle kernel NULL pointer dereference at 00000000000005e0
IP: generic_make_request_checks+0x4d/0x610
CPU: 1 PID: 11676 Comm: kworker/u4:14 Tainted: G O 4.11.0-rc2 #72
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-20170228_101828-anatol 04/01/2014
Workqueue: btrfs-endio-raid56 btrfs_endio_raid56_helper [btrfs]
task: ffff88002875b4c0 task.stack: ffffc90001334000
RIP: 0010:generic_make_request_checks+0x4d/0x610
Call Trace:
? generic_make_request+0xc7/0x360
generic_make_request+0x24/0x360
? generic_make_request+0xc7/0x360
submit_bio+0x64/0x120
? page_in_rbio+0x4d/0x80 [btrfs]
? rbio_orig_end_io+0x80/0x80 [btrfs]
finish_rmw+0x3f4/0x540 [btrfs]
validate_rbio_for_rmw+0x36/0x40 [btrfs]
raid_rmw_end_io+0x7a/0x90 [btrfs]
bio_endio+0x56/0x60
end_workqueue_fn+0x3c/0x40 [btrfs]
btrfs_scrubparity_helper+0xef/0x620 [btrfs]
btrfs_endio_raid56_helper+0xe/0x10 [btrfs]
process_one_work+0x2af/0x720
? process_one_work+0x22b/0x720
worker_thread+0x4b/0x4f0
kthread+0x10f/0x150
? process_one_work+0x720/0x720
? kthread_create_on_node+0x40/0x40
ret_from_fork+0x2e/0x40
RIP: generic_make_request_checks+0x4d/0x610 RSP: ffffc90001337bb8
In btrfs_dev_replace_finishing(), we will call
btrfs_rm_dev_replace_blocked() to wait bios before destroying the target
device when scrub is finished normally.
However when dev-replace is aborted, either due to error or cancelled by
scrub, we didn't wait for bios, this can lead to use-after-free if there
are bios holding the target device.
Furthermore, for raid56 scrub, at least 2 places are calling
btrfs_map_sblock() without protection of bio_counter, leading to the
problem.
This patch fixes the problem:
1) Wait for bio_counter before freeing target device when canceling
replace
2) When calling btrfs_map_sblock() for raid56, use bio_counter to
protect the call.
Cc: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In the following situation, scrub will calculate wrong parity to
overwrite the correct one:
RAID5 full stripe:
Before
| Dev 1 | Dev 2 | Dev 3 |
| Data stripe 1 | Data stripe 2 | Parity Stripe |
--------------------------------------------------- 0
| 0x0000 (Bad) | 0xcdcd | 0x0000 |
--------------------------------------------------- 4K
| 0xcdcd | 0xcdcd | 0x0000 |
...
| 0xcdcd | 0xcdcd | 0x0000 |
--------------------------------------------------- 64K
After scrubbing dev3 only:
| Dev 1 | Dev 2 | Dev 3 |
| Data stripe 1 | Data stripe 2 | Parity Stripe |
--------------------------------------------------- 0
| 0xcdcd (Good) | 0xcdcd | 0xcdcd (Bad) |
--------------------------------------------------- 4K
| 0xcdcd | 0xcdcd | 0x0000 |
...
| 0xcdcd | 0xcdcd | 0x0000 |
--------------------------------------------------- 64K
The reason is that after raid56 read rebuild rbio->stripe_pages are all
correctly recovered (0xcd for data stripes).
However when we check and repair parity in
scrub_parity_check_and_repair(), we will append pages in sparity->spages
list to rbio->bio_pages[], which contains old on-disk data.
And when we submit parity data to disk, we calculate parity using
rbio->bio_pages[] first, if rbio->bio_pages[] not found, then fallback
to rbio->stripe_pages[].
The patch fix it by not appending pages from sparity->spages.
So finish_parity_scrub() will use rbio->stripe_pages[] which is correct.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Newly introduced qgroup reserved space trace points are normally nested
into several common qgroup operations.
While some other trace points are not well placed to co-operate with
them, causing confusing output.
This patch re-arrange trace_btrfs_qgroup_release_data() and
trace_btrfs_qgroup_free_delayed_ref() trace points so they are triggered
before reserved space ones.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Introduce the following trace points:
qgroup_update_reserve
qgroup_meta_reserve
These trace points are handy to trace qgroup reserve space related
problems.
Also export btrfs_qgroup structure, as now we directly pass btrfs_qgroup
structure to trace points, so that structure needs to be exported.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In raid56 scenario, after trying parity recovery, we didn't set
mirror_num for btrfs_bio with failed mirror_num, hence
end_bio_extent_readpage() will report a random mirror_num in dmesg
log.
Cc: David Sterba <dsterba@suse.cz>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Scrub repairs data by the unit called scrub_block, which may contain
several pages. Scrub always tries to look up a good copy of a whole
block, but if there's no such copy, it tries to do repair page by page.
If we don't set page's io_error when checking this bad copy, in the last
step, we may skip this page when repairing bad copy from good copy.
Cc: David Sterba <dsterba@suse.cz>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There are several operations, usually started from ioctls, that cannot
run concurrently. The status is tracked in
mutually_exclusive_operation_running as an atomic_t. We can easily track
the status as one of the per-filesystem flag bits with same
synchronization guarantees.
The conversion replaces:
* atomic_xchg(..., 1) -> test_and_set_bit(FLAG, ...)
* atomic_set(..., 0) -> clear_bit(FLAG, ...)
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We are facing the same problem with EDQUOT which was experienced with
ENOSPC. Not sure if we require a full ticketing system such as ENOSPC, but
here is a quick fix, which may be too big a hammer.
Quotas are reserved during the start of an operation, incrementing
qg->reserved. However, it is written to disk in a commit_transaction
which could take as long as commit_interval. In the meantime there
could be deletions which are not accounted for because deletions are
accounted for only while committed (free_refroot). So, when we get
a EDQUOT flush the data to disk and try again.
This fixes fstests btrfs/139.
Here is a sample script which shows this issue.
DEVICE=/dev/vdb
MOUNTPOINT=/mnt
TESTVOL=$MOUNTPOINT/tmp
QUOTA=5
PROG=btrfs
DD_BS="4k"
DD_COUNT="256"
RUN_TIMES=5000
mkfs.btrfs -f $DEVICE
mount -o commit=240 $DEVICE $MOUNTPOINT
$PROG subvolume create $TESTVOL
$PROG quota enable $TESTVOL
$PROG qgroup limit ${QUOTA}G $TESTVOL
typeset -i DD_RUN_GOOD
typeset -i QUOTA
function _check_cmd() {
if [[ ${?} > 0 ]]; then
echo -n "$(date) E: Running previous command"
echo ${*}
echo "Without sync"
$PROG qgroup show -pcreFf ${TESTVOL}
echo "With sync"
$PROG qgroup show -pcreFf --sync ${TESTVOL}
exit 1
fi
}
while true; do
DD_RUN_GOOD=$RUN_TIMES
while (( ${DD_RUN_GOOD} != 0 )); do
dd if=/dev/zero of=${TESTVOL}/quotatest${DD_RUN_GOOD} bs=${DD_BS} count=${DD_COUNT}
_check_cmd "dd if=/dev/zero of=${TESTVOL}/quotatest${DD_RUN_GOOD} bs=${DD_BS} count=${DD_COUNT}"
DD_RUN_GOOD=(${DD_RUN_GOOD}-1)
done
$PROG qgroup show -pcref $TESTVOL
echo "----------- Cleanup ---------- "
rm $TESTVOL/quotatest*
done
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Define the SEQ_LAST macro to replace (u64)-1 in places where said
value triggers a special-case ref search behavior.
Signed-off-by: Edmund Nadolski <enadolski@suse.com>
Reviewed-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Replace hardcoded numeric values for __merge_refs 'mode' argument
with descriptive constants.
Signed-off-by: Edmund Nadolski <enadolski@suse.com>
Reviewed-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The members have been effectively unused since "Btrfs: rework qgroup
accounting" (fcebe4562d), there's no substitute for
assert_qgroups_uptodate so it's removed as well.
Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The name is misleading and the local variable serves no purpose.
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We can preallocate the node so insertion does not have to do that under
the lock. The GFP flags for the global radix tree are initialized to
GFP_NOFS & ~__GFP_DIRECT_RECLAIM
but we can use GFP_KERNEL, because readahead is optional and not on any
critical writeout path.
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We can preallocate the node so insertion does not have to do that under
the lock. The GFP flags for the per-device radix tree are initialized to
GFP_NOFS & ~__GFP_DIRECT_RECLAIM
but we can use GFP_KERNEL, same as an allocation above anyway, but also
because readahead is optional and not on any critical writeout path.
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Code cleanup.
The code block is for !(*flags & MS_RDONLY). We don't need
to check it again.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We also don't bother to flush free space cache while with free space
tree.
Cc: David Sterba <dsterba@suse.cz>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
These two BUG_ON()s would never be true, ensured by callers' logic.
Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This adds a helper to show directly whether ops require full stripe.
Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
With this, we can avoid allocating memory for dev replace copies if the
target dev is not available.
Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since this part is mostly independent, this moves it to a separate
function.
Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
As the part of getting extra mirror in __btrfs_map_block is
independent, this puts it into a separate function.
Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since DISCARD is not as important as an operation like write, we don't
copy it to target device during replace, and it makes __btrfs_map_block
less complex.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have similar code here and there, this merges them into a helper.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
While debugging truncate problems, I found that these tracepoints could
help us quickly know what went wrong.
Two sets of tracepoints are created to track regular/prealloc file item
and inline file item respectively, I put inline as a separate one since
what inline file items cares about are way less than the regular one.
This adds four tracepoints:
- btrfs_get_extent_show_fi_regular
- btrfs_get_extent_show_fi_inline
- btrfs_truncate_show_fi_regular
- btrfs_truncate_show_fi_inline
Cc: David Sterba <dsterba@suse.cz>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ formatting adjustments ]
Signed-off-by: David Sterba <dsterba@suse.com>
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
After 76b42abbf7 ("Btrfs: fix data loss after truncate when using the
no-holes feature"),
For either NO_HOLES or inline extents, we've set last_size to newsize to
avoid data loss after remount or inode got evicted and read again, thus,
we don't need this check anymore.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If your filesystem has, eg, data:raid0 metadata:raid1, and you run "btrfs
balance -dconvert=raid1", the meta.target field will be uninitialized.
That's otherwise ok, as it's unused except for this warning.
Thus, let's use the existing set of raid levels for the comparison.
As a side effect, non-convert balances will now nag about data>metadata.
Signed-off-by: Adam Borowski <kilobyte@angband.pl>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
commit 4fcd1813e6 ("Fix reconnect to not defer smb3 session reconnect
long after socket reconnect") added support for Negotiate requests to
be initiated by echo calls.
To avoid delays in calling echo after a reconnect, I added the patch
introduced by the commit b8c600120f ("Call echo service immediately
after socket reconnect").
This has however caused a regression with cifs shares which do not have
support for echo calls to trigger Negotiate requests. On connections
which need to call Negotiation, the echo calls trigger an error which
triggers a reconnect which in turn triggers another echo call. This
results in a loop which is only broken when an operation is performed on
the cifs share. For an idle share, it can DOS a server.
The patch uses the smb_operation can_echo() for cifs so that it is
called only if connection has been already been setup.
kernel bz: 194531
Signed-off-by: Sachin Prabhu <sprabhu@redhat.com>
Tested-by: Jonathan Liu <net147@gmail.com>
Acked-by: Pavel Shilovsky <pshilov@microsoft.com>
CC: Stable <stable@vger.kernel.org>
Signed-off-by: Steve French <smfrench@gmail.com>
It leaves the iterator advanced by the amount of IO it has requested
instead of the amount actually transferred. Among other things,
that confuses the hell out of generic_file_splice_read().
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
short copy here should mean instant EFAULT, not "move to the
next page and hope it fails there, this time with nothing
copied"
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Unlike normal compat syscall variants, it is needed only for
biarch architectures that have different alignement requirements for
u64 in 32bit and 64bit ABI *and* have __put_user() that won't handle
a store of 64bit value at 32bit-aligned address. We used to have one
such (ia64), but its biarch support has been gone since 2010 (after
being broken in 2008, which went unnoticed since nobody had been using
it).
It had escaped removal at the same time only because back in 2004
a patch that switched several syscalls on amd64 from private wrappers to
generic compat ones had switched to use of compat_sys_getdents64(), which
hadn't needed (or used) a compat wrapper on amd64.
Let's bury it - it's at least 7 years overdue.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The function no longer does anything. The is only a single caller of
register_sysctl_root when semantically there should be two. Remove
this function so that if someone decides this functionality is needed
again it will be obvious all of the callers of setup_sysctl_set need
to be audited and modified appropriately.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Conflicts were simply overlapping changes. In the net/ipv4/route.c
case the code had simply moved around a little bit and the same fix
was made in both 'net' and 'net-next'.
In the net/sched/sch_generic.c case a fix in 'net' happened at
the same time that a new argument was added to qdisc_hash_add().
Signed-off-by: David S. Miller <davem@davemloft.net>
Otherwise lockdep says:
[ 1337.483798] ================================================
[ 1337.483999] [ BUG: lock held when returning to user space! ]
[ 1337.484252] 4.11.0-rc6 #19 Not tainted
[ 1337.484423] ------------------------------------------------
[ 1337.484626] mount/14766 is leaving the kernel with locks still held!
[ 1337.484841] 1 lock held by mount/14766:
[ 1337.485017] #0: (&type->s_umount_key#33/1){+.+.+.}, at: [<ffffffff8124171f>] sget_userns+0x2af/0x520
Caught by xfstests generic/413 which tried to mount with the unsupported
mount option dax. Then xfstests generic/422 ran sync which deadlocks.
Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Acked-by: Mike Marshall <hubcap@omnibond.com>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Normal pathname lookup doesn't allow empty pathnames, but using
AT_EMPTY_PATH (with name_to_handle_at() or fstatat(), for example) you
can trigger an empty pathname lookup.
And not only is the RCU lookup in that case entirely unnecessary
(because we'll obviously immediately finalize the end result), it is
actively wrong.
Why? An empth path is a special case that will return the original
'dirfd' dentry - and that dentry may not actually be RCU-free'd,
resulting in a potential use-after-free if we were to initialize the
path lazily under the RCU read lock and depend on complete_walk()
finalizing the dentry.
Found by syzkaller and KASAN.
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Reported-by: Vegard Nossum <vegard.nossum@gmail.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull btrfs fixes from Chris Mason:
"Dave Sterba collected a few more fixes for the last rc.
These aren't marked for stable, but I'm putting them in with a batch
were testing/sending by hand for this release"
* 'for-linus-4.11' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: fix potential use-after-free for cloned bio
Btrfs: fix segmentation fault when doing dio read
Btrfs: fix invalid dereference in btrfs_retry_endio
btrfs: drop the nossd flag when remounting with -o ssd
Pull more CIFS fixes from Steve French:
"As promised, here is the remaining set of cifs/smb3 fixes for stable
(and a fix for one regression) now that they have had additional
review and testing"
* 'for-next' of git://git.samba.org/sfrench/cifs-2.6:
CIFS: Fix SMB3 mount without specifying a security mechanism
CIFS: store results of cifs_reopen_file to avoid infinite wait
CIFS: remove bad_network_name flag
CIFS: reconnect thread reschedule itself
CIFS: handle guest access errors to Windows shares
CIFS: Fix null pointer deref during read resp processing
If mmap() maps a file, it can be passed an offset into the file at which
the mapping is to start. Offset could be a negative value when
represented as a loff_t. The offset plus length will be used to update
the file size (i_size) which is also a loff_t.
Validate the value of offset and offset + length to make sure they do
not overflow and appear as negative.
Found by syzcaller with commit ff8c0c53c4 ("mm/hugetlb.c: don't call
region_abort if region_chg fails") applied. Prior to this commit, the
overflow would still occur but we would luckily return ENOMEM.
To reproduce:
mmap(0, 0x2000, 0, 0x40021, 0xffffffffffffffffULL, 0x8000000000000000ULL);
Resulted in,
kernel BUG at mm/hugetlb.c:742!
Call Trace:
hugetlbfs_evict_inode+0x80/0xa0
evict+0x24a/0x620
iput+0x48f/0x8c0
dentry_unlink_inode+0x31f/0x4d0
__dentry_kill+0x292/0x5e0
dput+0x730/0x830
__fput+0x438/0x720
____fput+0x1a/0x20
task_work_run+0xfe/0x180
exit_to_usermode_loop+0x133/0x150
syscall_return_slowpath+0x184/0x1c0
entry_SYSCALL_64_fastpath+0xab/0xad
Fixes: ff8c0c53c4 ("mm/hugetlb.c: don't call region_abort if region_chg fails")
Link: http://lkml.kernel.org/r/1491951118-30678-1-git-send-email-mike.kravetz@oracle.com
Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Yet another instance of the same race.
Fix is identical to change_huge_pmd().
See "thp: fix MADV_DONTNEED vs. numa balancing race" for more details.
Link: http://lkml.kernel.org/r/20170302151034.27829-5-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
I'm hitting the BUG in nfsd4_max_reply() at fs/nfsd/nfs4proc.c:2495 when
client sends an operation the server doesn't support.
in nfsd4_max_reply() it checks for NULL rsize_bop but a non-supported
operation wouldn't have that set.
Cc: Kinglong Mee <kinglongmee@gmail.com>
Fixes: 2282cd2c05 "NFSD: Get response size before operation..."
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Commit ef65aaede2 ("smb2: Enforce sec= mount option") changed the
behavior of a mount command to enforce a specified security mechanism
during mounting. On another hand according to the spec if SMB3 server
doesn't respond with a security context it implies that it supports
NTLMSSP. The current code doesn't keep it in mind and fails a mount
for such servers if no security mechanism is specified. Fix this by
indicating that a server supports NTLMSSP if a security context isn't
returned during negotiate phase. This allows the code to use NTLMSSP
by default for SMB3 mounts.
Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Otherwise, we can see stale fsync/dentry mark given by previous calls, resulting
in giving up roll-forward recovery due to wrong dentry mark.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Don't use blk plug covering area where there won't be any IOs being issued.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Lockdep complains about use of the iolock in inode reclaim context
because it doesn't understand that reclaim has the last reference to
the inode, and thus an iolock->reclaim->iolock deadlock is not
possible.
The iolock is technically not necessary in xfs_inactive() and was
only added to appease an assert in xfs_free_eofblocks(), which can
be called from other non-reclaim contexts. Therefore, just kill the
assert and drop the use of the iolock from reclaim context to quiet
lockdep.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Long ago, all this gunk was added with a lament about problems
with gcc's do_div, and a fun recommendation in the changelog:
egcs-2.91.66 is the recommended compiler version for building XFS.
All this special stuff was needed to work around an old gcc bug,
apparently, and it's been there ever since.
There should be no need for this anymore, so remove it.
Remove the special 32-bit xfs_do_mod as well; just let the
kernel's do_div() handle all this.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
ndquots is a 32-bit value, and we don't care
about the remainder; there is no reason to use do_div
here, it seems to be the result of a decade+ historical
accident.
Worse, the do_div implementation in userspace breaks
when fed a 32-bit dividend, so we commented it out there
in any case.
Change to simple division, and then we can change
userspace to match, and mandate a 64-bit dividend in
the do_div() in userspace as well.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
rb-tree lookup/update functions are deeply coupled into extent cache
codes, it's very hard to reuse these basic functions, this patch
extracts common rb-tree operation infrastructure for latter reusing.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Now we're doing SSR aggressively more than ever before, so once we reach to
the reserved_segment, f2fs_balance_fs will call f2fs_gc, which triggers
checkpoint everytime. We actually must avoid that.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
KASAN reports that there is a use-after-free case of bio in btrfs_map_bio.
If we need to submit IOs to several disks at a time, the original bio
would get cloned and mapped to the destination disk, but we really should
use the original bio instead of a cloned bio to do the sanity check
because cloned bios are likely to be freed by its endio.
Reported-by: Diego <diegocg@gmail.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit 2dabb32484 ("Btrfs: Direct I/O read: Work on sectorsized blocks")
introduced this bug during iterating bio pages in dio read's endio hook,
and it could end up with segment fault of the dio reading task.
So the reason is 'if (nr_sectors--)', and it makes the code assume that
there is one more block in the same page, so page offset is increased and
the bio which is created to repair the bad block then has an incorrect
bvec.bv_offset, and a later access of the page content would throw a
segmentation fault.
This also adds ASSERT to check page offset against page size.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When doing directIO repair, we have this oops:
[ 1458.532816] general protection fault: 0000 [#1] SMP
...
[ 1458.536291] Workqueue: btrfs-endio-repair btrfs_endio_repair_helper [btrfs]
[ 1458.536893] task: ffff88082a42d100 task.stack: ffffc90002b3c000
[ 1458.537499] RIP: 0010:btrfs_retry_endio+0x7e/0x1a0 [btrfs]
...
[ 1458.543261] Call Trace:
[ 1458.543958] ? rcu_read_lock_sched_held+0xc4/0xd0
[ 1458.544374] bio_endio+0xed/0x100
[ 1458.544750] end_workqueue_fn+0x3c/0x40 [btrfs]
[ 1458.545257] normal_work_helper+0x9f/0x900 [btrfs]
[ 1458.545762] btrfs_endio_repair_helper+0x12/0x20 [btrfs]
[ 1458.546224] process_one_work+0x34d/0xb70
[ 1458.546570] ? process_one_work+0x29e/0xb70
[ 1458.546938] worker_thread+0x1cf/0x960
[ 1458.547263] ? process_one_work+0xb70/0xb70
[ 1458.547624] kthread+0x17d/0x180
[ 1458.547909] ? kthread_create_on_node+0x70/0x70
[ 1458.548300] ret_from_fork+0x31/0x40
It turns out that btrfs_retry_endio is trying to get inode from a directIO
page.
This fixes the problem by using the saved inode pointer, done->inode.
btrfs_retry_endio_nocsum has the same problem, and it's fixed as well.
Also cleanup unused @start (which is too trivial for a separate patch).
Cc: David Sterba <dsterba@suse.cz>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The opposite case was already handled right in the very next switch entry.
And also when turning on nossd, drop ssd_spread.
Reported-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com>
Signed-off-by: Adam Borowski <kilobyte@angband.pl>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It is not safe for one thread to modify the ->flags
of another thread as there is no locking that can protect
the update.
So tsk_restore_flags(), which takes a task pointer and modifies
the flags, is an invitation to do the wrong thing.
All current users pass "current" as the task, so no developers have
accepted that invitation. It would be best to ensure it remains
that way.
So rename tsk_restore_flags() to current_restore_flags() and don't
pass in a task_struct pointer. Always operate on current->flags.
Signed-off-by: NeilBrown <neilb@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This fixes Continuous Availability when errors during
file reopen are encountered.
cifs_user_readv and cifs_user_writev would wait for ever if
results of cifs_reopen_file are not stored and for later inspection.
In fact, results are checked and, in case of errors, a chain
of function calls leading to reads and writes to be scheduled in
a separate thread is skipped.
These threads will wake up the corresponding waiters once reads
and writes are done.
However, given the return value is not stored, when rc is checked
for errors a previous one (always zero) is inspected instead.
This leads to pending reads/writes added to the list, making
cifs_user_readv and cifs_user_writev wait for ever.
Signed-off-by: Germano Percossi <germano.percossi@citrix.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
CC: Stable <stable@vger.kernel.org>
Signed-off-by: Steve French <smfrench@gmail.com>
STATUS_BAD_NETWORK_NAME can be received during node failover,
causing the flag to be set and making the reconnect thread
always unsuccessful, thereafter.
Once the only place where it is set is removed, the remaining
bits are rendered moot.
Removing it does not prevent "mount" from failing when a non
existent share is passed.
What happens when the share really ceases to exist while the
share is mounted is undefined now as much as it was before.
Signed-off-by: Germano Percossi <germano.percossi@citrix.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
CC: Stable <stable@vger.kernel.org>
Signed-off-by: Steve French <smfrench@gmail.com>
In case of error, smb2_reconnect_server reschedule itself
with a delay, to avoid being too aggressive.
Signed-off-by: Germano Percossi <germano.percossi@citrix.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
CC: Stable <stable@vger.kernel.org>
Signed-off-by: Steve French <smfrench@gmail.com>
Commit 1a967d6c9b ("correctly to
anonymous authentication for the NTLM(v2) authentication") introduces
a regression in handling errors related to attempting a guest
connection to a Windows share which requires authentication. This
should result in a permission denied error but actually causes the
kernel module to enter a never-ending loop trying to follow a DFS
referal which doesn't exist.
The base cause of this is the failure now occurs later in the process
during tree connect and not at the session setup setup and all errors
in tree connect are interpreted as needing to follow the DFS paths
which isn't in this case correct. So, check the returned error against
EACCES and fail if this is returned error.
Feedback from Aurelien:
PS> net user guest /activate:no
PS> mkdir C:\guestshare
PS> icacls C:\guestshare /grant 'Everyone:(OI)(CI)F'
PS> new-smbshare -name guestshare -path C:\guestshare -fullaccess Everyone
I've tested v3.10, v4.4, master, master+your patch using default options
(empty or no user "NU") and user=abc (U).
NT_LOGON_FAILURE in session setup: LF
This is what you seem to have in 3.10.
NT_ACCESS_DENIED in tree connect to the share: AD
This is what you get before your infinite loop.
| NU U
--------------------------------
3.10 | LF LF
4.4 | LF LF
master | AD LF
master+patch | AD LF
No infinite DFS loop :(
All these issues result in mount failing very fast with permission denied.
I guess it could be from either the Windows version or the share/folder
ACL. A deeper analysis of the packets might reveal more.
In any case I did not notice any issues for on a basic DFS setup with
the patch so I don't think it introduced any regressions, which is
probably all that matters. It still bothers me a little I couldn't hit
the bug.
I've included kernel output w/ debugging output and network capture of
my tests if anyone want to have a look at it. (master+patch = ml-guestfix).
Signed-off-by: Mark Syms <mark.syms@citrix.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Tested-by: Aurelien Aptel <aaptel@suse.com>
Acked-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Currently during receiving a read response mid->resp_buf can be
NULL when it is being passed to cifs_discard_remaining_data() from
cifs_readv_discard(). Fix it by always passing server->smallbuf
instead and initializing mid->resp_buf at the end of read response
processing.
Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
CC: Stable <stable@vger.kernel.org>
Acked-by: Sachin Prabhu <sprabhu@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Add braces around variables used within macros for those make sense
to do it. Many of the macros in f2fs already do this. What this commit
doesn't do is anything that changes line# as a result of adding braces,
which usually affects the binary via __LINE__.
Confirmed no diff in fs/f2fs/f2fs.ko before/after this commit on x86_64,
to make sure this has no functional change as well as there's been no
unexpected side effect due to callers' arithmetics within the existing
code.
Signed-off-by: Tomohiro Kusumi <tkusumi@tuxera.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Callers are to unlock the page on failure after 86531d6b.
Signed-off-by: Tomohiro Kusumi <tkusumi@tuxera.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In f2fs_submit_discard_endio, we will wake up waiter before setting
discard command states, so waiter may use incorrect states. Change
the order between complete() and states setting to fix this issue.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Split f2fs_wait_discard_bios from f2fs_wait_discard_bio, just for cleanup,
no logic change.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Split discard_cmd_list to discard_{pend,wait}_list, so while sending/waiting
discard command, we can avoid traversing unneeded entries in original list.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This reverts commit 3436c4bdb3.
This makes a leak to register dirty segments. I reproduced the issue by
modified postmark which injects a lot of file create/delete/update and
finally triggers huge number of SSR allocations.
Cc: <stable@vger.kernel.org> # v4.10+
[Jaegeuk Kim: Change missing incorrect comment]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Pointer to ->free_mark callback unnecessarily occupies one long in each
fsnotify_mark although they are the same for all marks from one
notification group. Move the callback pointer to fsnotify_ops.
Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Currently we initialize mark->group only in fsnotify_add_mark_lock().
However we will need to access fsnotify_ops of corresponding group from
fsnotify_put_mark() so we need mark->group initialized earlier. Do that
in fsnotify_init_mark() which has a consequence that once
fsnotify_init_mark() is called on a mark, the mark has to be destroyed
by fsnotify_put_mark().
Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
inode_mark.c now contains only a single function. Move it to
fs/notify/fsnotify.c and remove inode_mark.c.
Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
These are very thin wrappers, just remove them. Drop
fs/notify/vfsmount_mark.c as it is empty now.
Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
The function is already mostly contained in what
fsnotify_clear_marks_by_group() does. Just update that function to not
select marks when all of them should be destroyed and remove
fsnotify_detach_group_marks().
Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
The _flags() suffix in the function name was more confusing than
explaining so just remove it. Also rename the argument from 'flags' to
'type' to better explain what the function expects.
Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Suggested-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Inline these helpers as they are very thin. We still keep them as we
don't want to expose details about how list type is determined.
Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
These helpers are just very thin wrappers now. Remove them.
Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
These helpers are now only a simple assignment and just obfuscate
what is going on. Remove them.
Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
When userspace task processing fanotify permission events screws up and
does not respond, fsnotify_mark_srcu SRCU is held indefinitely which
causes further hangs in the whole notification subsystem. Although we
cannot easily solve the problem of operations blocked waiting for
response from userspace, we can at least somewhat localize the damage by
dropping SRCU lock before waiting for userspace response and reacquiring
it when userspace responds.
Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Pass fsnotify_iter_info into ->handle_event() handler so that it can
release and reacquire SRCU lock via fsnotify_prepare_user_wait() and
fsnotify_finish_user_wait() functions. These functions also make sure
current marks are appropriately pinned so that iteration protected by
srcu in fsnotify() stays safe.
Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
fanotify wants to drop fsnotify_mark_srcu lock when waiting for response
from userspace so that the whole notification subsystem is not blocked
during that time. This patch provides a framework for safely getting
mark reference for a mark found in the object list which pins the mark
in that list. We can then drop fsnotify_mark_srcu, wait for userspace
response and then safely continue iteration of the object list once we
reaquire fsnotify_mark_srcu.
Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Currently we queue all marks for destruction on group shutdown and then
destroy them from fsnotify_destroy_group() instead from a worker thread
which is the usual path. However worker can already be processing some
list of marks to destroy so this does not make 100% all marks are really
destroyed by the time group is shut down. This isn't a big problem as
each mark holds group reference and thus group stays partially alive
until all marks are really freed but there's no point in complicating
our lives - just wait for the delayed work to be finished instead.
Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Instead of removing mark from object list from fsnotify_detach_mark(),
remove the mark when last reference to the mark is dropped. This will
allow fanotify to wait for userspace response to event without having to
hold onto fsnotify_mark_srcu.
To avoid pinning inodes by elevated refcount (and thus e.g. delaying
file deletion) while someone holds mark reference, we detach connector
from the object also from fsnotify_destroy_marks() and not only after
removing last mark from the list as it was now.
Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Currently we queue mark into a list of marks for destruction in
__fsnotify_free_mark() and keep the last mark reference dangling. After the
worker waits for SRCU period, it drops the last reference to the mark
which frees it. This scheme has the disadvantage that if we hold
reference to a mark and drop and reacquire SRCU lock, the mark can get
freed immediately which is slightly inconvenient and we will need to
avoid this in the future.
Move to a scheme where queueing of mark into a list of marks for
destruction happens when the last reference to the mark is dropped. Also
drop reference to the mark held by group list already when mark is
removed from that list instead of dropping it only from the destruction
worker.
Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Dropping mark reference can result in mark being freed. Although it
should not happen in inotify_remove_from_idr() since caller should hold
another reference, just don't risk lock up just after WARN_ON
unnecessarily. Also fold do_inotify_remove_from_idr() into the single
callsite as that function really is just two lines of real code.
Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Currently we free fsnotify_mark_connector structure only when inode /
vfsmount is getting freed. This can however impose noticeable memory
overhead when marks get attached to inodes only temporarily. So free the
connector structure once the last mark is detached from the object.
Since notification infrastructure can be working with the connector
under the protection of fsnotify_mark_srcu, we have to be careful and
free the fsnotify_mark_connector only after SRCU period passes.
Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
So far list of marks attached to an object (inode / vfsmount) was
protected by i_lock or mnt_root->d_lock. This dictates that the list
must be empty before the object can be destroyed although the list is
now anchored in the fsnotify_mark_connector structure. Protect the list
by a spinlock in the fsnotify_mark_connector structure to decouple
lifetime of a list of marks from a lifetime of the object. This also
simplifies the code quite a bit since we don't have to differentiate
between inode and vfsmount lists in quite a few places anymore.
Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
After removing all the indirection it is clear that
hlist_del_init_rcu(&mark->obj_list);
in fsnotify_destroy_marks() is not needed as the mark gets removed from
the list shortly afterwards in fsnotify_destroy_mark() ->
fsnotify_detach_mark() -> fsnotify_detach_from_object(). Also there is
no problem with mark being visible on object list while we call
fsnotify_destroy_mark() as parallel destruction of marks from several
places is properly handled (as mentioned in the comment in
fsnotify_destroy_marks(). So just remove the list removal and also the
stale comment.
Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
We lock object list lock in fsnotify_detach_from_object() twice - once
to detach mark and second time to recalculate mask. That is unnecessary
and later it will become problematic as we will free the connector as
soon as there is no mark in it. So move recalculation of fsnotify mask
into the same critical section that is detaching mark.
This also removes recalculation of child dentry flags from
fsnotify_detach_from_object(). That is however fine. Those marks will
get recalculated once some event happens on a child.
Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
fsnotify_detach_mark() calls fsnotify_destroy_inode_mark() or
fsnotify_destroy_vfsmount_mark() to remove mark from object list. These
two functions are however very similar and differ only in the lock they
use to protect the object list of marks. Simplify the code by removing
the indirection and removing mark from the object list in a common
function.
Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Instead of passing spinlock into fsnotify_destroy_marks() determine it
directly in that function from the connector type. This will reduce code
churn when changing lock protecting list of marks.
Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Move locking of a mark list into fsnotify_find_mark(). This reduces code
churn in the following patch changing lock protecting the list.
Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Move locking of locks protecting a list of marks into
fsnotify_recalc_mask(). This reduces code churn in the following patch
which changes the lock protecting the list of marks.
Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Move fsnotify_destroy_marks() to be later in the fs/notify/mark.c. It
will need some functions that are declared after its current
declaration. No functional change.
Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Adding notification mark to object list has been currently done through
fsnotify_add_{inode|vfsmount}_mark() helpers from
fsnotify_add_mark_locked() which call fsnotify_add_mark_list(). Remove
this unnecessary indirection to simplify the code.
Pushing all the locking to fsnotify_add_mark_list() also allows us to
allocate the connector structure with GFP_KERNEL mode.
Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Currently inode reference is held by fsnotify marks. Change the rules so
that inode reference is held by fsnotify_mark_connector structure
whenever the list is non-empty. This simplifies the code and is more
logical.
Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Move pointer to inode / vfsmount from mark itself to the
fsnotify_mark_connector structure. This is another step on the path
towards decoupling inode / vfsmount lifetime from notification mark
lifetime.
Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Currently notification marks are attached to object (inode or vfsmnt) by
a hlist_head in the object. The list is also protected by a spinlock in
the object. So while there is any mark attached to the list of marks,
the object must be pinned in memory (and thus e.g. last iput() deleting
inode cannot happen). Also for list iteration in fsnotify() to work, we
must hold fsnotify_mark_srcu lock so that mark itself and
mark->obj_list.next cannot get freed. Thus we are required to wait for
response to fanotify events from userspace process with
fsnotify_mark_srcu lock held. That causes issues when userspace process
is buggy and does not reply to some event - basically the whole
notification subsystem gets eventually stuck.
So to be able to drop fsnotify_mark_srcu lock while waiting for
response, we have to pin the mark in memory and make sure it stays in
the object list (as removing the mark waiting for response could lead to
lost notification events for groups later in the list). However we don't
want inode reclaim to block on such mark as that would lead to system
just locking up elsewhere.
This commit is the first in the series that paves way towards solving
these conflicting lifetime needs. Instead of anchoring the list of marks
directly in the object, we anchor it in a dedicated structure
(fsnotify_mark_connector) and just point to that structure from the
object. The following commits will also add spinlock protecting the list
and object pointer to the structure.
Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Add a comment that lifetime of a notification mark is protected by SRCU
and remove a comment about clearing of marks attached to the inode. It
is stale and more uptodate version is at fsnotify_destroy_marks() which
is the function handling this case.
Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Pull CIFS fixes from Steve French:
"This is a set of CIFS/SMB3 fixes for stable.
There is another set of four SMB3 reconnect fixes for stable in
progress but they are still being reviewed/tested, so didn't want to
wait any longer to send these five below"
* 'for-next' of git://git.samba.org/sfrench/cifs-2.6:
Reset TreeId to zero on SMB2 TREE_CONNECT
CIFS: Fix build failure with smb2
Introduce cifs_copy_file_range()
SMB3: Rename clone_range to copychunk_range
Handle mismatched open calls
Here are 3 small fixes for 4.11-rc6. One resolves a reported issue with
sysfs files that NeilBrown found, one is a documenatation fix for the
stable kernel rules, and the last is a small MAINTAINERS file update for
kernfs.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCWOnrMw8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+yk/JQCfQKjOpGDAR9Hs6u4YQ4hJrAHFneYAn1F4MLDW
3b0ZMnlZHkDq834UwKnB
=iiei
-----END PGP SIGNATURE-----
Merge tag 'driver-core-4.11-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core fixes from Greg KH:
"Here are 3 small fixes for 4.11-rc6.
One resolves a reported issue with sysfs files that NeilBrown found,
one is a documenatation fix for the stable kernel rules, and the last
is a small MAINTAINERS file update for kernfs"
* tag 'driver-core-4.11-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
MAINTAINERS: separate out kernfs maintainership
sysfs: be careful of error returns from ops->show()
Documentation: stable-kernel-rules: fix stable-tag format
Pull VFS fixes from Al Viro:
"statx followup fixes and a fix for stack-smashing on alpha"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
alpha: fix stack smashing in old_adjtimex(2)
statx: Include a mask for stx_attributes in struct statx
statx: Reserve the top bit of the mask for future struct expansion
xfs: report crtime and attribute flags to statx
ext4: Add statx support
statx: optimize copy of struct statx to userspace
statx: remove incorrect part of vfs_statx() comment
statx: reject unknown flags when using NULL path
Documentation/filesystems: fix documentation for ->getattr()
This gets us support for non-discard efficient write of zeroes (e.g. NVMe)
and prepares for removing the discard_zeroes_data flag.
Also remove a pointless discard support check, which is done in
blkdev_issue_discard already.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Turn the existing discard flag into a new BLKDEV_ZERO_UNMAP flag with
similar semantics, but without referring to diѕcard.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
ops->show() can return a negative error code.
Commit 65da3484d9 ("sysfs: correctly handle short reads on PREALLOC attrs.")
(in v4.4) caused this to be stored in an unsigned 'size_t' variable, so errors
would look like large numbers.
As a result, if an error is returned, sysfs_kf_read() will return the
value of 'count', typically 4096.
Commit 17d0774f80 ("sysfs: correctly handle read offset on PREALLOC attrs")
(in v4.8) extended this error to use the unsigned large 'len' as a size for
memmove().
Consequently, if ->show returns an error, then the first read() on the
sysfs file will return 4096 and could return uninitialized memory to
user-space.
If the application performs a subsequent read, this will trigger a memmove()
with extremely large count, and is likely to crash the machine is bizarre ways.
This bug can currently only be triggered by reading from an md
sysfs attribute declared with __ATTR_PREALLOC() during the
brief period between when mddev_put() deletes an mddev from
the ->all_mddevs list, and when mddev_delayed_delete() - which is
scheduled on a workqueue - completes.
Before this, an error won't be returned by the ->show()
After this, the ->show() won't be called.
I can reproduce it reliably only by putting delay like
usleep_range(500000,700000);
early in mddev_delayed_delete(). Then after creating an
md device md0 run
echo clear > /sys/block/md0/md/array_state; cat /sys/block/md0/md/array_state
The bug can be triggered without the usleep.
Fixes: 65da3484d9 ("sysfs: correctly handle short reads on PREALLOC attrs.")
Fixes: 17d0774f80 ("sysfs: correctly handle read offset on PREALLOC attrs")
Cc: stable@vger.kernel.org
Signed-off-by: NeilBrown <neilb@suse.com>
Acked-by: Tejun Heo <tj@kernel.org>
Reported-and-tested-by: Miroslav Benes <mbenes@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Merge misc fixes from Andrew Morton:
"10 fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
mm: move pcp and lru-pcp draining into single wq
mailmap: update Yakir Yang email address
mm, swap_cgroup: reschedule when neeed in swap_cgroup_swapoff()
dax: fix radix tree insertion race
mm, thp: fix setting of defer+madvise thp defrag mode
ptrace: fix PTRACE_LISTEN race corrupting task->state
vmlinux.lds: add missing VMLINUX_SYMBOL macros
mm/page_alloc.c: fix print order in show_free_areas()
userfaultfd: report actual registered features in fdinfo
mm: fix page_vma_mapped_walk() for ksm pages
While running generic/340 in my test setup I hit the following race. It
can happen with kernels that support FS DAX PMDs, so v4.10 thru
v4.11-rc5.
Thread 1 Thread 2
-------- --------
dax_iomap_pmd_fault()
grab_mapping_entry()
spin_lock_irq()
get_unlocked_mapping_entry()
'entry' is NULL, can't call lock_slot()
spin_unlock_irq()
radix_tree_preload()
dax_iomap_pmd_fault()
grab_mapping_entry()
spin_lock_irq()
get_unlocked_mapping_entry()
...
lock_slot()
spin_unlock_irq()
dax_pmd_insert_mapping()
<inserts a PMD mapping>
spin_lock_irq()
__radix_tree_insert() fails with -EEXIST
<fall back to 4k fault, and die horribly
when inserting a 4k entry where a PMD exists>
The issue is that we have to drop mapping->tree_lock while calling
radix_tree_preload(), but since we didn't have a radix tree entry to
lock (unlike in the pmd_downgrade case) we have no protection against
Thread 2 coming along and inserting a PMD at the same index. For 4k
entries we handled this with a special-case response to -EEXIST coming
from the __radix_tree_insert(), but this doesn't save us for PMDs
because the -EEXIST case can also mean that we collided with a 4k entry
in the radix tree at a different index, but one that is covered by our
PMD range.
So, correctly handle both the 4k and 2M collision cases by explicitly
re-checking the radix tree for an entry at our index once we reacquire
mapping->tree_lock.
This patch has made it through a clean xfstests run with the current
v4.11-rc5 based linux/master, and it also ran generic/340 500 times in a
loop. It used to fail within the first 10 iterations.
Link: http://lkml.kernel.org/r/20170406212944.2866-1-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: <stable@vger.kernel.org> [4.10+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
fdinfo for userfault file descriptor reports UFFD_API_FEATURES. Up
until recently, the UFFD_API_FEATURES was defined as 0, therefore
corresponding field in fdinfo always contained zero. Now, with
introduction of several additional features, UFFD_API_FEATURES is not
longer 0 and it seems better to report actual features requested for the
userfaultfd object described by the fdinfo.
First, the applications that were using userfault will still see zero at
the features field in fdinfo. Next, reporting actual features rather
than available features, gives clear indication of what userfault
features are used by an application.
Link: http://lkml.kernel.org/r/1491140181-22121-1-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Pavel Emelyanov <xemul@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Without this fix (and another to the userspace component itself
described later), the kernel will be unable to process any OrangeFS
requests after the userspace component is restarted (due to a crash or
at the administrator's behest).
The bug here is that inside orangefs_remount, the orangefs_request_mutex
is locked. When the userspace component restarts while the filesystem
is mounted, it sends a ORANGEFS_DEV_REMOUNT_ALL ioctl to the device,
which causes the kernel to send it a few requests aimed at synchronizing
the state between the two. While this is happening the
orangefs_request_mutex is locked to prevent any other requests going
through.
This is only half of the bugfix. The other half is in the userspace
component which outright ignores(!) requests made before it considers
the filesystem remounted, which is after the ioctl returns. Of course
the ioctl doesn't return until after the userspace component responds to
the request it ignores. The userspace component has been changed to
allow ORANGEFS_VFS_OP_FEATURES regardless of the mount status.
Mike Marshall says:
"I've tested this patch against the fixed userspace part. This patch is
real important, I hope it can make it into 4.11...
Here's what happens when the userspace daemon is restarted, without
the patch:
=============================================
[ INFO: possible recursive locking detected ]
[ 4.10.0-00007-ge98bdb3 #1 Not tainted ]
---------------------------------------------
pvfs2-client-co/29032 is trying to acquire lock:
(orangefs_request_mutex){+.+.+.}, at: service_operation+0x3c7/0x7b0 [orangefs]
but task is already holding lock:
(orangefs_request_mutex){+.+.+.}, at: dispatch_ioctl_command+0x1bf/0x330 [orangefs]
CPU: 0 PID: 29032 Comm: pvfs2-client-co Not tainted 4.10.0-00007-ge98bdb3 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-1.fc25 04/01/2014
Call Trace:
__lock_acquire+0x7eb/0x1290
lock_acquire+0xe8/0x1d0
mutex_lock_killable_nested+0x6f/0x6e0
service_operation+0x3c7/0x7b0 [orangefs]
orangefs_remount+0xea/0x150 [orangefs]
dispatch_ioctl_command+0x227/0x330 [orangefs]
orangefs_devreq_ioctl+0x29/0x70 [orangefs]
do_vfs_ioctl+0xa3/0x6e0
SyS_ioctl+0x79/0x90"
Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Acked-by: Mike Marshall <hubcap@omnibond.com>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We've added a considerable amount of fixes for stalls and issues
with the blk-mq scheduling in the 4.11 series since forking
off the for-4.12/block branch. We need to do improvements on
top of that for 4.12, so pull in the previous fixes to make
our lives easier going forward.
Signed-off-by: Jens Axboe <axboe@fb.com>
Commit e7d316a02f ("sysctl: handle error writing UINT_MAX to u32
fields") introduced the proc_douintvec helper function, but it forgot to
add the related sanity check when doing register_sysctl_table. So add
it now.
Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Cc: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently the cifs module breaks the CIFS specs on reconnect as
described in http://msdn.microsoft.com/en-us/library/cc246529.aspx:
"TreeId (4 bytes): Uniquely identifies the tree connect for the
command. This MUST be 0 for the SMB2 TREE_CONNECT Request."
Signed-off-by: Jan-Marek Glogowski <glogow@fbihome.de>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Tested-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <smfrench@gmail.com>
CC: Stable <stable@vger.kernel.org>
I saw the following build error during a randconfig build:
fs/cifs/smb2ops.c: In function 'smb2_new_lease_key':
fs/cifs/smb2ops.c:1104:2: error: implicit declaration of function 'generate_random_uuid' [-Werror=implicit-function-declaration]
Explicit include the right header to fix this issue.
Signed-off-by: Tobias Regnery <tobias.regnery@gmail.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <smfrench@gmail.com>
The earlier changes to copy range for cifs unintentionally disabled the more
common form of server side copy.
The patch introduces the file_operations helper cifs_copy_file_range()
which is used by the syscall copy_file_range. The new file operations
helper allows us to perform server side copies for SMB2.0 and 2.1
servers as well as SMB 3.0+ servers which do not support the ioctl
FSCTL_DUPLICATE_EXTENTS_TO_FILE.
The new helper uses the ioctl FSCTL_SRV_COPYCHUNK_WRITE to perform
server side copies. The helper is called by vfs_copy_file_range() only
once an attempt to clone the file using the ioctl
FSCTL_DUPLICATE_EXTENTS_TO_FILE has failed.
Signed-off-by: Sachin Prabhu <sprabhu@redhat.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
CC: Stable <stable@vger.kernel.org>
Signed-off-by: Steve French <smfrench@gmail.com>
Server side copy is one of the most important mechanisms smb2/smb3
supports and it was unintentionally disabled for most use cases.
Renaming calls to reflect the underlying smb2 ioctl called. This is
similar to the name duplicate_extents used for a similar ioctl which is
also used to duplicate files by reusing fs blocks. The name change is to
avoid confusion.
Signed-off-by: Sachin Prabhu <sprabhu@redhat.com>
CC: Stable <stable@vger.kernel.org>
Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
A signal can interrupt a SendReceive call which result in incoming
responses to the call being ignored. This is a problem for calls such as
open which results in the successful response being ignored. This
results in an open file resource on the server.
The patch looks into responses which were cancelled after being sent and
in case of successful open closes the open fids.
For this patch, the check is only done in SendReceive2()
RH-bz: 1403319
Signed-off-by: Sachin Prabhu <sprabhu@redhat.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Cc: Stable <stable@vger.kernel.org>
Apparently FIEMAP for xattrs has been broken since we switched to
the iomap backend because of an incorrect check for xattr presence.
Also fix the broken locking.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
No one cares about the low-level helper anymore.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
- Rework the inline directory verifier to avoid crashes on disk corruption
- Don't change file size when punching holes w/ KEEP_SIZE
- Close a kernel memory exposure bug
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABCgAGBQJY4qGYAAoJEPh/dxk0SrTr3mwQAIDqc1RlZThYETn5Mru9BeQ0
NmiDbgO1394OSxSxpBVZCVW1jU23j3eXbOgO0oD8iEOySdTOwAoCxb78aYUkHsbS
wKKvix2kprIfsAYfGW264MZjX2JYiBUTjhV8dYw9UB5Kot3TPvc+IgC7ICJtcb08
y+/ycjbgMwlWxOQwkReaPhVYhKXa3/vqLBNN0E9oacSdpT10Mkb95RDxwFpe5zX2
RjPYode8RHsoI9+OdvVwvbmPrvylxEt3jKZdnStRMjmAX/X2aUNVDYSCOSpNg+sz
9kXec96kkoJ4oQVfiaEo0k4LWlTZnDSFe25CX8Gxu7yPgGt8XsnDADkieGBdVcYq
2q3AfEpmQrYgnLFRu4D3A8KEX8IrP8MLAh/zY/heVp9fTx9uhDvnjF+f+F6mMYXF
wzEf27RN+DeaXRLbHp35NOKPriT63JpRB/Hsgew6ibEiVPdy8J1+VYoJSohixr4d
yryDiwtph25spMYxeH6HIiPCeWa6fMr08vwuasWJTEZirxYmF/ZEHTK/bU+ydlDp
R9/UDTmiP0V10xO09bjLgFiREIO/b1fiOoniAMhhrT4tW2FL31hRpzl8jPZ++2X/
W2xLMJ1iyVw/Cf1NDWGFIOFX6Y+lnN6dZHftNUno6MFlmdmajNaTbRTBiSGnPH4U
URH7Dxpmkgp7xuNs8xqv
=QdXx
-----END PGP SIGNATURE-----
Merge tag 'xfs-4.11-fixes-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull XFS fixes from Darrick Wong:
"Here are three more fixes for 4.11.
The first one reworks the inline directory verifier to check the
working copy of the directory metadata and to avoid triggering a
periodic crash in xfs/348. The second patch fixes a regression in hole
punching at EOF that corrupts files; and the third patch closes a
kernel memory disclosure bug.
Summary:
- rework the inline directory verifier to avoid crashes on disk
corruption
- don't change file size when punching holes w/ KEEP_SIZE
- close a kernel memory exposure bug"
* tag 'xfs-4.11-fixes-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: fix kernel memory exposure problems
xfs: Honor FALLOC_FL_KEEP_SIZE when punching ends of files
xfs: rework the inline directory verifiers
-----BEGIN PGP SIGNATURE-----
iQIVAwUAWOYUWfSw1s6N8H32AQIqdg/9Hi+47eues/TBbogP8eRrqVEoNHFy75e/
MMTFe0/Qio7ps78VuOSThbqh96dzIX5K5/7JdiHZyQk2QCTaJ2BvheCUISQovhFl
yuAJcBhkO5iiQkR0agYdHVjIQGRth3usNIEyD1rm1DS/lr8ec9/iyjoipKpsZmxt
WlRF3eGgqA+cLpH4K+k4x/LJwIl8868MBz58p6XXW2yZFRygQzYHmMobhDwgLoC2
C2lHPEyllK7qcIaZD7SI/a2/bMwh7QTx1tJuQK3DgtJrAHigx96uxH3jqECk7fLg
EhjLqIFmWVCUcrBbUqjlNtcuevzxCZTCCB0LAZgmOTyEyCFJzgoQmQo97VhMPbG5
JF9bKg+JE6P5iwqtTBEW9p+LoyM7VAt6SzeuKH/vNAVGHc0ULDMB8XPYF3Nvqa3L
RGIwcxWCAItZFdDCUvWgTyEuZtVXu6LbuvnU1HkaXlXsLLi4041MQgcekY58k6kv
z4YnXojy0+mciJ4WV/7CLfNMyP36G0gwLugjLAsJwigxJoOTtsfphkGcpWlP9hBm
IyTFJw0qbbuQD7fprfw//e+IgfjDQbxYMQKxQaJZflXzYDCab8PQkFTl1kWPrJSR
yR0rk8wKb7Z1fyl/zpUNbw7KdFganqhkZ6jbOrr9G8Hp8jyubnwMCI5B2rSZfe1V
CSOtCbEt8Is=
=JnXV
-----END PGP SIGNATURE-----
Merge tag 'rxrpc-rewrite-20170406' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
David Howells says:
====================
rxrpc: Miscellany
Here's a set of patches that make some minor changes to AF_RXRPC:
(1) Store error codes in struct rxrpc_call::error as negative codes and
only convert to positive in recvmsg() to avoid confusion inside the
kernel.
(2) Note the result of trying to abort a call (this fails if the call is
already 'completed').
(3) Don't abort on temporary errors whilst processing challenge and
response packets, but rather drop the packet and wait for
retransmission.
And also adds some more tracing:
(4) Protocol errors.
(5) Received abort packets.
(6) Changes in the Rx window size due to ACK packet information.
(7) Client call initiation (to allow the rxrpc_call struct pointer, the
wire call ID and the user ID/afs_call pointer to be cross-referenced).
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Mostly simple cases of overlapping changes (adding code nearby,
a function whose name changes, for example).
Signed-off-by: David S. Miller <davem@davemloft.net>
Use negative error codes in struct rxrpc_call::error because that's what
the kernel normally deals with and to make the code consistent. We only
turn them positive when transcribing into a cmsg for userspace recvmsg.
Signed-off-by: David Howells <dhowells@redhat.com>
Since callers statically know which type to use, make_dentry_ptr()
can simply be splitted into two inline functions. This way, the code
has less inlined, fewer arguments, and no cast.
Signed-off-by: Tomohiro Kusumi <tkusumi@tuxera.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch tries to split in-place-update bios from sequential bios.
Suggested-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The variable 'i' has been defined before, so here we can
use it directly.
Signed-off-by: Kaixu Xia <xiakaixu@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If two threads try to flush dirty pages in different inodes respectively,
f2fs_write_data_pages() will produce WRITE and WRITE_SYNC one at a time,
resulting in a lot of 4KB seperated IOs.
So, this patch gives higher priority to WB_SYNC_ALL IOs and gathers write
IOs with a big WRITE_SYNC'ed bio.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
It would better split small and large IOs separately in order to get more
consecutive big writes.
The default threshold is set to 64KB, but configurable by sysfs/min_hot_blocks.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch changes to use bitmap instead of extent in struct discard_entry
to indicate discard range in one segment, for fragmented space, this
implementation can save memory footprint.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Adds to count discard command entry and show the number in debugfs,
also fix to add cost of discard command cache into total comsumed
memory footprint.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Show historical count of flush command and discard command.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Commit 86066914ed "gfs2: Don't support
fallocate on jdata files" removed the ability of gfs2_grow to reserve
space at the end of the rindex, which could prevent a second gfs2_grow
from succeeding if the fs is full. Allow fallocate to work on the rindex
once again.
Signed-off-by: Andrew Price <anprice@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
dquot_writeback_dquots() expects s_umount semaphore to be held to
protect it from other concurrent quota operations. reiserfs_sync_fs()
can call dquot_writeback_dquots() without holding s_umount semaphore
when called from flush_old_commits().
Fix the problem by grabbing s_umount in flush_old_commits(). However we
have to be careful and use only trylock since reiserfs_cancel_old_sync()
can be waiting for flush_old_commits() to complete while holding
s_umount semaphore. Possible postponing of sync work is not a big deal
though as that is only an opportunistic flush.
Fixes: 9d1ccbe70e
Reported-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Currently canceling of delayed work that flushes old data using
cancel_old_flush() does not prevent work from being requeued. Thus
in theory new work can be queued after cancel_old_flush() from
reiserfs_freeze() has run. This will become larger problem once
flush_old_commits() can requeue the work itself.
Fix the problem by recording in sbi->work_queue that flushing work is
canceled and should not be requeued.
Signed-off-by: Jan Kara <jack@suse.cz>
ext2_sync_fs() could be called without s_umount semaphore held when
called through ext2_write_super() from __ext2_write_inode(). This
function then calls dquot_writeback_dquots() which relies on s_umount to
be held for protection against other quota operations.
In fact __ext2_write_inode() does not need all the functionality
ext2_write_super() provides. It is enough to just write the superblock.
So use ext2_sync_super() instead.
Fixes: 9d1ccbe70e
Reported-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Use the realtime bitmap to return free space information via getfsmap.
Eventually this will be superseded by the realtime rmapbt code.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
If the reverse-mapping btree isn't available, fall back to the
free space btrees to provide partial reverse mapping information.
The online scrub tool can make use of even partial information to
speed up the data block scan.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Introduce a new ioctl that uses the reverse mapping btree to return
information about the physical layout of the filesystem.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Add _query_range and _query_all functions to the realtime bitmap
allocator. These two functions are similar in usage to the btree
functions with the same name and will be used for getfsmap and scrub.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Create a helper function that will query all records in a btree.
This will be used by the online repair functions to examine every
record in a btree to rebuild a second btree.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Implement a query_range function for the bnobt and cntbt. This will
be used for getfsmap fallback if there is no rmapbt and by the online
scrub and repair code.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Plumb in the pieces (init_high_key, diff_two_keys) necessary to call
query_range on the free space btrees. Remove the debugging asserts
so that we can make queries starting from block 0.
While we're at it, merge the redundant "if (btnum ==" hunks.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
In xfs_ioc_getbmap, we should only copy the fields of struct getbmap
from userspace, or else we end up copying random stack contents into the
kernel. struct getbmap is a strict subset of getbmapx, so a partial
structure copy should work fine.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
This function has been removed ever since at least 3.12-era. No need to
keep its declaration in the header so nuke it.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
"xfs_iread: validation failed for inode 96 failed"
One "failed" seems like enough.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Opencoding the trivial checks makes it much easier to read (and grep..).
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
This checks for all the non-normal extent types, including handling both
encodings of delayed allocations.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The log covering background task used to be part of the xfssyncd
workqueue. That workqueue was removed as of commit 5889608df ("xfs:
syncd workqueue is no more") and the associated work item scheduled
to the xfs-log wq. The latter is used for log buffer I/O completion.
Since xfs_log_worker() can invoke a log flush, a deadlock is
possible between the xfs-log and xfs-cil workqueues. Consider the
following codepath from xfs_log_worker():
xfs_log_worker()
xfs_log_force()
_xfs_log_force()
xlog_cil_force()
xlog_cil_force_lsn()
xlog_cil_push_now()
flush_work()
The above is in xfs-log wq context and blocked waiting on the
completion of an xfs-cil work item. Concurrently, the cil push in
progress can end up blocked here:
xlog_cil_push_work()
xlog_cil_push()
xlog_write()
xlog_state_get_iclog_space()
xlog_wait(&log->l_flush_wait, ...)
The above is in xfs-cil context waiting on log buffer I/O
completion, which executes in xfs-log wq context. In this scenario
both workqueues are deadlocked waiting on eachother.
Add a new workqueue specifically for the high level log covering and
ail pushing worker, as was the case prior to commit 5889608df.
Diagnosed-by: David Jeffery <djeffery@redhat.com>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>